Felix - I'm not sure I understand your example about pipeline models, could you elaborate? I'm talking about the `findSynonyms` methods, which AFAIK have nothing to do with pipeline models.
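For concreteness, here is a minimal sketch of the call pattern in question, written as script-style statements (for example, for a Scala REPL with Spark on the classpath). The toy sentences, the column names "text" and "features", and the local master setting are assumptions made up for illustration; only `Word2Vec`, `fit`, and `findSynonyms` come from the ml API being discussed. It shows that `findSynonyms` is called on the fitted model itself, outside any `Pipeline` or `transform()`, and that the DataFrame result has to be collected before it can be used locally.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

// Assumption: a local session and toy data, purely for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("findSynonymsSketch")
  .getOrCreate()

// Each row holds one tokenized sentence in a column named "text" (hypothetical name).
val docs = spark.createDataFrame(Seq(
  "spark ml word2vec returns a dataframe".split(" "),
  "spark mllib word2vec returns an array".split(" ")
).map(Tuple1.apply)).toDF("text")

val model = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("features")
  .setVectorSize(8)
  .setMinCount(0)
  .fit(docs)

// Called directly on the fitted model; no Pipeline or transform() is involved.
// The result is a DataFrame with columns "word" and "similarity".
val synonymsDf = model.findSynonyms("spark", 2)

// To use the results locally, they have to be collected back to the driver.
val synonyms: Array[(String, Double)] = synonymsDf.collect().map { row =>
  (row.getAs[String]("word"), row.getAs[Double]("similarity"))
}

synonyms.foreach { case (word, similarity) => println(s"$word\t$similarity") }

spark.stop()
```

The `collect()` step is the round trip that the proposal to return `Array[(String, Double)]` would avoid.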
Joseph - Cool, thanks, I'll PR something in the next few days (and reopen SPARK-17629 <https://issues.apache.org/jira/browse/SPARK-17629>)

On Fri, Jan 6, 2017 at 12:33 AM, Joseph Bradley <jos...@databricks.com> wrote:

> We returned a DataFrame since it is a nicer API, but I agree forcing RDD
> operations is not ideal. I'd be OK with adding a new method, but I agree
> with Felix that we cannot break the API for something like this.
>
> On Thu, Jan 5, 2017 at 12:44 PM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> Given how Word2Vec is used the pipeline model in the new ml
>> implementation, we might need to keep the current behavior?
>>
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala
>>
>> _____________________________
>> From: Asher Krim <ak...@hubspot.com>
>> Sent: Tuesday, January 3, 2017 11:58 PM
>> Subject: Re: ml word2vec finSynonyms return type
>> To: Felix Cheung <felixcheun...@hotmail.com>
>> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <jos...@databricks.com>, <dev@spark.apache.org>
>>
>> The jira: https://issues.apache.org/jira/browse/SPARK-17629
>>
>> Adding new methods could result in method clutter. Changing behavior of
>> non-experimental classes is unfortunate (ml Word2Vec was marked
>> Experimental until Spark 2.0). Neither option is great. If I had to pick, I
>> would rather change the existing methods to keep the class simpler moving
>> forward.
>>
>> On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <felixcheun...@hotmail.com>
>> wrote:
>>
>>> Could you link to the JIRA here?
>>>
>>> What you suggest makes sense to me. Though we might want to maintain
>>> compatibility and add a new method instead of changing the return type of
>>> the existing one.
>>>
>>> _____________________________
>>> From: Asher Krim <ak...@hubspot.com>
>>> Sent: Wednesday, December 28, 2016 11:52 AM
>>> Subject: ml word2vec finSynonyms return type
>>> To: <dev@spark.apache.org>
>>> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <jos...@databricks.com>
>>>
>>> Hey all,
>>>
>>> I would like to propose changing the return type of `findSynonyms` in
>>> ml's Word2Vec
>>> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>:
>>>
>>> def findSynonyms(word: String, num: Int): DataFrame = {
>>>   val spark = SparkSession.builder().getOrCreate()
>>>   spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
>>> }
>>>
>>> I find it very strange that the results are parallelized before being
>>> returned to the user. The results are already on the driver to begin with,
>>> and I can imagine that for most usecases (and definitely for ours) the
>>> synonyms are collected right back to the driver. This incurs both an added
>>> cost of shipping data to and from the cluster, as well as a more cumbersome
>>> interface than needed.
>>>
>>> Can we change it to just the following?
>>>
>>> def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>>>   wordVectors.findSynonyms(word, num)
>>> }
>>>
>>> If the user wants the results parallelized, they can still do so on
>>> their own.
>>>
>>> (I had brought this up a while back in Jira. It was suggested that the
>>> mailing list would be a better forum to discuss it, so here we are.)
>>>
>>> Thanks,
>>> --
>>> Asher Krim
>>> Senior Software Engineer
>>>
>>
>
> --
> Joseph Bradley
> Software Engineer - Machine Learning
> Databricks, Inc.
> [image: http://databricks.com] <http://databricks.com/>
>

--
Asher Krim
Senior Software Engineer