It took me a while, but I finally got around to this:
https://github.com/apache/spark/pull/16811/files
On Fri, Jan 6, 2017 at 4:03 AM, Asher Krim <ak...@hubspot.com> wrote:

> Felix - I'm not sure I understand your example about pipeline models;
> could you elaborate? I'm talking about the `findSynonyms` methods, which
> AFAIK have nothing to do with pipeline models.
>
> Joseph - Cool, thanks. I'll PR something in the next few days (and reopen
> SPARK-17629 <https://issues.apache.org/jira/browse/SPARK-17629>).
>
> On Fri, Jan 6, 2017 at 12:33 AM, Joseph Bradley <jos...@databricks.com> wrote:
>
>> We returned a DataFrame since it is a nicer API, but I agree that forcing
>> RDD operations is not ideal. I'd be OK with adding a new method, but I
>> agree with Felix that we cannot break the API for something like this.
>>
>> On Thu, Jan 5, 2017 at 12:44 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>
>>> Given how Word2Vec is used in the pipeline model in the new ml
>>> implementation, we might need to keep the current behavior?
>>>
>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala
>>>
>>> _____________________________
>>> From: Asher Krim <ak...@hubspot.com>
>>> Sent: Tuesday, January 3, 2017 11:58 PM
>>> Subject: Re: ml word2vec findSynonyms return type
>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <jos...@databricks.com>, <dev@spark.apache.org>
>>>
>>> The jira: https://issues.apache.org/jira/browse/SPARK-17629
>>>
>>> Adding new methods could result in method clutter. Changing the behavior
>>> of non-experimental classes is unfortunate (ml Word2Vec was marked
>>> Experimental until Spark 2.0). Neither option is great. If I had to
>>> pick, I would rather change the existing methods to keep the class
>>> simpler moving forward.
>>>
>>> On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>
>>>> Could you link to the JIRA here?
>>>>
>>>> What you suggest makes sense to me. Though we might want to maintain
>>>> compatibility and add a new method instead of changing the return type
>>>> of the existing one.
>>>>
>>>> _____________________________
>>>> From: Asher Krim <ak...@hubspot.com>
>>>> Sent: Wednesday, December 28, 2016 11:52 AM
>>>> Subject: ml word2vec findSynonyms return type
>>>> To: <dev@spark.apache.org>
>>>> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <jos...@databricks.com>
>>>>
>>>> Hey all,
>>>>
>>>> I would like to propose changing the return type of `findSynonyms` in
>>>> ml's Word2Vec
>>>> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>:
>>>>
>>>>   def findSynonyms(word: String, num: Int): DataFrame = {
>>>>     val spark = SparkSession.builder().getOrCreate()
>>>>     spark.createDataFrame(wordVectors.findSynonyms(word, num))
>>>>       .toDF("word", "similarity")
>>>>   }
>>>>
>>>> I find it very strange that the results are parallelized before being
>>>> returned to the user. The results are already on the driver to begin
>>>> with, and I can imagine that for most use cases (and definitely for
>>>> ours) the synonyms are collected right back to the driver. This incurs
>>>> both the added cost of shipping data to and from the cluster and a more
>>>> cumbersome interface than needed.
>>>>
>>>> Can we change it to just the following?
>>>>
>>>>   def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>>>>     wordVectors.findSynonyms(word, num)
>>>>   }
>>>>
>>>> If the user wants the results parallelized, they can still do so on
>>>> their own.
>>>>
>>>> (I had brought this up a while back in Jira. It was suggested that the
>>>> mailing list would be a better forum to discuss it, so here we are.)
>>>>
>>>> Thanks,
>>>> --
>>>> Asher Krim
>>>> Senior Software Engineer
>>
>> --
>> Joseph Bradley
>> Software Engineer - Machine Learning
>> Databricks, Inc.
>> http://databricks.com

--
Asher Krim
Senior Software Engineer
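[Editor's note: the core point of the thread is that synonym lookup is a brute-force similarity scan over an in-memory vector table, so the results exist on the driver before anything is parallelized. A minimal plain-Scala sketch of that computation follows; the object name, `findSynonymsLocal`, and the toy three-word vector table are all illustrative assumptions, not Spark's actual API or data.]

```scala
// Hypothetical, driver-local sketch of what a findSynonyms-style lookup
// computes: brute-force cosine similarity over an in-memory word-vector
// table, returning a plain Array -- no cluster round trip involved.
object FindSynonymsSketch {

  // Toy vector table standing in for a trained model's word vectors.
  val vectors: Map[String, Array[Double]] = Map(
    "king"  -> Array(0.9, 0.1, 0.0),
    "queen" -> Array(0.85, 0.2, 0.05),
    "apple" -> Array(0.0, 0.1, 0.95)
  )

  private def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na  = math.sqrt(a.map(x => x * x).sum)
    val nb  = math.sqrt(b.map(x => x * x).sum)
    dot / (na * nb)
  }

  // Score every other word against the query, sort descending, keep top num.
  def findSynonymsLocal(word: String, num: Int): Array[(String, Double)] = {
    val query = vectors(word)
    vectors.toArray
      .collect { case (w, v) if w != word => (w, cosine(query, v)) }
      .sortBy(-_._2)
      .take(num)
  }

  def main(args: Array[String]): Unit = {
    findSynonymsLocal("king", 2).foreach { case (w, s) =>
      println(f"$w%s $s%.3f")
    }
  }
}
```

A caller who does want a distributed result can wrap such an array themselves (e.g. via `spark.createDataFrame(...)`), which is the "they can still do so on their own" point above.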