The jira: https://issues.apache.org/jira/browse/SPARK-17629
Adding new methods could result in method clutter. Changing behavior of
non-experimental classes is unfortunate (ml Word2Vec was marked
Experimental until Spark 2.0). Neither option is great. If I had to pick, I
would rather change the existing methods to keep the class simpler moving
forward.

On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Could you link to the JIRA here?
>
> What you suggest makes sense to me. Though we might want to maintain
> compatibility and add a new method instead of changing the return type of
> the existing one.
>
>
> _____________________________
> From: Asher Krim <ak...@hubspot.com>
> Sent: Wednesday, December 28, 2016 11:52 AM
> Subject: ml word2vec finSynonyms return type
> To: <dev@spark.apache.org>
> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <
> jos...@databricks.com>
>
>
>
> Hey all,
>
> I would like to propose changing the return type of `findSynonyms` in ml's
> Word2Vec
> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>
> :
>
> def findSynonyms(word: String, num: Int): DataFrame = {
>   val spark = SparkSession.builder().getOrCreate()
>   spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word",
> "similarity")
> }
>
> I find it very strange that the results are parallelized before being
> returned to the user. The results are already on the driver to begin with,
> and I can imagine that for most usecases (and definitely for ours) the
> synonyms are collected right back to the driver. This incurs both an added
> cost of shipping data to and from the cluster, as well as a more cumbersome
> interface than needed.
>
> Can we change it to just the following?
>
> def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>   wordVectors.findSynonyms(word, num)
> }
>
> If the user wants the results parallelized, they can still do so on their
> own.
>
> (I had brought this up a while back in Jira. It was suggested that the
> mailing list would be a better forum to discuss it, so here we are.)
>
> Thanks,
> --
> Asher Krim
> Senior Software Engineer
>
>
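
For reference, the compatibility option mentioned above (leave the existing
method alone and add a new one that returns a local array) might look
roughly like the sketch below. This is only an illustration under stated
assumptions: `SketchWord2VecModel`, the method name `findSynonymsArray`, and
the plain `Map` of vectors are stand-ins chosen for this example so that it
runs without Spark. It is not the code in Word2Vec.scala.

```scala
// Hypothetical, Spark-free stand-in for the model discussed in the thread.
// The vocabulary is a plain in-memory Map, so results never leave the driver.
final class SketchWord2VecModel(vectors: Map[String, Array[Double]]) {

  // Cosine similarity between two vectors of equal length.
  private def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // New method (assumed name): returns the top `num` synonyms as a local
  // array. An existing DataFrame-returning findSynonyms could stay in place
  // and wrap this result for callers that depend on the old return type.
  def findSynonymsArray(word: String, num: Int): Array[(String, Double)] = {
    val query = vectors.getOrElse(
      word, throw new IllegalArgumentException(s"$word is not in the vocabulary"))
    vectors.iterator
      .filter { case (w, _) => w != word }           // skip the query word itself
      .map { case (w, v) => (w, cosine(query, v)) }
      .toArray
      .sortBy { case (_, similarity) => -similarity } // most similar first
      .take(num)
  }
}
```

With this shape, callers that want a local result use the new method
directly, while existing callers of the DataFrame version are unaffected.
The cost is the extra method on the class, which is the clutter concern
raised at the top of this reply.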