Re: ml word2vec findSynonyms return type

Asher Krim Tue, 03 Jan 2017 23:59:03 -0800

The jira: https://issues.apache.org/jira/browse/SPARK-17629


Adding new methods could clutter the class. Changing the behavior of a
non-experimental class is unfortunate (ml Word2Vec was marked Experimental
until Spark 2.0). Neither option is great. If I had to pick, I would rather
change the existing methods, to keep the class simpler moving forward.
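
To make the trade-off concrete, here is a minimal sketch of what a caller
would write if `findSynonyms` returned `Array[(String, Double)]` and they
still wanted a DataFrame. The helper name `synonymsToDF`, the object name,
and the sample values are hypothetical and only for illustration; they are
not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SynonymsToDataFrame {
  // Hypothetical helper (not part of Spark): wraps driver-side
  // (word, similarity) pairs in a DataFrame, mirroring what the current
  // ml findSynonyms does internally.
  def synonymsToDF(spark: SparkSession,
                   synonyms: Array[(String, Double)]): DataFrame =
    spark.createDataFrame(synonyms.toSeq).toDF("word", "similarity")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("synonyms-sketch")
      .getOrCreate()

    // Stand-in values for what the proposed findSynonyms(word, num)
    // would return; a real call would go through a fitted Word2VecModel.
    val synonyms = Array(("queen", 0.91), ("prince", 0.87))

    synonymsToDF(spark, synonyms).show()
    spark.stop()
  }
}
```

Callers who only need the pairs on the driver would skip the helper
entirely and use the array directly.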


On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Could you link to the JIRA here?
>
> What you suggest makes sense to me. Though we might want to maintain
> compatibility and add a new method instead of changing the return type of
> the existing one.
>
>
> _____________________________
> From: Asher Krim <ak...@hubspot.com>
> Sent: Wednesday, December 28, 2016 11:52 AM
> Subject: ml word2vec findSynonyms return type
> To: <dev@spark.apache.org>
> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <
> jos...@databricks.com>
>
>
>
> Hey all,
>
> I would like to propose changing the return type of `findSynonyms` in ml's
> Word2Vec
> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>
> :
>
> def findSynonyms(word: String, num: Int): DataFrame = {
>   val spark = SparkSession.builder().getOrCreate()
>   spark.createDataFrame(wordVectors.findSynonyms(word, num))
>     .toDF("word", "similarity")
> }
>
> I find it very strange that the results are parallelized before being
> returned to the user. The results are already on the driver to begin with,
> and I can imagine that for most use cases (and definitely for ours) the
> synonyms are collected right back to the driver. This incurs the added
> cost of shipping data to and from the cluster, and makes the interface
> more cumbersome than it needs to be.
>
> Can we change it to just the following?
>
> def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>   wordVectors.findSynonyms(word, num)
> }
>
> If the user wants the results parallelized, they can still do so on their
> own.
>
> (I had brought this up a while back in Jira. It was suggested that the
> mailing list would be a better forum to discuss it, so here we are.)
>
> Thanks,
> --
> Asher Krim
> Senior Software Engineer
>
>
