Given how Word2Vec is used in the pipeline model in the new ml implementation, we might need to keep the current behavior?
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala

_____________________________
From: Asher Krim <ak...@hubspot.com>
Sent: Tuesday, January 3, 2017 11:58 PM
Subject: Re: ml word2vec findSynonyms return type
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <jos...@databricks.com>, <dev@spark.apache.org>

The jira: https://issues.apache.org/jira/browse/SPARK-17629

Adding new methods could result in method clutter. Changing the behavior of non-experimental classes is unfortunate (ml Word2Vec was marked Experimental until Spark 2.0). Neither option is great. If I had to pick, I would rather change the existing methods to keep the class simpler moving forward.

On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:

Could you link to the JIRA here?

What you suggest makes sense to me, though we might want to maintain compatibility and add a new method instead of changing the return type of the existing one.
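For context on the pipeline concern raised above, here is a minimal sketch of how ml's Word2Vec is typically used as a DataFrame transformer, loosely following the linked Word2VecExample (names like "text" and "result" and the toy sentences are illustrative, not from this thread):

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

object Word2VecPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Word2VecPipelineSketch")
      .master("local[*]") // local mode for illustration only
      .getOrCreate()

    // Input: a DataFrame with one column of token sequences.
    val docDF = spark.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    // Word2Vec is an Estimator: fit() learns word vectors,
    // transform() appends a document-vector column. This
    // DataFrame-in / DataFrame-out contract is what makes it
    // composable as a pipeline stage.
    val word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)

    val model = word2Vec.fit(docDF)
    model.transform(docDF).show(false)

    spark.stop()
  }
}
```

Note that this pipeline usage only depends on fit/transform; it does not itself call findSynonyms, which is why changing that method's return type would not break pipeline composition.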
_____________________________
From: Asher Krim <ak...@hubspot.com>
Sent: Wednesday, December 28, 2016 11:52 AM
Subject: ml word2vec findSynonyms return type
To: <dev@spark.apache.org>
Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <jos...@databricks.com>

Hey all,

I would like to propose changing the return type of `findSynonyms` in ml's Word2Vec (https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248):

    def findSynonyms(word: String, num: Int): DataFrame = {
      val spark = SparkSession.builder().getOrCreate()
      spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
    }

I find it very strange that the results are parallelized before being returned to the user. The results are already on the driver to begin with, and I can imagine that for most use cases (and definitely for ours) the synonyms are collected right back to the driver. This incurs both the added cost of shipping data to and from the cluster and a more cumbersome interface than needed. Can we change it to just the following?

    def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
      wordVectors.findSynonyms(word, num)
    }

If the user wants the results parallelized, they can still do so on their own.

(I had brought this up a while back in Jira. It was suggested that the mailing list would be a better forum to discuss it, so here we are.)

Thanks,
--
Asher Krim
Senior Software Engineer
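The "users can parallelize on their own" point above can be sketched as follows. This is a hypothetical fragment assuming the proposed Array[(String, Double)] signature (which is not what the shipped API returns), with `spark` and `model` assumed to be an existing SparkSession and Word2VecModel:

```scala
// Under the PROPOSED signature, findSynonyms would return a local array:
//   val synonyms: Array[(String, Double)] = model.findSynonyms("spark", 5)
// A caller who still wants a DataFrame can build one in a single line,
// paying the distribution cost only when they actually need it:
val synonyms: Array[(String, Double)] =
  Array(("hadoop", 0.91), ("mllib", 0.87)) // placeholder results for illustration

import spark.implicits._
val synonymsDF = synonyms.toSeq.toDF("word", "similarity")
```

This keeps the common collect-to-driver path free of a round trip through the cluster, while leaving the DataFrame form one call away.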