Felix - I'm not sure I understand your example about pipeline models. Could
you elaborate? I'm talking about the `findSynonyms` methods, which AFAIK
have nothing to do with pipeline models.

Joseph - Cool, thanks, I'll open a PR in the next few days (and reopen
SPARK-17629 <https://issues.apache.org/jira/browse/SPARK-17629>).
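
For reference, a minimal sketch of the additive option, i.e. keeping the
existing DataFrame-returning method and adding a driver-local one next to it.
The name `findSynonymsArray` is only a placeholder, not something agreed on in
this thread, and `SynonymLookup` is a hypothetical stand-in for the relevant
part of ml's Word2VecModel. It assumes `wordVectors` is the wrapped mllib
`Word2VecModel`, as in the snippet quoted below.

```scala
import org.apache.spark.mllib.feature.{Word2VecModel => OldWord2VecModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical stand-in for the relevant part of ml's Word2VecModel.
class SynonymLookup(wordVectors: OldWord2VecModel) {

  // Possible addition (placeholder name): hand back the driver-local
  // result directly, without a round trip through the cluster.
  def findSynonymsArray(word: String, num: Int): Array[(String, Double)] =
    wordVectors.findSynonyms(word, num)

  // Existing signature, kept as-is for compatibility; it only wraps the
  // local result in a DataFrame.
  def findSynonyms(word: String, num: Int): DataFrame = {
    val spark = SparkSession.builder().getOrCreate()
    spark.createDataFrame(findSynonymsArray(word, num).toSeq)
      .toDF("word", "similarity")
  }
}
```

Callers that want a DataFrame keep calling `findSynonyms` unchanged, while
callers that would otherwise `collect()` right away could use the array
variant instead.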

On Fri, Jan 6, 2017 at 12:33 AM, Joseph Bradley <jos...@databricks.com>
wrote:

> We returned a DataFrame since it is a nicer API, but I agree forcing RDD
> operations is not ideal.  I'd be OK with adding a new method, but I agree
> with Felix that we cannot break the API for something like this.
>
> On Thu, Jan 5, 2017 at 12:44 PM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> Given how Word2Vec is used in the pipeline model in the new ml
>> implementation, we might need to keep the current behavior?
>>
>>
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala
>>
>>
>> _____________________________
>> From: Asher Krim <ak...@hubspot.com>
>> Sent: Tuesday, January 3, 2017 11:58 PM
>> Subject: Re: ml word2vec finSynonyms return type
>> To: Felix Cheung <felixcheun...@hotmail.com>
>> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <
>> jos...@databricks.com>, <dev@spark.apache.org>
>>
>>
>>
>> The jira: https://issues.apache.org/jira/browse/SPARK-17629
>>
>> Adding new methods could result in method clutter. Changing behavior of
>> non-experimental classes is unfortunate (ml Word2Vec was marked
>> Experimental until Spark 2.0). Neither option is great. If I had to pick, I
>> would rather change the existing methods to keep the class simpler moving
>> forward.
>>
>>
>> On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <felixcheun...@hotmail.com>
>> wrote:
>>
>>> Could you link to the JIRA here?
>>>
>>> What you suggest makes sense to me. Though we might want to maintain
>>> compatibility and add a new method instead of changing the return type of
>>> the existing one.
>>>
>>>
>>> _____________________________
>>> From: Asher Krim <ak...@hubspot.com>
>>> Sent: Wednesday, December 28, 2016 11:52 AM
>>> Subject: ml word2vec finSynonyms return type
>>> To: <dev@spark.apache.org>
>>> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <
>>> jos...@databricks.com>
>>>
>>>
>>>
>>> Hey all,
>>>
>>> I would like to propose changing the return type of `findSynonyms` in
>>> ml's Word2Vec
>>> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>
>>> :
>>>
>>> def findSynonyms(word: String, num: Int): DataFrame = {
>>>   val spark = SparkSession.builder().getOrCreate()
>>>   spark.createDataFrame(wordVectors.findSynonyms(word, num))
>>>     .toDF("word", "similarity")
>>> }
>>>
>>> I find it very strange that the results are parallelized before being
>>> returned to the user. The results are already on the driver to begin with,
>>> and I can imagine that for most use cases (and definitely for ours) the
>>> synonyms are collected right back to the driver. This incurs the added
>>> cost of shipping data to and from the cluster, and it makes the interface
>>> more cumbersome than it needs to be.
>>>
>>> Can we change it to just the following?
>>>
>>> def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>>>   wordVectors.findSynonyms(word, num)
>>> }
>>>
>>> If the user wants the results parallelized, they can still do so on
>>> their own.
>>>
>>> (I had brought this up a while back in Jira. It was suggested that the
>>> mailing list would be a better forum to discuss it, so here we are.)
>>>
>>> Thanks,
>>> --
>>> Asher Krim
>>> Senior Software Engineer
>>>
>>>
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>



-- 
Asher Krim
Senior Software Engineer
