[jira] [Comment Edited] (SPARK-6065) Optimize word2vec.findSynonyms speed

Manoj Kumar (JIRA) Tue, 31 Mar 2015 12:40:36 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389244#comment-14389244
 ]


Manoj Kumar edited comment on SPARK-6065 at 3/31/15 7:31 PM:
-------------------------------------------------------------

[~josephkb] I would like to work on this.

Does this involve storing a precomputed distance matrix, in which each row 
holds one word's cosine distance to every other word?

And in terms of API, what should we name the helper function that computes 
this matrix (ideally it would be called once, before multiple calls to 
findSynonyms)?
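
To make the question concrete, here is a minimal pure-Python sketch of the 
precomputed-matrix idea. It is not Spark code: the function names 
(precompute_similarities, find_synonyms) and the toy vectors are hypothetical, 
and it stores cosine similarity (higher means closer) rather than distance.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def precompute_similarities(word_vectors):
    """Hypothetical helper: build the full word-by-word cosine similarity matrix.

    Returns (words, matrix), where matrix[i][j] is the cosine similarity
    between words[i] and words[j].
    """
    words = sorted(word_vectors)
    unit = [normalize(word_vectors[w]) for w in words]
    # With unit-length vectors, cosine similarity reduces to a dot product.
    matrix = [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]
    return words, matrix


def find_synonyms(word, num, words, matrix):
    """Look up one precomputed row and return the num most similar words."""
    row = matrix[words.index(word)]
    ranked = sorted(
        ((w, s) for w, s in zip(words, row) if w != word),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:num]


# Toy vocabulary, made up for illustration only.
toy = {
    "king": [1.0, 0.9, 0.1],
    "queen": [0.9, 1.0, 0.1],
    "apple": [0.1, 0.0, 1.0],
}
words, matrix = precompute_similarities(toy)
closest = find_synonyms("king", 1, words, matrix)
```

The matrix has one entry per pair of words, so its size grows quadratically 
with the vocabulary. For the 10M-word case mentioned in the issue description 
that would be on the order of 10^14 entries, which is why a dense matrix is 
probably only practical for small vocabularies.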


was (Author: mechcoder):
[~josephkb] I would like to work on this.

Does this involve storing a pre-computed distance matrix, with each row storing 
the cosine distance with respect to all other words?

And in terms of API, what do we name the helper function that computes this 
matrix (which ideally should be called before multiple calls to findSynonyms)?

> Optimize word2vec.findSynonyms speed
> ------------------------------------
>
>                 Key: SPARK-6065
>                 URL: https://issues.apache.org/jira/browse/SPARK-6065
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> word2vec.findSynonyms iterates through the entire vocabulary to find similar 
> words.  This is really slow relative to the [gcode-hosted word2vec 
> implementation | https://code.google.com/p/word2vec/].  It should be 
> optimized by storing words in a data structure designed for finding nearest 
> neighbors.
> This would require storing a copy of the model (basically an inverted 
> dictionary), which could be a problem if users have a big model (e.g., 100 
> features x 10M words or phrases = big dictionary).  It might be best to 
> provide a function for converting the model into a model optimized for 
> findSynonyms.
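
One way to read the "convert the model" suggestion above is sketched below in 
pure Python. The class name SynonymIndex and its methods are hypothetical and 
are not part of MLlib. The sketch only caches unit-length copies of the 
vectors and selects the top results with a heap; it is still a linear scan 
over the vocabulary, not a true nearest-neighbor data structure, and it does 
hold the extra copy of the model that the description warns about.

```python
import heapq
import math


class SynonymIndex:
    """Hypothetical converted model: word vectors cached at unit length."""

    def __init__(self, word_vectors):
        # Normalize once at conversion time so each query needs only dot products.
        self.unit = {}
        for word, vec in word_vectors.items():
            norm = math.sqrt(sum(x * x for x in vec))
            self.unit[word] = [x / norm for x in vec] if norm > 0 else list(vec)

    def find_synonyms(self, word, num):
        """Return up to num (word, cosine similarity) pairs, most similar first."""
        query = self.unit[word]
        scored = (
            (sum(a * b for a, b in zip(query, vec)), other)
            for other, vec in self.unit.items()
            if other != word
        )
        # nlargest keeps only num candidates instead of sorting the whole vocabulary.
        return [(w, s) for s, w in heapq.nlargest(num, scored)]


# Toy vocabulary, made up for illustration only.
index = SynonymIndex({
    "king": [1.0, 0.9, 0.1],
    "queen": [0.9, 1.0, 0.1],
    "apple": [0.1, 0.0, 1.0],
})
results = index.find_synonyms("king", 2)
```

Keeping the conversion as an explicit, separate step means users with very 
large models can skip it and avoid the additional memory.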



