[ https://issues.apache.org/jira/browse/SPARK-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389244#comment-14389244 ]
Manoj Kumar edited comment on SPARK-6065 at 3/31/15 7:31 PM:
-------------------------------------------------------------

[~josephkb] I would like to work on this. Does this involve storing a pre-computed distance matrix, with each row storing the cosine distance with respect to all other words? And in terms of API, what do we name the helper function that computes this matrix (which ideally should be called once before multiple calls to findSynonyms)?


was (Author: mechcoder):
[~josephkb] I would like to work on this. Does this involve storing a pre-computed distance matrix, with each row storing the cosine distance with respect to all other words? And in terms of API, what do we name the helper function that computes this matrix (which ideally should be called before multiple calls to findSynonyms)?

> Optimize word2vec.findSynonyms speed
> ------------------------------------
>
>                 Key: SPARK-6065
>                 URL: https://issues.apache.org/jira/browse/SPARK-6065
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> word2vec.findSynonyms iterates through the entire vocabulary to find similar
> words. This is really slow relative to the [gcode-hosted word2vec
> implementation | https://code.google.com/p/word2vec/]. It should be
> optimized by storing words in a datastructure designed for finding nearest
> neighbors.
> This would require storing a copy of the model (basically an inverted
> dictionary), which could be a problem if users have a big model (e.g., 100
> features x 10M words or phrases = big dictionary). It might be best to
> provide a function for converting the model into a model optimized for
> findSynonyms.
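
To make the "compute once, query many times" idea in the comment concrete, here is a minimal sketch in plain Python. It is not MLlib code and does not reflect how Spark implements or later resolved this issue. The class name `SynonymIndex`, the helper `normalize`, and the toy vectors are all hypothetical. Note the swap: instead of the full pairwise distance matrix proposed above (which for 10M words would have on the order of 10^14 entries), the sketch caches only unit-length vectors, so each query is still a linear scan but cosine similarity reduces to a dot product.

```python
import heapq
import math


def normalize(vec):
    """Return vec scaled to unit length (a copy if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class SynonymIndex:
    """Hypothetical helper (not an MLlib class): caches unit-length vectors."""

    def __init__(self, word_vectors):
        # One-time cost: normalize every vector so that cosine similarity
        # becomes a plain dot product at query time.
        self.words = list(word_vectors)
        self.unit = [normalize(word_vectors[w]) for w in self.words]
        self.position = {w: i for i, w in enumerate(self.words)}

    def find_synonyms(self, word, num):
        """Return the num words most cosine-similar to word, best first."""
        query = self.unit[self.position[word]]
        scored = (
            (sum(a * b for a, b in zip(query, vec)), other)
            for other, vec in zip(self.words, self.unit)
            if other != word
        )
        # heapq.nlargest keeps only num candidates instead of sorting all.
        return [(w, s) for s, w in heapq.nlargest(num, scored)]


# Toy vocabulary, invented purely for illustration.
vectors = {
    "king": [1.0, 0.0],
    "queen": [0.9, 0.1],
    "apple": [0.0, 1.0],
}
index = SynonymIndex(vectors)            # built once
result = index.find_synonyms("king", 2)  # can be called many times
```

The index is built once and then reused across calls, which is the pattern the proposed helper function would follow. A structure designed for nearest-neighbor search, as the issue description suggests, would replace the linear scan inside `find_synonyms`.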