[ https://issues.apache.org/jira/browse/SPARK-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487974#comment-14487974 ]
Manoj Kumar commented on SPARK-6065:
------------------------------------

Sorry for taking so long to get back to this. I'm not sure that data structures like a KDTree or BallTree are a good fit here, since from what I've read, tree-based indexes don't perform well on high-dimensional data (scikit-learn falls back to a brute-force search when the metric is cosine). We could use an algorithm like Locality Sensitive Hashing instead, but that might be overkill. WDYT?

> Optimize word2vec.findSynonyms speed
> ------------------------------------
>
>                 Key: SPARK-6065
>                 URL: https://issues.apache.org/jira/browse/SPARK-6065
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> word2vec.findSynonyms iterates through the entire vocabulary to find similar words. This is really slow relative to the [gcode-hosted word2vec implementation|https://code.google.com/p/word2vec/]. It should be optimized by storing words in a data structure designed for finding nearest neighbors.
> This would require storing a copy of the model (basically an inverted dictionary), which could be a problem if users have a big model (e.g., 100 features x 10M words or phrases = a big dictionary). It might be best to provide a function for converting the model into a model optimized for findSynonyms.
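For reference, here is a minimal Scala sketch of what the brute-force cosine-similarity scan amounts to. The names (SynonymIndex, plain arrays of words and vectors) are illustrative assumptions, not the actual Word2VecModel API; the point is just to make the per-query cost of iterating the whole vocabulary explicit.

// Minimal sketch, not the real Word2VecModel API: a brute-force cosine
// similarity scan over an in-memory word-vector table, roughly what
// iterating the entire vocabulary amounts to.
class SynonymIndex(words: Array[String], vectors: Array[Array[Float]]) {
  require(words.length == vectors.length, "one vector per word")

  // Pre-compute each word vector's norm once, so a query only costs one
  // dot product per vocabulary entry.
  private val norms: Array[Double] =
    vectors.map(v => math.sqrt(v.map(x => x.toDouble * x).sum))

  // Return the `num` words whose vectors are most cosine-similar to `query`.
  def findSynonyms(query: Array[Float], num: Int): Seq[(String, Double)] = {
    val qNorm = math.sqrt(query.map(x => x.toDouble * x).sum)
    val scored = words.indices.map { i =>
      var dot = 0.0
      var j = 0
      while (j < query.length) { dot += vectors(i)(j) * query(j); j += 1 }
      val denom = norms(i) * qNorm
      (words(i), if (denom == 0.0) 0.0 else dot / denom)
    }
    // O(|vocabulary| * dimensions) per call -- this linear scan is the cost
    // that a tree index or LSH would try to avoid.
    scored.sortBy(-_._2).take(num)
  }
}

An approximate scheme like LSH would trade exact top-`num` results for sub-linear query time, but it needs extra hashing structures stored alongside the model, which runs into the same storage concern the description raises about keeping a second copy of a large model.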