[ https://issues.apache.org/jira/browse/SPARK-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075951#comment-14075951 ]
Xiangrui Meng commented on SPARK-2510:
--------------------------------------

Had an offline discussion with [~liquanpei] and checked the C implementation of word2vec. It is not embarrassingly parallel: it frequently updates the global vectors, which is fine for multithreading but problematic in a distributed setting. We are thinking about making stochastic updates within each partition and then merging the vectors. Averaging works for SGD, but I doubt whether it would work here. More to investigate.

> word2vec: Distributed Representation of Words
> ---------------------------------------------
>
>                 Key: SPARK-2510
>                 URL: https://issues.apache.org/jira/browse/SPARK-2510
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Liquan Pei
>            Assignee: Liquan Pei
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> We would like to add a parallel implementation of word2vec to MLlib. word2vec finds distributed representations of words by training on large data sets. The Spark programming model fits word2vec nicely, as the training algorithm is embarrassingly parallel. Our initial implementation will focus on the skip-gram model with negative sampling.

--
This message was sent by Atlassian JIRA (v6.2#6252)
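The "stochastic updates within each partition, then merge the vectors" idea the comment describes can be sketched as follows. This is illustrative only: a toy 1-D least-squares objective stands in for the skip-gram objective (whose frequent global-vector updates are the actual obstacle), and the function names `sgd_partition` / `train_distributed` are hypothetical, not proposed MLlib API.

```python
def sgd_partition(w, data, lr=0.02):
    """Run plain SGD on a local copy of the parameter within one partition.

    Toy objective: minimize (w*x - y)^2 over (x, y) pairs; the true
    parameter here is w = 3 because the data satisfy y = 3x.
    """
    for x, y in data:
        grad = 2 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
        w -= lr * grad
    return w

def train_distributed(partitions, w0=0.0, rounds=5):
    """Per-partition stochastic updates, then merge by parameter averaging."""
    w = w0
    for _ in range(rounds):
        local = [sgd_partition(w, part) for part in partitions]
        w = sum(local) / len(local)  # merge step: simple average
    return w

# Toy data consistent with y = 3x, split across two "partitions".
parts = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train_distributed(parts)
```

On this convex toy problem, averaging after each round does converge toward the optimum; the open question raised in the comment is whether the same merge step is sound for word2vec, where the objective is non-convex and different partitions may drift toward incompatible embeddings.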