[ https://issues.apache.org/jira/browse/SPARK-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075951#comment-14075951 ]

Xiangrui Meng commented on SPARK-2510:
--------------------------------------

Had an offline discussion with [~liquanpei] and checked the C implementation of 
word2vec. It is not embarrassingly parallel because it frequently updates the 
global word vectors, which is fine for multithreading on shared memory but 
problematic in a distributed setting. We are thinking about making stochastic 
updates within each partition and then merging the per-partition vectors. 
Averaging works for SGD, but I doubt whether it would work here. More to 
investigate.
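The partition-and-average idea under discussion can be sketched as follows. This is a toy illustration in plain Python, not Spark code: lists stand in for RDD partitions, the "gradient" is a simple pull toward a target rather than a real skip-gram update, and all function names (`local_sgd`, `merge_by_averaging`) are hypothetical. Whether averaging preserves word2vec quality is exactly the open question above.

```python
# Sketch: each partition trains on a private copy of the shared vectors,
# then the per-partition models are merged by element-wise averaging.

DIM = 4

def local_sgd(vectors, samples, lr=0.1):
    """Apply toy stochastic updates to a private copy of the shared vectors."""
    local = {w: v[:] for w, v in vectors.items()}  # copy shared state
    for word, target in samples:
        vec = local[word]
        for i in range(DIM):
            # toy gradient step pulling vec toward target (not real skip-gram)
            vec[i] += lr * (target[i] - vec[i])
    return local

def merge_by_averaging(models):
    """Element-wise average of the per-partition models (the merge step
    whose effectiveness for word2vec is in doubt)."""
    words = models[0].keys()
    n = len(models)
    return {
        w: [sum(m[w][i] for m in models) / n for i in range(DIM)]
        for w in words
    }

# usage: two "partitions" see different samples, then the models are merged
shared = {"cat": [0.0] * DIM, "dog": [0.0] * DIM}
partitions = [
    [("cat", [1.0, 0.0, 0.0, 0.0]), ("dog", [0.0, 1.0, 0.0, 0.0])],
    [("cat", [0.0, 0.0, 1.0, 0.0]), ("dog", [0.0, 0.0, 0.0, 1.0])],
]
models = [local_sgd(shared, part) for part in partitions]
merged = merge_by_averaging(models)
```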

> word2vec: Distributed Representation of Words
> ---------------------------------------------
>
>                 Key: SPARK-2510
>                 URL: https://issues.apache.org/jira/browse/SPARK-2510
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Liquan Pei
>            Assignee: Liquan Pei
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> We would like to add a parallel implementation of word2vec to MLlib. word2vec 
> learns distributed representations of words by training on large data sets. 
> The Spark programming model fits nicely with word2vec, as the training 
> algorithm of word2vec is embarrassingly parallel. We will focus on the 
> skip-gram model and negative sampling in our initial implementation. 
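For context, the skip-gram-with-negative-sampling update the description targets can be sketched as a single stochastic step: the true context word is treated as label 1 and each sampled negative word as label 0, with both sides of each dot product updated. This is a toy illustration in plain Python; `sgns_step` and the vector names are hypothetical, not the eventual MLlib API.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_step(v_center, u_context, u_negatives, lr=0.025):
    """One stochastic gradient step for skip-gram with negative sampling."""
    dim = len(v_center)
    grad_center = [0.0] * dim
    pairs = [(u_context, 1.0)] + [(u, 0.0) for u in u_negatives]
    for u, label in pairs:
        score = sigmoid(sum(v_center[i] * u[i] for i in range(dim)))
        g = lr * (label - score)          # gradient of the logistic loss
        for i in range(dim):
            grad_center[i] += g * u[i]    # accumulate for the center word
            u[i] += g * v_center[i]       # update context/negative vector
    for i in range(dim):
        v_center[i] += grad_center[i]

# usage: one update on 2-dimensional toy vectors
v = [0.1, 0.0]
u_pos = [0.2, 0.0]
u_negs = [[0.3, 0.0]]
sgns_step(v, u_pos, u_negs)
```

After the step, the context vector is pulled toward the center word and the negative sample is pushed away from it.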



--
This message was sent by Atlassian JIRA
(v6.2#6252)