Hi,
I've been experimenting with the Spark Word2Vec implementation in the
MLlib package.
It seems to me that only the preparatory steps (stages 0-2, which
prepare the data) actually run in a distributed way. In stage 3
(mapPartitionsWithIndex at Word2Vec.scala:312), only one node seems to
be working, using a single CPU core.
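
For concreteness, here is a minimal sketch of this kind of job (an
illustration, not the exact code behind the observation above). The toy
corpus, the local[4] master, the object and method names, and the
parameter values are all assumptions. As far as the MLlib API docs say,
setNumPartitions defaults to 1, which may be why the training stage
keeps only one task busy; the docs also recommend a small number for
accuracy, so raising it is a trade-off rather than a fix.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

// Hypothetical helper object, for illustration only.
object Word2VecPartitionsSketch {

  // Trains on a toy corpus; numPartitions controls how many partitions
  // (and hence parallel tasks) the training stage is split into.
  def train(sc: SparkContext, numPartitions: Int): Word2VecModel = {
    // Assumed toy data: each element is one tokenized sentence.
    val sentences = sc.parallelize(
      Seq.fill(200)(Seq("spark", "trains", "word", "vectors")))

    new Word2Vec()
      .setVectorSize(10)                // small vectors keep the sketch fast
      .setMinCount(1)                   // keep every word of the toy corpus
      .setNumPartitions(numPartitions)  // library default is 1
      .fit(sentences)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("word2vec-partitions-sketch")
      .setMaster("local[4]")            // assumed local master for testing
    val sc = new SparkContext(conf)
    try {
      val model = train(sc, numPartitions = 4)
      println(model.getVectors.keys.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

With numPartitions = 1 the Spark UI should show a single task in the
training stage; with a larger value there should be one task per
partition.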

I suppose this is related to the discussion in [1], which essentially
states that the original algorithm allows for multi-threading but not
for distributed computation, because of its frequent internal
communication.

As far as I understand, this issue has not been fully resolved in Spark
yet. Am I interpreting the current situation correctly?

Thanks!
Carsten

[1] https://issues.apache.org/jira/browse/SPARK-2510

-- 
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schno...@ukp.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
(AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL)
www.kdsl.tu-darmstadt.de



