Hi, I've been experimenting with the Spark Word2Vec implementation in the MLlib package. It seems to me that only the preparatory steps are actually performed in a distributed way, i.e. stages 0-2, which prepare the data. In stage 3 (mapPartitionsWithIndex at Word2Vec.scala:312), only one node seems to be working, using one CPU.
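For reference, here is a minimal sketch of the kind of job I mean. It is not my exact code: the input path, the tokenization and the parameter values are placeholders chosen for illustration. The only calls it relies on are the MLlib `Word2Vec` setters and `fit`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

object Word2VecPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Word2VecPartitionsSketch")
    val sc = new SparkContext(conf)

    // Hypothetical input: one whitespace-tokenized sentence per line.
    val sentences = sc.textFile("hdfs:///path/to/corpus.txt")
      .map(_.split(" ").toSeq)

    val word2vec = new Word2Vec()
      .setVectorSize(100)   // illustrative value
      .setNumPartitions(8)  // illustrative value; the documented default is 1

    // The training stage runs over the input repartitioned to numPartitions.
    val model = word2vec.fit(sentences)

    println(s"Vocabulary size: ${model.getVectors.size}")

    sc.stop()
  }
}
```

As far as I can tell from the API documentation, `setNumPartitions` defaults to 1 and the docs recommend keeping the number small for accuracy, so with the default a single training task in that stage would be expected.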
I suppose this is related to the discussion in [1], essentially stating that the original algorithm allows for multi-threading, but not for distributed computation due to frequent internal communication. To my understanding, this issue has not been fully resolved in Spark, has it? I just wonder whether I am interpreting the current situation correctly.

Thanks!
Carsten

[1] https://issues.apache.org/jira/browse/SPARK-2510

-- 
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schno...@ukp.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources (AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL) www.kdsl.tu-darmstadt.de