yuhao yang created SPARK-11898:
----------------------------------

             Summary: Use broadcast for the global tables in Word2Vec
                 Key: SPARK-11898
                 URL: https://issues.apache.org/jira/browse/SPARK-11898
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.5.2
            Reporter: yuhao yang


syn0Global and syn1Global in Word2Vec are quite large objects with size (vocab 
* vectorSize * 8), yet they are passed to the workers using basic task
serialization.
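
To put the size estimate above in concrete terms, here is a minimal sketch of
the arithmetic. The object and method names are hypothetical and are not part
of Spark; the factor of 8 is taken from the description, and it is left as a
parameter because the real per-element width depends on the array type.

```scala
object Word2VecTableSize {
  // Size of one global table, following the estimate vocab * vectorSize * 8.
  // bytesPerElement defaults to the 8 used in the issue text (an assumption
  // carried over from the description, not checked against the source code).
  def tableBytes(vocabSize: Long, vectorSize: Long, bytesPerElement: Long = 8L): Long =
    vocabSize * vectorSize * bytesPerElement

  def main(args: Array[String]): Unit = {
    // The benchmark setting from the issue: 1M vocabulary, vectorSize 100.
    val perTable = tableBytes(1000000L, 100L)
    val bothTables = 2 * perTable
    println(s"one table: $perTable bytes, both tables: $bothTables bytes")
  }
}
```

Under that estimate a single table is 800,000,000 bytes for the benchmark
setting, which is the payload that plain task serialization has to ship along
with the tasks.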

Using broadcast can greatly improve performance. My benchmark shows that, for a
1M vocabulary and the default vectorSize of 100, changing to broadcast can:
1. decrease the worker memory consumption by 45%.
2. decrease the running time by 40%.
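
The difference between the two approaches can be sketched as follows. This is
an illustration only, not the actual Word2Vec change: BroadcastSketch,
firstComponentSum and the tiny table are hypothetical stand-ins, and the job
just sums one component per word so that both paths can be compared. It assumes
a local Spark installation on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Sums the first component of every word vector, reading the table either
  // through plain closure capture or through a broadcast variable.
  def firstComponentSum(sc: SparkContext, table: Array[Float], vectorSize: Int,
                        useBroadcast: Boolean): Double = {
    val vocabSize = table.length / vectorSize
    val wordIds = sc.parallelize(0 until vocabSize, numSlices = 4)
    if (useBroadcast) {
      // Broadcast: the table is distributed to each executor once and is
      // shared by the tasks that run there.
      val bcTable = sc.broadcast(table)
      val total = wordIds.map(id => bcTable.value(id * vectorSize).toDouble).sum()
      bcTable.unpersist() // release the executor-side copies when done
      total
    } else {
      // Closure capture: the table is serialized as part of every task.
      wordIds.map(id => table(id * vectorSize).toDouble).sum()
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Tiny stand-in for a table such as syn0Global (illustrative sizes only).
    val table = Array.fill[Float](1000 * 10)(0.5f)
    val viaClosure = firstComponentSum(sc, table, 10, useBroadcast = false)
    val viaBroadcast = firstComponentSum(sc, table, 10, useBroadcast = true)
    println(s"closure=$viaClosure broadcast=$viaBroadcast")
    sc.stop()
  }
}
```

Both paths compute the same result; what changes is how often the table is
serialized and shipped, which is where the memory and running time savings
reported above would come from.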





