srowen commented on issue #26722: [SPARK-24666][ML] Fix infinity vectors produced by Word2Vec when numIterations are large
URL: https://github.com/apache/spark/pull/26722#issuecomment-560431648

Ah right, disregard my previous comment. Am I right that the original implementation, being single-threaded, computes just one updated vector per word per iteration, while the Spark implementation comes up with several, because the word may appear in multiple partitions? In that case, adding them doesn't make sense; it would make sense to average them. That's not quite the same as dividing by the number of partitions, as the word may not appear in all partitions. You could accumulate a simple count in reduceByKey and then divide the sum by the count?
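For concreteness, a minimal sketch of what that might look like, assuming the per-partition updates arrive as an RDD of (wordIndex, vector) pairs; the names `partialUpdates` and `averageUpdates` are hypothetical and not part of the actual Word2Vec code:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch: average per-word vector updates across partitions
// instead of summing them.
def averageUpdates(partialUpdates: RDD[(Int, Array[Float])]): RDD[(Int, Array[Float])] = {
  partialUpdates
    // Pair each partial vector with a count of 1 so reduceByKey can
    // accumulate both the element-wise sum and the number of partitions
    // that actually contributed an update for this word.
    .mapValues(v => (v, 1))
    .reduceByKey { case ((v1, c1), (v2, c2)) =>
      val sum = new Array[Float](v1.length)
      var i = 0
      while (i < v1.length) { sum(i) = v1(i) + v2(i); i += 1 }
      (sum, c1 + c2)
    }
    // Divide the summed vector by the count, i.e. average only over the
    // partitions where the word actually appeared, not over all partitions.
    .mapValues { case (sum, count) => sum.map(_ / count) }
}
```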