srowen commented on issue #26722: [SPARK-24666][ML] Fix infinity vectors produced by Word2Vec when numIterations are large
URL: https://github.com/apache/spark/pull/26722#issuecomment-560431648

Ah right, disregard my previous comment. Am I right that the original implementation, being single-threaded, computes just one updated vector per word per iteration, while the Spark implementation comes up with several, because the word may appear in multiple partitions? In that case, adding them doesn't make sense; it would make sense to average them. That's not quite the same as dividing by the number of partitions, as the word may not appear in all partitions. You could accumulate a simple count in reduceByKey and then divide the sum by the count?
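For concreteness, a minimal sketch of what that might look like, assuming the per-partition updates arrive as an RDD of (wordIndex, vector) pairs; the names `partialUpdates` and `averageUpdates` are hypothetical and not part of the actual Word2Vec code:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch: average per-word vector updates across partitions
// instead of summing them.
def averageUpdates(partialUpdates: RDD[(Int, Array[Float])]): RDD[(Int, Array[Float])] = {
  partialUpdates
    // Pair each partial vector with a count of 1 so reduceByKey can
    // accumulate both the element-wise sum and the number of partitions
    // that actually contributed an update for this word.
    .mapValues(v => (v, 1))
    .reduceByKey { case ((v1, c1), (v2, c2)) =>
      val sum = new Array[Float](v1.length)
      var i = 0
      while (i < v1.length) { sum(i) = v1(i) + v2(i); i += 1 }
      (sum, c1 + c2)
    }
    // Divide the summed vector by the count, i.e. average only over the
    // partitions where the word actually appeared, not over all partitions.
    .mapValues { case (sum, count) => sum.map(_ / count) }
}
```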