[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user nzw0301 commented on the issue: https://github.com/apache/spark/pull/19372 Thank you for your kindful reviews! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19372 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19372 **[Test build #3943 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3943/testReport)** for PR 19372 at commit [`2ea3f18`](https://github.com/apache/spark/commit/2ea3f18bb002c38f387e01c31c6fb3ce2cb37231). * This patch passes all tests. * This patch **does not merge cleanly**. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19372 **[Test build #3943 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3943/testReport)** for PR 19372 at commit [`2ea3f18`](https://github.com/apache/spark/commit/2ea3f18bb002c38f387e01c31c6fb3ce2cb37231). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user nzw0301 commented on the issue: https://github.com/apache/spark/pull/19372 I updated the results of word2vec example based on this PR in the first comment. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user nzw0301 commented on the issue: https://github.com/apache/spark/pull/19372 Thank you for your reviews, @LowikC. Like this? ```scala val totalWordsCounts = numIterations * trainWordsCount + 1 val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount alpha = learningRate * (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) / totalWordsCounts) if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001 logInfo("wordCount = " + (wordCount + numWordsProcessedInPreviousIterations) + ", alpha = " + alpha) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user LowikC commented on the issue: https://github.com/apache/spark/pull/19372 Looks good to me. Maybe @nzw0301 could split the formula for readability? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19372 **[Test build #3939 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3939/testReport)** for PR 19372 at commit [`90735a9`](https://github.com/apache/spark/commit/90735a9da11cfbff4ac5133d2f01c6719626f306). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19372 **[Test build #3939 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3939/testReport)** for PR 19372 at commit [`90735a9`](https://github.com/apache/spark/commit/90735a9da11cfbff4ac5133d2f01c6719626f306). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19372 What do you think @LowikC ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user nzw0301 commented on the issue: https://github.com/apache/spark/pull/19372 Thank you for your comment, @LowikC. You are right, my PR code is incorrect. Correct update formula is ```scala alpha = learningRate * (1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount / (numIterations * trainWordsCount + 1)) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user LowikC commented on the issue: https://github.com/apache/spark/pull/19372 I think the PR is incorrect: - the original C code decreases the learning rate linearly from `starting_alpha` to 0, across all iterations new_alpha = `starting_alpha` * (1 - progress), progress = `word_count_actual` / (numIterations * numWordsPerIteration + 1) Note that `word_count_actual` counts the number of words processed so far, in all iterations. - in the PR, word_count_actual is called `wordCount`, but `wordCount` is reset at the beginning of each iteration (see https://github.com/nzw0301/spark/blob/e2a7d393e141405f658a68f99bc4a1f53816db95/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L365). - the current Spark code is also different: it decreases the learning rate linearly from `learningRate` to 0, in one iteration. At the beginning of the next iteration, the value is again `learningRate` - the "right" formula (to keep the behavior of the original C code) should be: ``` val numWordsToProcess = numIterations * trainWordsCount + 1 val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount val numWordsProcessedInCurrentIteration = numPartitions * wordCount.toDouble val progress = (numWordsProcessedInPreviousIterations + numWordsProcessedInCurrentIteration) / numWordsToProcess alpha = learningRate * (1 - progress) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org