[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
GitHub user nzw0301 commented on the issue: https://github.com/apache/spark/pull/19372 Thank you for your kind reviews! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19372: [SPARK-22156][MLLIB] Fix update equation of learn...
GitHub user nzw0301 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19372#discussion_r142332793

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -368,11 +371,12 @@ class Word2Vec extends Serializable with Logging {
               var wc = wordCount
               if (wordCount - lastWordCount > 1) {
                 lwc = wordCount
-                // TODO: discount by iteration?
-                alpha =
-                  learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))
+                alpha = learningRate *
+                  (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) /
+                    totalWordsCounts)
                 if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
-                logInfo("wordCount = " + wordCount + ", alpha = " + alpha)
+                logInfo("wordCount = " + (wordCount + numWordsProcessedInPreviousIterations) +
--- End diff --

@srowen Done.
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
GitHub user nzw0301 commented on the issue: https://github.com/apache/spark/pull/19372 I updated the results of the word2vec example in the first comment to reflect this PR.
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
GitHub user nzw0301 commented on the issue:

https://github.com/apache/spark/pull/19372

Thank you for your reviews, @LowikC. Like this?

```scala
val totalWordsCounts = numIterations * trainWordsCount + 1
val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount

alpha = learningRate *
  (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) /
    totalWordsCounts)
if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
logInfo("wordCount = " + (wordCount + numWordsProcessedInPreviousIterations) +
  ", alpha = " + alpha)
```
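For readers who want to try the schedule outside Spark, here is a minimal sketch that wraps the snippet above in a pure function. `AlphaSchedule` and its parameter list are hypothetical and not part of Spark; the `Int`/`Long` parameter types are assumptions, and `math.max` stands in for the `if` floor.

```scala
// Hypothetical helper, not part of Spark: it only isolates the arithmetic
// of the snippet above so it can be evaluated without a cluster.
object AlphaSchedule {
  def alpha(
      learningRate: Double,
      numPartitions: Int,
      wordCount: Long,        // assumed: words seen so far in the current iteration
      k: Int,                 // assumed: 1-based index of the current iteration
      numIterations: Int,
      trainWordsCount: Long): Double = {
    val totalWordsCounts = numIterations * trainWordsCount + 1
    val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount
    val decayed = learningRate *
      (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) /
        totalWordsCounts)
    // Same floor as the snippet: never drop below 0.01% of the initial rate.
    math.max(decayed, learningRate * 0.0001)
  }
}
```

With `learningRate = 0.025`, `numIterations = 5` and `trainWordsCount = 1000` (illustrative values), the result is 0.025 for the first word of iteration 1 and shrinks as `k` or `wordCount` grows, until the floor applies.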
[GitHub] spark pull request #19372: [SPARK-22156][MLLIB] Fix update equation of learn...
GitHub user nzw0301 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19372#discussion_r141609260

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -368,9 +368,9 @@ class Word2Vec extends Serializable with Logging {
               var wc = wordCount
               if (wordCount - lastWordCount > 1) {
                 lwc = wordCount
-                alpha =
-                  learningRate *
-                    (1 - numPartitions * wordCount.toDouble / (numIterations * trainWordsCount + 1))
+                alpha = learningRate *
+                  (1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount /
--- End diff --

Oh... Thanks! I fixed it.
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
GitHub user nzw0301 commented on the issue:

https://github.com/apache/spark/pull/19372

Thank you for your comment, @LowikC. You are right; my PR code is incorrect. The correct update formula is

```scala
alpha = learningRate *
  (1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount /
    (numIterations * trainWordsCount + 1))
```
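A quick check of how Scala groups the expression in this comment may help explain the follow-up review. Because `/` binds tighter than `+` and `-`, only `(k - 1) * trainWordsCount` is divided as written. The sketch below compares that reading with a version whose numerator is grouped first, which is the shape of the later "Like this?" snippet. `PrecedenceCheck` and all values are illustrative assumptions, not taken from the thread.

```scala
// Hypothetical object with illustrative values; none come from the thread.
object PrecedenceCheck {
  val learningRate = 0.025
  val numPartitions = 1
  val wordCount = 500L
  val k = 2
  val numIterations = 5
  val trainWordsCount = 1000L

  // As written in the comment: the division applies only to
  // (k - 1) * trainWordsCount, and with these Long operands it is an
  // integer division.
  val unparenthesized = learningRate *
    (1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount /
      (numIterations * trainWordsCount + 1))

  // With the whole numerator grouped before dividing.
  val parenthesized = learningRate *
    (1 - (numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount) /
      (numIterations * trainWordsCount + 1))
}
```

With these values the ungrouped form goes negative, while the grouped form stays between 0 and `learningRate`.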
[GitHub] spark issue #19372: [MLLIB] Fix update equation of learning rate in Word2Vec...
GitHub user nzw0301 commented on the issue: https://github.com/apache/spark/pull/19372 Thank you for your comment, @srowen. I'll create an issue on JIRA.
[GitHub] spark pull request #19372: [MLLIB] Fix update equation of learning rate in W...
GitHub user nzw0301 opened a pull request:

https://github.com/apache/spark/pull/19372

[MLLIB] Fix update equation of learning rate in Word2Vec.scala

## What changes were proposed in this pull request?

The current learning rate equation is incorrect when `numIterations` > `1`. This PR is based on the [original C code](https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L393).

cc: @mengxr

## How was this patch tested?

Manual tests. I modified [this example code](https://spark.apache.org/docs/2.1.1/mllib-feature-extraction.html#example).

### `numIterations=1`

Code

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()
val model = word2vec.fit(input)
val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
```

Result

```
0 0.3267880082130432
2 0.21420614421367645
3 0.19923636317253113
9 0.1063166931271553
4 0.0397246889770031
```

### `numIterations=5`

Code

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()
word2vec.setNumIterations(5)
val model = word2vec.fit(input)
val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
```

Result

```
2 0.9803512096405029
0 0.9774332642555237
3 0.9450059533119202
4 0.9394038319587708
9 -0.7876168489456177
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nzw0301/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19372.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19372

commit e2a7d393e141405f658a68f99bc4a1f53816db95
Author: Kento NOZAWA <k_...@klis.tsukuba.ac.jp>
Date:   2017-09-27T17:04:03Z

    Update equation of lr
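One way to read the claim that the equation breaks when `numIterations` > `1` is sketched below. It assumes that `wordCount` restarts at zero in every iteration and that there is a single partition; the floor at `learningRate * 0.0001` is left out for brevity. `IterationStartAlpha`, `before`, `after` and all values are hypothetical names and numbers for illustration. `before` follows the lines removed in the review diff for `Word2Vec.scala`, and `after` follows the form discussed in the review.

```scala
// Hypothetical sketch with illustrative values; not Spark code.
object IterationStartAlpha {
  val learningRate = 0.025
  val numIterations = 5
  val trainWordsCount = 1000L

  // Pre-PR form (numPartitions assumed to be 1): no dependence on
  // the iteration index k.
  def before(wordCount: Long): Double =
    learningRate * (1 - wordCount.toDouble / (trainWordsCount + 1))

  // Form discussed in the review: progress is measured over all iterations.
  def after(wordCount: Long, k: Int): Double =
    learningRate *
      (1 - (wordCount.toDouble + (k - 1) * trainWordsCount) /
        (numIterations * trainWordsCount + 1))

  // Alpha at the first word of each iteration.
  val beforeStarts: Seq[Double] = (1 to numIterations).map(_ => before(0L))
  val afterStarts: Seq[Double] = (1 to numIterations).map(k => after(0L, k))
}
```

Under these assumptions `beforeStarts` returns to `learningRate` at the start of every pass, while `afterStarts` keeps decreasing from one pass to the next.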