GitHub user nzw0301 opened a pull request:

    https://github.com/apache/spark/pull/19372

    [MLLIB] Fix update equation of learning rate in Word2Vec.scala

    ## What changes were proposed in this pull request?
    
    The current update equation for the learning rate is incorrect when `numIterations` > `1`.
    This PR corrects it based on the [original C code](https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L393).
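    
    For reference, a minimal sketch of the linear decay used around the linked line of `word2vec.c`, as I understand it. This is not the Spark implementation; `LearningRateSketch`, `decayedAlpha`, and all parameter names are hypothetical and chosen only for illustration.
    
    ```scala
    // Hypothetical helper, not Spark's code: linear learning-rate decay
    // in the style of word2vec.c.
    object LearningRateSketch {
      // Floor taken from word2vec.c: alpha never drops below 0.01% of its start value.
      val MinAlphaFraction = 1e-4
    
      def decayedAlpha(
          startingAlpha: Double,
          wordsProcessed: Long, // words seen so far, summed over ALL iterations
          trainWords: Long,     // words in one pass over the corpus
          numIterations: Int): Double = {
        // The denominator spans every iteration, so alpha decays over the whole
        // run instead of reaching the floor after the first pass.
        val totalWords = numIterations.toLong * trainWords + 1
        val alpha = startingAlpha * (1.0 - wordsProcessed.toDouble / totalWords)
        math.max(alpha, startingAlpha * MinAlphaFraction)
      }
    }
    ```
    
    With `startingAlpha = 0.025`, `trainWords = 1000`, and `numIterations = 5`, the rate after the first pass is still about `0.02`, and it only approaches the floor near the end of the fifth pass.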
    
    cc: @mengxr
    
    ## How was this patch tested?
    
    manual tests
    
    I modified [this example code](https://spark.apache.org/docs/2.1.1/mllib-feature-extraction.html#example).
    
    ### `numIterations=1`
    
    #### Code
    
    ```scala
    import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
    
    val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)
    
    val word2vec = new Word2Vec()
    
    val model = word2vec.fit(input)
    
    val synonyms = model.findSynonyms("1", 5)
    
    for((synonym, cosineSimilarity) <- synonyms) {
      println(s"$synonym $cosineSimilarity")
    }
    ```
    
    #### Result
    
    ```
    0 0.3267880082130432
    2 0.21420614421367645
    3 0.19923636317253113
    9 0.1063166931271553
    4 0.0397246889770031
    ```
    
    ### `numIterations=5`
    
    #### Code
    
    ```scala
    import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
    
    val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)
    
    val word2vec = new Word2Vec()
    word2vec.setNumIterations(5)
    
    val model = word2vec.fit(input)
    
    val synonyms = model.findSynonyms("1", 5)
    
    for((synonym, cosineSimilarity) <- synonyms) {
      println(s"$synonym $cosineSimilarity")
    }
    ```
    
    #### Result
    
    ```
    2 0.9803512096405029
    0 0.9774332642555237
    3 0.9450059533119202
    4 0.9394038319587708
    9 -0.7876168489456177
    ```


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nzw0301/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19372.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19372
    
----
commit e2a7d393e141405f658a68f99bc4a1f53816db95
Author: Kento NOZAWA <k_...@klis.tsukuba.ac.jp>
Date:   2017-09-27T17:04:03Z

    Update equation of lr

----


