Github user LowikC commented on the issue:

    https://github.com/apache/spark/pull/19372
  
    I think the PR is incorrect:
    - The original C code decreases the learning rate linearly from `starting_alpha` to 0 across all iterations:
      ```
      new_alpha = starting_alpha * (1 - progress)
      progress  = word_count_actual / (numIterations * numWordsPerIteration + 1)
      ```
      Note that `word_count_actual` counts the number of words processed so far, across all iterations.
    
    - In the PR, `word_count_actual` corresponds to `wordCount`, but `wordCount` is reset at the beginning of each iteration (see https://github.com/nzw0301/spark/blob/e2a7d393e141405f658a68f99bc4a1f53816db95/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L365).
    
    - The current Spark code behaves differently from both: it decreases the learning rate linearly from `learningRate` to 0 within a single iteration, and at the beginning of the next iteration the value is reset to `learningRate`.
    
    - The "right" formula (to keep the behavior of the original C code) should be:
    ```
    // k is the current iteration (1-based); wordCount is the number of words
    // processed so far by this partition in the current iteration.
    val numWordsToProcess = numIterations * trainWordsCount + 1
    val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount
    val numWordsProcessedInCurrentIteration = numPartitions * wordCount.toDouble
    val progress = (numWordsProcessedInPreviousIterations +
      numWordsProcessedInCurrentIteration) / numWordsToProcess
    alpha = learningRate * (1 - progress)
    ```
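
    To make the difference concrete, here is a small self-contained Scala sketch comparing the two schedules. It is illustrative only, not the actual Spark code: `AlphaDecayDemo` and all the parameter values (`learningRate`, `trainWordsCount`, `numIterations`, `numPartitions`) are made up for the demonstration.
    ```
    object AlphaDecayDemo {
      val learningRate = 0.025       // hypothetical starting alpha
      val trainWordsCount = 1000L    // hypothetical corpus size
      val numIterations = 5
      val numPartitions = 1

      // Global decay, as in the original word2vec C code:
      // progress spans all iterations, so alpha never resets.
      def globalAlpha(k: Int, wordCount: Long): Double = {
        val numWordsToProcess = numIterations * trainWordsCount + 1
        val prev = (k - 1) * trainWordsCount
        val cur = numPartitions * wordCount.toDouble
        val progress = (prev + cur) / numWordsToProcess
        learningRate * (1 - progress)
      }

      // Per-iteration decay, as in the current Spark code:
      // alpha goes back to learningRate at the start of every iteration.
      def perIterationAlpha(wordCount: Long): Double = {
        val progress = wordCount.toDouble / (trainWordsCount + 1)
        learningRate * (1 - progress)
      }

      def main(args: Array[String]): Unit = {
        // At the start of iteration 3 the two schedules already disagree:
        // the global schedule has decayed for two full iterations, while
        // the per-iteration schedule is back at learningRate.
        println(f"global:        ${globalAlpha(3, 0L)}%.6f")
        println(f"per-iteration: ${perIterationAlpha(0L)}%.6f")
      }
    }
    ```
    With these numbers, the global schedule is already well below `learningRate` at the start of iteration 3, and only the global schedule ends near 0 at the end of the last iteration, which is the behavior of the C code the PR is trying to match.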

