[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-10-07 Thread nzw0301
Github user nzw0301 commented on the issue:

https://github.com/apache/spark/pull/19372
  
Thank you for your kindful reviews!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-10-07 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19372
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19372
  
**[Test build #3943 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3943/testReport)**
 for PR 19372 at commit 
[`2ea3f18`](https://github.com/apache/spark/commit/2ea3f18bb002c38f387e01c31c6fb3ce2cb37231).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19372
  
**[Test build #3943 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3943/testReport)**
 for PR 19372 at commit 
[`2ea3f18`](https://github.com/apache/spark/commit/2ea3f18bb002c38f387e01c31c6fb3ce2cb37231).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-10-02 Thread nzw0301
Github user nzw0301 commented on the issue:

https://github.com/apache/spark/pull/19372
  
I updated the results of word2vec example based on this PR in the first 
comment.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-10-02 Thread nzw0301
Github user nzw0301 commented on the issue:

https://github.com/apache/spark/pull/19372
  
Thank you for your reviews, @LowikC.

Like this?

```scala
val totalWordsCounts = numIterations * trainWordsCount + 1
val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount

alpha = learningRate *
  (1 - (numPartitions * wordCount.toDouble + 
numWordsProcessedInPreviousIterations) /
totalWordsCounts)
if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
logInfo("wordCount = " + (wordCount + 
numWordsProcessedInPreviousIterations) +
  ", alpha = " + alpha)
```



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-10-02 Thread LowikC
Github user LowikC commented on the issue:

https://github.com/apache/spark/pull/19372
  
Looks good to me.
Maybe @nzw0301 could split the formula for readability?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-09-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19372
  
**[Test build #3939 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3939/testReport)**
 for PR 19372 at commit 
[`90735a9`](https://github.com/apache/spark/commit/90735a9da11cfbff4ac5133d2f01c6719626f306).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-09-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19372
  
**[Test build #3939 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3939/testReport)**
 for PR 19372 at commit 
[`90735a9`](https://github.com/apache/spark/commit/90735a9da11cfbff4ac5133d2f01c6719626f306).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-09-30 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19372
  
What do you think @LowikC ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-09-28 Thread nzw0301
Github user nzw0301 commented on the issue:

https://github.com/apache/spark/pull/19372
  
Thank you for your comment, @LowikC.
You are right, my PR code is incorrect.

Correct update formula is

```scala
alpha = learningRate *
  (1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount /
(numIterations * trainWordsCount + 1))
```



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-09-28 Thread LowikC
Github user LowikC commented on the issue:

https://github.com/apache/spark/pull/19372
  
I think the PR is incorrect:
- the original C code decreases the learning rate linearly from 
`starting_alpha` to 0, across all iterations 
new_alpha = `starting_alpha` * (1 - progress), progress = 
`word_count_actual` / (numIterations * numWordsPerIteration + 1)
Note that `word_count_actual` counts the number of words processed so far, 
in all iterations.

- in the PR, word_count_actual is called `wordCount`, but `wordCount` is 
reset at the beginning of each iteration (see 
https://github.com/nzw0301/spark/blob/e2a7d393e141405f658a68f99bc4a1f53816db95/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L365).

- the current Spark code is also different: it decreases the learning rate 
linearly from `learningRate` to 0, in one iteration. At the beginning of the 
next iteration, the value is again `learningRate`

- the "right" formula (to keep the behavior of the original C code) should 
be:
```
val numWordsToProcess = numIterations * trainWordsCount + 1
val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount
val numWordsProcessedInCurrentIteration = numPartitions * wordCount.toDouble
val progress = (numWordsProcessedInPreviousIterations + 
numWordsProcessedInCurrentIteration) / numWordsToProcess
alpha = learningRate * (1 - progress)
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org