[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-10-07 Thread nzw0301
Github user nzw0301 commented on the issue:

https://github.com/apache/spark/pull/19372
  
Thank you for your kind reviews!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19372: [SPARK-22156][MLLIB] Fix update equation of learn...

2017-10-03 Thread nzw0301
Github user nzw0301 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19372#discussion_r142332793
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -368,11 +371,12 @@ class Word2Vec extends Serializable with Logging {
 var wc = wordCount
 if (wordCount - lastWordCount > 1) {
   lwc = wordCount
-  // TODO: discount by iteration?
-  alpha =
-learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))
+  alpha = learningRate *
+(1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) /
+  totalWordsCounts)
   if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
-  logInfo("wordCount = " + wordCount + ", alpha = " + alpha)
+  logInfo("wordCount = " + (wordCount + numWordsProcessedInPreviousIterations) +
--- End diff --

@srowen Done.





[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-10-02 Thread nzw0301
Github user nzw0301 commented on the issue:

https://github.com/apache/spark/pull/19372
  
I updated the results of the word2vec example in the first comment to reflect this PR.





[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-10-02 Thread nzw0301
Github user nzw0301 commented on the issue:

https://github.com/apache/spark/pull/19372
  
Thank you for your reviews, @LowikC.

Like this?

```scala
val totalWordsCounts = numIterations * trainWordsCount + 1
val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount

alpha = learningRate *
  (1 - (numPartitions * wordCount.toDouble + 
numWordsProcessedInPreviousIterations) /
totalWordsCounts)
if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
logInfo("wordCount = " + (wordCount + 
numWordsProcessedInPreviousIterations) +
  ", alpha = " + alpha)
```
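The snippet above can be read as a pure function of the training progress. The following is a minimal sketch under that reading, not the code merged into Spark: the object name `LearningRateSchedule` and the method `alpha` are hypothetical, while the parameter names mirror the variables in the snippet. It assumes `wordCount` is the number of words one partition has processed in the current iteration `k` (1-based), so that `numPartitions * wordCount` approximates the global count for that iteration.

```scala
// Hypothetical standalone helper; not part of Word2Vec.scala.
object LearningRateSchedule {
  def alpha(
      learningRate: Double,
      numPartitions: Int,
      wordCount: Long,       // words seen so far by one partition in iteration k (assumption)
      k: Int,                // 1-based iteration index
      numIterations: Int,
      trainWordsCount: Long): Double = {
    // Total number of words over all iterations (+1 avoids division by zero).
    val totalWordsCounts = numIterations * trainWordsCount + 1
    // Words already consumed by iterations 1 .. k-1.
    val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount
    val decayed = learningRate *
      (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) /
        totalWordsCounts)
    // Same floor as in the snippet: never drop below learningRate * 0.0001.
    math.max(decayed, learningRate * 0.0001)
  }
}
```

With this form the rate starts at `learningRate` when `k = 1` and `wordCount = 0`, keeps decreasing across iterations instead of restarting at each one, and is clamped once the estimated global count exceeds `totalWordsCounts`.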






[GitHub] spark pull request #19372: [SPARK-22156][MLLIB] Fix update equation of learn...

2017-09-28 Thread nzw0301
Github user nzw0301 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19372#discussion_r141609260
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -368,9 +368,9 @@ class Word2Vec extends Serializable with Logging {
 var wc = wordCount
 if (wordCount - lastWordCount > 1) {
   lwc = wordCount
-  alpha =
-learningRate *
-  (1 - numPartitions * wordCount.toDouble / (numIterations * trainWordsCount + 1))
+  alpha = learningRate *
+(1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount /
--- End diff --

oh... Thanks! I fixed it





[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-09-28 Thread nzw0301
Github user nzw0301 commented on the issue:

https://github.com/apache/spark/pull/19372
  
Thank you for your comment, @LowikC.
You are right, my PR code is incorrect.

The correct update formula is:

```scala
alpha = learningRate *
  (1 - (numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount) /
    (numIterations * trainWordsCount + 1))
```
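Written as a formula, the schedule this comment is aiming for (matching the parenthesised form in the later revision of this PR) is sketched below. The symbols are shorthand introduced here, not names from the code: $\eta$ is `learningRate`, $P$ is `numPartitions`, $w$ is `wordCount`, $T$ is `trainWordsCount`, $N$ is `numIterations`, and $k$ is the 1-based iteration index.

```latex
\alpha \;=\; \eta \left( 1 - \frac{P \cdot w + (k - 1)\, T}{N \cdot T + 1} \right)
```

The whole sum $P \cdot w + (k - 1)\,T$ is the numerator. Without parentheses around it, operator precedence divides only $(k - 1)\,T$ by the denominator and subtracts $P \cdot w$ from 1 directly, which is a different quantity.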






[GitHub] spark issue #19372: [MLLIB] Fix update equation of learning rate in Word2Vec...

2017-09-28 Thread nzw0301
Github user nzw0301 commented on the issue:

https://github.com/apache/spark/pull/19372
  
Thank you for your comment, @srowen.
I'll create an issue on JIRA.





[GitHub] spark pull request #19372: [MLLIB] Fix update equation of learning rate in W...

2017-09-27 Thread nzw0301
GitHub user nzw0301 opened a pull request:

https://github.com/apache/spark/pull/19372

[MLLIB] Fix update equation of learning rate in Word2Vec.scala

## What changes were proposed in this pull request?

The current learning-rate equation is incorrect when `numIterations` > `1`.
This PR is based on the [original C code](https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L393).

cc: @mengxr

## How was this patch tested?

Manual tests.

I modified [this example code](https://spark.apache.org/docs/2.1.1/mllib-feature-extraction.html#example).

### `numIterations=1`

#### Code

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => 
line.split(" ").toSeq)

val word2vec = new Word2Vec()

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
```

#### Result

```
0 0.3267880082130432
2 0.21420614421367645
3 0.19923636317253113
9 0.1063166931271553
4 0.0397246889770031
```

### `numIterations=5`

#### Code

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => 
line.split(" ").toSeq)

val word2vec = new Word2Vec()
word2vec.setNumIterations(5)

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
```

#### Result

```
2 0.9803512096405029
0 0.9774332642555237
3 0.9450059533119202
4 0.9394038319587708
9 -0.7876168489456177
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nzw0301/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19372.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19372


commit e2a7d393e141405f658a68f99bc4a1f53816db95
Author: Kento NOZAWA <k_...@klis.tsukuba.ac.jp>
Date:   2017-09-27T17:04:03Z

Update equation of lr



