[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-187313458

Thanks everybody for the review and help! Cheers!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-187119717

@srowen thanks for merging it. @ygcao thanks for the PR!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10152
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-187098446

Merged to master
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-186546462

At long last, I think this is ready to go. @mengxr any more comments? Or @MLnick?
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185685961

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51483/
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185685958

Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185685501

**[Test build #51483 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51483/consoleFull)** for PR 10152 at commit [`a4abd40`](https://github.com/apache/spark/commit/a4abd40f5a553fa95795e9da5fff0b4ac0256188).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185669428

**[Test build #51483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51483/consoleFull)** for PR 10152 at commit [`a4abd40`](https://github.com/apache/spark/commit/a4abd40f5a553fa95795e9da5fff0b4ac0256188).
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185659555

Jenkins, retest this please
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185036851

Done! Sorry for missing the comments.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r53122840

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala — the hunk replaces the old hand-rolled `Iterator[Array[Int]]` built inside `words.mapPartitions` with the following construction (quoted up to the line the comment is anchored on):

```scala
// each partition is a collection of sentences,
// will be translated into arrays of Index integer
val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
  // Each sentence will map to 0 or more Array[Int]
  sentenceIter.flatMap { sentence => {
    // Sentence of words, some of which map to a word index
    val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    if (wordIndexes.nonEmpty) {
```

--- End diff --

You guys are right; I didn't quite get @mengxr's previous explanation. The splitting function was doing something unexpected to me: splitting an empty String results in a non-empty array. FYI, I verified the logic with the following, more explicit test, which confirms your assertions.

```
scala> sentences
res4: List[List[String]] = List(List(a, b, c), List(b), List(c, d))

scala> dict
res5: scala.collection.immutable.Map[String,Int] = Map(a -> 1, c -> 2)

scala> sentences.flatMap(sen => { val indexes = sen.flatMap(dict.get); indexes.grouped(2).map(_.toArray) })
res6: List[Array[Int]] = List(Array(1, 2), Array(2))

scala> "".split(" ")
res7: Array[String] = Array("")

scala> "".split(" ").size
res8: Int = 1
```
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52975344

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

@ygcao you have kept the if statement here, which I believe both @mengxr and @srowen have shown is not necessary.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-184528879

Addressed the 'final' comment, and checked lint and the test cases. Shall we do the merge then? Thanks!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52867732

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

Agreed - @ygcao if we can make this final change I think we should be good to go.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52867596

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `sentenceIter.flatMap { sentence => {`):

Final (hopefully) style change - the second brace is not required (i.e. it should be `sentenceIter.flatMap { sentence =>`).
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52843218

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

@ygcao I think you're mixing up an empty seq/iterator with a seq/iterator over one "empty" element (like an empty string). We are talking about the former, but your example above is the latter.
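For illustration, a small, self-contained Scala sketch of that distinction; the `dict` vocabulary and all values are made up for this example:

```scala
// Toy sketch: an empty sequence of words vs. a sequence holding one "empty" word.
object EmptyVsEmptyElement extends App {
  val dict = Map("a" -> 1, "c" -> 2) // hypothetical vocabulary

  val noWords: Array[String] = Array.empty // an empty seq of words
  val oneEmptyWord = "".split(" ")          // Array(""): one empty-string element

  println(noWords.grouped(2).size)      // 0 -> grouped over nothing emits no chunks
  println(oneEmptyWord.grouped(2).size) // 1 -> still emits one chunk, Array("")

  // After the vocabulary lookup both collapse to an empty result,
  // so grouped emits nothing and an outer flatMap would drop them entirely.
  println(noWords.flatMap(dict.get).grouped(2).size)      // 0
  println(oneEmptyWord.flatMap(dict.get).grouped(2).size) // 0
}
```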
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52835896

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

They should be equivalent Scala code. We don't need to mix in their meanings. Let us try providing an example of `sentenceIter: Iterator[Array[String]]` and `vocabHash: Map[String, Int]` such that

~~~scala
sentenceIter.flatMap { sentence =>
  val wordIndexes = sentence.flatMap(vocabHash.get)
  wordIndexes.grouped(maxSentenceLength).map(_.toArray)
}
~~~

returns a different result from your code:

~~~scala
sentenceIter.flatMap { sentence =>
  val wordIndexes = sentence.flatMap(vocabHash.get)
  if (wordIndexes.nonEmpty) {
    val sentenceSplit = wordIndexes.grouped(maxSentenceLength)
    sentenceSplit.map(_.toArray)
  } else {
    None
  }
}
~~~

Essentially we are comparing the behavior when `wordIndexes` is an empty `Iterator[Int]`. In this case, no matter what value `maxSentenceLength` takes, `wordIndexes.grouped(maxSentenceLength).map(_.toArray)` returns an empty iterator, which will be skipped by `sentenceIter.flatMap`. So it is the same as `None` being skipped by `sentenceIter.flatMap` in the current implementation.
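To make the equivalence concrete, a minimal sketch along the lines of the example above; `vocabHash`, `maxSentenceLength`, and the sentences are toy stand-ins, not the PR's actual data:

```scala
// Sentences whose words all miss the vocabulary yield an empty wordIndexes;
// grouped then produces an empty iterator, which flatMap silently drops --
// the same effect as returning None under the explicit guard.
object GroupedOnEmpty extends App {
  val vocabHash = Map("a" -> 1, "c" -> 2)
  val maxSentenceLength = 1000

  val sentenceIter = Iterator(Array("a", "b", "c"), Array("x", "y"), Array("b"))

  val sentences = sentenceIter.flatMap { sentence =>
    val wordIndexes = sentence.flatMap(vocabHash.get)
    wordIndexes.grouped(maxSentenceLength).map(_.toArray)
  }

  println(sentences.map(_.toList).toList) // List(List(1, 2)) -- no empty entries
}
```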
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52722338

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

That is because `"".split(" ") = Array("")`, which has nothing to do with `grouped`.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52722467

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `sentenceSplit.map(_.toArray)`):

`sentenceSplit` should be an `Iterator[Array[Int]]`. So this line might be unnecessary.
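Whether that `.map(_.toArray)` is redundant depends on what collection `wordIndexes` actually is, since `grouped` preserves the source collection type; a small sketch with toy values (outside Spark) of the two possibilities:

```scala
// grouped keeps the element collection type: only an Array input already
// yields Iterator[Array[Int]]; a List input yields Iterator[List[Int]].
object GroupedResultType extends App {
  val fromArray: Iterator[Array[Int]] = Array(1, 2, 3, 4, 5).grouped(2)
  val fromList: Iterator[List[Int]] = List(1, 2, 3, 4, 5).grouped(2)

  println(fromArray.map(_.toList).toList) // List(List(1, 2), List(3, 4), List(5))
  println(fromList.toList)                // List(List(1, 2), List(3, 4), List(5))
}
```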
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52773573

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

The purpose of the if statement is that an empty sentence (after lookup) should not result in an empty element. Whether we use grouped or not won't change the fact that it could generate an empty element, which could be harmful and wasteful for later steps. My test shows the if statement is needed to get rid of the empty element.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52773740

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `sentenceSplit.map(_.toArray)`):

Good catch! I'll double check and change.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52823895

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `sentenceSplit.map(_.toArray)`):

Sorry again, we can't do that change. The compiler will complain: the Iterator can't be used for flatMap there, it expects a GenTraversableOnce.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52576421

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `fit` doc line `* @param dataset a RDD of sentences,`):

we should use "an" here.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52575653

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `fit` doc line `* @param dataset a RDD of sentences,`):

That's right, though RDD effectively starts with a vowel sound: arr-dee-dee. A native speaker would certainly say "an RDD", like "an hour". In a similar way, people disagree over "a SQL database" vs "an SQL database", but it's really a disagreement over whether you say "a _sequel_ database" or "an _ess-cyoo-ell_ database".
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52574236

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `fit` doc line `* @param dataset a RDD of sentences,`):

This is an interesting topic: 'r' is not a vowel, and it doesn't seem to sound like a vowel either, so why 'an'? I found this on the web: you use the article "a" before words that start with a consonant sound and "an" before words that start with a vowel sound.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52574384

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala — the `findSynonyms` change that divides each cosine score by `vecNorm` when it is non-zero (quoted up to the line the comment is anchored on):

```scala
var topResults = wordList.zip(cosVec)
  .toSeq
  .sortBy(-_._2)
  .take(num + 1)
  .tail
if (vecNorm != 0.0f) {
  topResults = topResults.map { case (word, cosVec) =>
```

--- End diff --

Good point!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52573698

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

Will an empty iterator make flatMap skip it, just like skipping None?
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52575825

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

Yes, `flatMap` would flatten an empty iterator to nothing.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-183197942

Addressed the new comments, but still kept the if statement, as I explained with the sample code. Reran the tests and lint. Jenkins should still be happy :fireworks:
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52708705

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

Sorry, still not quite sure about this. I did a test, and it turns out I am right :grinning:

```
scala> val sentences = List("test sen 1", "", "testsen 2")
sentences: List[String] = List(test sen 1, "", testsen 2)

scala> val rdd = sc.parallelize(sentences)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> val results = rdd.flatMap(sen => sen.split(" ").grouped(1))
results: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[1] at flatMap at <console>:25

scala> results.collect
res0: Array[Array[String]] = Array(Array(test), Array(sen), Array(1), Array(""), Array(testsen), Array(2))
```

If we don't have the if statement, we'll end up with empty elements like the `Array("")` above, which could cause trouble for the following steps. I'd like to be on the safe side; the if statement is cheap enough.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52590418

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala — the new `maxSentenceLength` member and its setter:

```scala
private var maxSentenceLength = 1000

/**
 * Sets the maximum length of each sentence in the input data.
 * Any sentence longer than this threshold will be divided into chunks of
 * up to `maxSentenceLength` size (default: 1000)
 */
@Since("2.0.0")
def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
```

--- End diff --

The param name comes from the original Google implementation. Either option (or both) works, but I guess I'd be marginally more in favour of amending the first line of doc to read `... maximum length (in words) of each ...`, or something similar.
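For context, a hedged usage sketch of the new setter as described in this thread; the corpus path is hypothetical and the other setters shown are pre-existing Word2Vec options:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

object Word2VecMaxSentenceLengthExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("w2v-example").setMaster("local[2]"))

    // Hypothetical corpus: one sentence per line, whitespace-tokenized.
    val sentences = sc.textFile("data/corpus.txt").map(_.split(" ").toSeq)

    val model = new Word2Vec()
      .setVectorSize(100)
      .setMinCount(5)
      .setMaxSentenceLength(500) // limit is in words; longer sentences are chunked
      .fit(sentences)

    // Assumes "spark" made it into the learned vocabulary.
    model.findSynonyms("spark", 5).foreach(println)
    sc.stop()
  }
}
```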
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52638919

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `setMaxSentenceLength` setter quoted above):

Sounds good.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-182591988

@srowen Thanks! I will make a quick pass.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-182595810

Made one pass, with only minor comments.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52530851

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `setMaxSentenceLength` setter quoted above):

It is not clear from the doc what "sentence length" means, number of words or number of characters. We can either update the doc or change the param name to `maxWordsPerSentence` to make this clear from the name.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52530869

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

This `if ... else` is not necessary. `Seq.empty.grouped` returns an empty iterator.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52530874

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `findSynonyms` hunk quoted above; comment on `topResults = topResults.map { case (word, cosVec) =>`):

`cosVec` shadows the `cosVec` outside. We can rename it to `cos` since it is not a vector.
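A small, self-contained sketch of what the suggested rename would look like; the values are toy stand-ins for `topResults` and `vecNorm`, not the merged code:

```scala
object RenameSketch extends App {
  // Toy stand-ins for the values inside findSynonyms.
  val vecNorm: Float = 2.0f
  val topResults: Seq[(String, Double)] = Seq(("king", 1.8), ("queen", 1.6))

  // Using `cos` for the per-word score avoids shadowing the outer cosine array.
  val normalized =
    if (vecNorm != 0.0f) topResults.map { case (word, cos) => (word, cos / vecNorm) }
    else topResults

  println(normalized) // List((king,0.9), (queen,0.8))
}
```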
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52530860

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `fit` doc line `* @param dataset a RDD of sentences,`):

`a RDD` -> `an RDD`
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-182317367

Agree, I'm ready to merge this. I'll CC @mengxr or @jkbradley in case they want a final comment today.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-182237525

It's getting down to personal taste now~~, but I adopted the suggestion anyway. Personally, I would always let the machine do the formatting and enforce length limits (even adding braces for the if statement, if we want to make that a rule). If we don't like the machine's default way, we can create a template for the Spark project so the machine does what the majority of the Spark community wants (supporting Eclipse is enough; IntelliJ and others can adopt the Eclipse formatter's template). The key is that the machine should be the one doing the 'stupid' and repetitive work for us ;) It seems we could create an issue for Spark about this: create an IDE formatter template for Spark contributors.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181271212

Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52141782

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (an earlier revision of the `sentences` hunk, where the `sentenceIter =>` closure parameter sits on its own line below the comment):

Generally I prefer the style:

```
dataset.mapPartitions { sentenceIter =>
  sentenceIter.flatMap { sentence =>
    ...
```

I don't think it's a hard rule for Spark but most other code follows this convention.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52141876 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -551,12 +553,17 @@ class Word2VecModel private[spark] ( } ind += 1 } -wordList.zip(cosVec) +var topResults = wordList.zip(cosVec) .toSeq - .sortBy(- _._2) + .sortBy(-_._2) .take(num + 1) .tail - .toArray +if (vecNorm != 0.0f) { + topResults = topResults.map { +case (word: String, cosVec: Double) => (word, cosVec / vecNorm) --- End diff -- same here for style, prefer ``` topResults = topResults.map { case (word: String, cosVec: Double) => ... ``` (as long as it fits within line length restriction) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52142152 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -551,12 +553,17 @@ class Word2VecModel private[spark] ( } ind += 1 } -wordList.zip(cosVec) +var topResults = wordList.zip(cosVec) .toSeq - .sortBy(- _._2) + .sortBy(-_._2) .take(num + 1) .tail - .toArray +if (vecNorm != 0.0f) { + topResults = topResults.map { +case (word: String, cosVec: Double) => (word, cosVec / vecNorm) --- End diff -- I also generally write that way; you don't need types here on the case match statement --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
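For reference, the same map from the reviewed diff written the way both reviewers suggest, with the redundant type annotations dropped from the case pattern. This is only a sketch of the snippet under review, assuming `topResults` is the `Seq[(String, Double)]` built just above it:

```
// Same map as in the diff, minus the type annotations in the pattern
if (vecNorm != 0.0f) {
  topResults = topResults.map { case (word, cosVec) => (word, cosVec / vecNorm) }
}
```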
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181253115 Jenkins, retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181257924 **[Test build #50915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50915/consoleFull)** for PR 10152 at commit [`84a0bc4`](https://github.com/apache/spark/commit/84a0bc4c73ca0915f6e6291442c8e60f6d80d879). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52142120 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,26 +301,24 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => - new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext - -def next(): Array[Int] = { - val sentence = ArrayBuilder.make[Int] - var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { - case Some(w) => -sentence += w -sentenceLength += 1 - case None => +// each partition is a collection of sentences, +// will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { + // Each sentence will map to 0 or more Array[Int] + sentenceIter => --- End diff -- This could even be `dataset.mapPartitions { _.flatMap { sentence =>`, which I kind of like, but I don't know how much it matters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
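To make the compact form above concrete, here is a rough sketch of how the sentence-to-index translation could look with the underscore shorthand. Names like `dataset`, `bcVocabHash`, and `maxSentenceLength` are taken from the surrounding diff; this illustrates the style being discussed rather than the exact code that was merged:

```
// Sketch only: translate each sentence into word indexes, splitting long
// sentences into chunks of at most maxSentenceLength
val sentences: RDD[Array[Int]] = dataset.mapPartitions { _.flatMap { sentence =>
  // keep only words that are in the vocabulary, mapping each to its index
  val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
  // break long sentences into chunks of at most maxSentenceLength indexes
  wordIndexes.grouped(maxSentenceLength).map(_.toArray)
}}
```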
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181270729 **[Test build #50915 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50915/consoleFull)** for PR 10152 at commit [`84a0bc4`](https://github.com/apache/spark/commit/84a0bc4c73ca0915f6e6291442c8e60f6d80d879). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181271217 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50915/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181240271 Added braces to make lint happy. Jenkins should be happy now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180820174 **[Test build #50873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50873/consoleFull)** for PR 10152 at commit [`443ec06`](https://github.com/apache/spark/commit/443ec06d8e6381a9df8f69b89fcd4955095e6b6c). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180820176 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180817180 Jenkins, retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180820036 **[Test build #50873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50873/consoleFull)** for PR 10152 at commit [`443ec06`](https://github.com/apache/spark/commit/443ec06d8e6381a9df8f69b89fcd4955095e6b6c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180820177 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50873/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-179858977 @ygcao sorry for the delay. I'm trying to run a few `spark-perf` tests and try larger scale if possible. Will revert ASAP. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180188955 cool! Thanks. It will be helpful to see the large-scale test result. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r51391185 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff --
I finally made up my mind to do a hacky, simple perf test, just as a proof of concept: the perf difference between the different implementations over 5x runs is negligible, since it falls within the run-to-run variance of a single version.
Some details: I prepared a 32k document from two arbitrarily picked Wikipedia pages ("machine learning" and "Adversarial machine learning", excluding the reference sections), which contains 341 lines and can be split into 442 sentences simply by using dot+space as the sentence boundary. I injected the following test case into the Word2VecSuite class and ran it against three different implementations of the fit function in mllib.feature.Word2Vec: the old one in the master branch, and my final two versions from this PR, before and after adopting Sean's suggestion.

    test("testSpeed") {
      val lines = sc.parallelize(Source.fromFile(new File("/home/ygcao/machinelearning.txt")).getLines().toSeq)
      val sentences = lines.flatMap(_.split("\\. ")).map(line => line.split(" ").toSeq)
      println("read file into rdd, lines=", sentences.count())
      var builtModel: org.apache.spark.mllib.feature.Word2VecModel = null
      var duration = 0L
      for (i <- 1 to 5) {
        val start = System.currentTimeMillis()
        val model = new org.apache.spark.mllib.feature.Word2Vec().setVectorSize(3).setSeed(42L)
        builtModel = model.fit(sentences)
        duration += (System.currentTimeMillis() - start)
      }
      println(s"builtModel take ${duration},vocabulary size:${builtModel.getVectors.size}, learning's synonyms:${builtModel.findSynonyms("learning", 4).mkString("\n")}")
    }

The vocabulary size from the model is 155. Here are the timings (ms) for three runs of each version, plus the average of the final two runs. As you can see from the code, each run actually builds the model 5 times to magnify any potential diff.

|                      | masterVersion | PR-useIter | PR-useCollection |
|----------------------|---------------|------------|------------------|
| run1                 | 2232          | 2107       | 1933             |
| run2                 | 2085          | 1986       | 1987             |
| run3                 | 2005          | 2123       | 2004             |
| average(run2, run3)  | 2045          | 2054.5     | 1995.5           |

BTW: the following is not relevant to the perf test, just FYI. The two versions in this pull request produce exactly the same result, which supports the correctness of both. The result is interesting as well, although the dataset is quite tiny: the new versions (the un-merged ones) look better than the old version (the one in the master branch right now). Of course, you can screw the new version up with a bad sentence splitter (we could make a hard-cut splitter that does exactly the same thing as the old version). The simple splitter used for the test case can't deal with abbreviations; that's why I removed the references section from the text. Here are the top synonyms of "learning" using the tiny dataset. Please keep in mind it's just for fun, not solid proof of which is definitely better, since the dataset is tiny.
New versions, learning's synonyms:
(network,0.9990321742854605)
(related,0.9966140511173031)
(sparse,0.9965729586431097)
(algorithms,0.99376379497485)
Old version, learning's synonyms:
(against,0.9895162633562077)
(Support,0.9547255372896342)
(Association,0.9499811242788365)
(Attacks,0.9321700815006693)
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
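As a side note on the measurement itself: if this micro-benchmark were rerun, a warm-up pass and `System.nanoTime` would make the numbers a bit less noisy. The sketch below only illustrates that idea; the `fitOnce` callback is a hypothetical stand-in for building the model and is not part of the PR:

```
// Minimal sketch of a slightly more careful timing loop (assumes a fitOnce: () => Unit callback)
def timeRuns(runs: Int, fitOnce: () => Unit): Seq[Double] = {
  fitOnce() // warm-up run, excluded from the measurements
  (1 to runs).map { _ =>
    val start = System.nanoTime()
    fitOnce()
    (System.nanoTime() - start) / 1e6 // elapsed milliseconds
  }
}
```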
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-177855098 FYI: did a small-scale perf test and also checked logic correctness. Please see my comment on the review diff for details of the experiment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-177429666 Adopted all of Sean's suggestions, with minor edits to address a missed edge case and make the compiler happy. I think these suggestions are good ones, thanks! Please trigger the QA build to see whether everything goes fine, and potentially do a perf test to see whether there are big differences. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r51361374 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- Oops I keep imagining there is a flatMapPartitions. mapPartitions and flatMap. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r51356725 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- Sorry, I can't do much about the spark-perf thing; I still haven't had time to figure out what's blocking me from running the whole test suite locally. I (and my compiler) am also not aware of a flatMapPartitions function, but I did make a version that should do whatever Sean suggested. Please review and help run the perf test for the two latest versions. BTW: I don't worry much about the perf difference since it shouldn't be much; even minutes (not very likely) of difference per partition would only mean tens of minutes of overall penalty, which shouldn't matter much for a job that runs for hours. Of course, seeing the data is more convincing. My point is that if the diff is minor, we'd better optimize for readability. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r51088952 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- Agreed, that is more succinct. @ygcao can you look into making that change? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r51088866 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -556,6 +571,7 @@ class Word2VecModel private[spark] ( .sortBy(- _._2) .take(num + 1) .tail + .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm))) --- End diff -- @ygcao please change this line to `.map { case (word, cosVec) => (word, if (vecNorm == 0.0) 0.0 else (cosVec / vecNorm)) }` as per discussion above. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50961556 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- Good point; I'm not sure if it was for efficiency or simply to copy the C++ implementation. Hm, I don't suppose we have benchmarks here. I'm OK with keeping the longer more efficient implementation. but now that input sentence boundaries matter, this can still be simplified. This part can still be used: ``` // Each input sentence will produce 1 or more Array[Int], so flatMapPartitions dataset.flatMapPartitions { sentenceIter => // Each sentence will map to 1 or more Array[Int], so map sentenceIter.map { sentence => ... } } ``` That is you can still `flatMapPartitions` over `Iterator`s for each sentence to join them without trying to manage both the `Iterator` over sentences and words. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-174446659 @ygcao this has not been merged. You can see the PR is open and there is no message about merging into master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50671686 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -556,6 +571,7 @@ class Word2VecModel private[spark] ( .sortBy(- _._2) .take(num + 1) .tail + .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm))) --- End diff -- @srowen if either of the vector norms are 0, the cosine similarity should be 0. Though by definition I think here `v._2` will be 0 if the norm is 0, what do you think about making this `if vecNorm == 0) 0 ...` to be explicit? (follows what you did in https://github.com/apache/spark/commit/94b39fecff3794727c186bd681fa4c6af4fd#diff-88f4b62c382b26ef8e856b23f5167ccdR542) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
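For context, the convention being discussed (cosine similarity defined as 0 when either vector has zero norm) can be written as a tiny helper. This is only a sketch of the definition on plain arrays, not code from the PR:

```
// Cosine similarity, defined as 0.0 when either vector has zero norm
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}
```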
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-174452391 As Sean says this hasn't been merged yet, I was waiting for the latest test build to pass before making a final pass over this. Ideally I'd like to just get one of @srowen @jkbradley @mengxr to give this a pass through too before merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50676720 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -556,6 +571,7 @@ class Word2VecModel private[spark] ( .sortBy(- _._2) .take(num + 1) .tail + .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm))) --- End diff -- I agree that we should define cosine similarity with a zero vector to be 0. In this case the results are pretty meaningless anyway, since the dot product was already 0 for everything, and so the top N are random. I'd say:
```
.map { case (word, cosVec) => (word, if (vecNorm == 0.0) 0.0 else (cosVec / vecNorm)) }
```
For more efficiency we could apply this map only outside the corner case that `vecNorm` is 0.0; not sure if it's simpler but it's a little less work:
```
var result = wordList...tail
if (vecNorm != 0.0) {
  result = result.map { case (word, cosVec) => (word, cosVec / vecNorm) }
}
result.toArray
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50675972 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- I understand that this part of the change intends to respect the implied sentence boundaries in the input. I think it can be simpler? One input sentence maps to 1 or more arrays, and the result should be flattened. Something like?
```
// Each input sentence will produce 1 or more Array[Int], so flatMapPartitions
dataset.flatMapPartitions { sentenceIter =>
  // Each sentence will map to 1 or more Array[Int], so map
  sentenceIter.map { sentence =>
    // Sentence of words, some of which map to a hash, so flatMap
    val hashes = sentence.flatMap(bcVocabHash.value.get)
    // break into sequence of at most maxSentenceLength
    hashes.grouped(maxSentenceLength).map(_.toArray)
  }
}
```
I haven't tested it but does that seem like the intent? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
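The `grouped` call is doing the chunking work in the sketch above. As a quick illustration of its semantics from the Scala standard library (not PR code):

```
// grouped(n) yields chunks of at most n elements; only the last chunk may be shorter
val chunks = List(1, 2, 3, 4, 5, 6, 7).grouped(3).toList
// chunks == List(List(1, 2, 3), List(4, 5, 6), List(7))
```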
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50683565 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- I also looked at a roughly similar approach. I assumed that the initial formulation was done for efficiency reasons (while loop etc) - hence why I didn't suggest it or explore it much further - but we'd have to get the original author to weigh in on that. I agree your approach is simpler and more succinct. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50684006 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -556,6 +571,7 @@ class Word2VecModel private[spark] ( .sortBy(- _._2) .take(num + 1) .tail + .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm))) --- End diff -- Strictly speaking we could throw an error or warning if passing in a zero vector as it's meaningless as you say. Still, I think in real usage that's unlikely, so I'd just go for the first option. I don't think there are real world efficiency gains to be had with the second option. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-174413771 Thanks for the merge! It looks successful, but it seems the change is still not appearing in master. What's the rest of the workflow for it to appear in the master branch of Spark? Is any follow-up needed from me? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-173607998 **[Test build #2431 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2431/consoleFull)** for PR 10152 at commit [`e938208`](https://github.com/apache/spark/commit/e938208d9c85515f62b41635a8445b8ab31f55f2). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-173621659 **[Test build #2431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2431/consoleFull)** for PR 10152 at commit [`e938208`](https://github.com/apache/spark/commit/e938208d9c85515f62b41635a8445b8ab31f55f2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172787971 **[Test build #2407 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2407/consoleFull)** for PR 10152 at commit [`141d7a2`](https://github.com/apache/spark/commit/141d7a2bff8c8e462a9a2aff33b1f16bebc30b25). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172777566 PySpark's test case fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172778827 **[Test build #2407 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2407/consoleFull)** for PR 10152 at commit [`141d7a2`](https://github.com/apache/spark/commit/141d7a2bff8c8e462a9a2aff33b1f16bebc30b25). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-173094302 Adjusted the Python doctest format. Did the heavy job of running the PySpark tests locally; it should be OK now. Finished test(python): pyspark.ml.feature (35s) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172468729 Ouch, we finally decided to make a backward-incompatible change for synonyms~~ and that caused a test case failure. I adjusted the expected values according to our new logic: always return normalized similarity values from the findSynonyms function. I also added defensive code for a potential divide-by-zero, following @srowen's hot fix for a similar situation. I have 90% confidence that these changes are enough to pass the tests. The reason I can't be 100% sure is that Spark SQL can't compile and the run_tests script is not working in my local env. I hard-coded the expected values, but I verified them against the output both before and after the change.
FYI: the compile issue on the master branch is in Spark SQL, not relevant to our changes, but it is a blocker somebody should look into.
[error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala:311: reference to Rollup is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.sql.catalyst.plans.logical._
[error] and import org.apache.spark.sql.catalyst.expressions._
[error] Seq(Rollup(children.map(nodeToExpr))),
[error] ^
[error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala:319: reference to Cube is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.sql.catalyst.plans.logical._
[error] and import org.apache.spark.sql.catalyst.expressions._
[error] Seq(Cube(children.map(nodeToExpr))),
[error] ^
[error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:191: reference to Rollup is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.sql.catalyst.plans.logical._
[error] and import org.apache.spark.sql.catalyst.expressions._
[error] def bitmasks(r: Rollup): Seq[Int] = {
[error] ^
[error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:204: reference to Cube is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.sql.catalyst.plans.logical._
[error] and import org.apache.spark.sql.catalyst.expressions._
[error] def bitmasks(c: Cube): Seq[Int] = {
[error] ^
[error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:210: reference to Cube is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.sql.catalyst.plans.logical._
[error] and import org.apache.spark.sql.catalyst.expressions._
[error] case Aggregate(Seq(c @ Cube(groupByExprs)), aggregateExpressions, child) =>
[error]
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
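Not related to this PR, but for anyone hitting the same compile error: one common way to resolve that kind of ambiguity is to hide the clashing names on one of the wildcard imports. This is only a generic sketch of the technique, not a claim about how the Spark SQL issue was actually fixed:

```
// Hiding the clashing names on one wildcard import removes the ambiguity
import org.apache.spark.sql.catalyst.expressions.{Cube => _, Rollup => _, _}
import org.apache.spark.sql.catalyst.plans.logical._
```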
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172480884 **[Test build #2394 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2394/consoleFull)** for PR 10152 at commit [`76e8266`](https://github.com/apache/spark/commit/76e82667bd04bc359a336c7538d2595aade8384e). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172489168 **[Test build #2394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2394/consoleFull)** for PR 10152 at commit [`76e8266`](https://github.com/apache/spark/commit/76e82667bd04bc359a336c7538d2595aade8384e). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-170854032 @ygcao sorry for the delay, I was on vacation and not checking email regularly. I'm pretty much happy with this, but I'd like to just get one of @srowen @jkbradley @mengxr to give it a final review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-170927597 @ygcao the changes have impacted the tests - could you take a look at the failure? I think we may need to update the test suite. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-170885940 **[Test build #2372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2372/consoleFull)** for PR 10152 at commit [`214d0d9`](https://github.com/apache/spark/commit/214d0d9d3350a624f98f36aff2393b7b61fd6176). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-170878810 **[Test build #2372 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2372/consoleFull)** for PR 10152 at commit [`214d0d9`](https://github.com/apache/spark/commit/214d0d9d3350a624f98f36aff2393b7b61fd6176). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-168907161 @MLnick Happy new year. I think I've addressed all your comments last year; could you help do the merge this year? If anyone still has other concerns, please let me know. An unfinished pull request without a good reason would be a huge waste. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r48229609 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -77,6 +77,18 @@ class Word2Vec extends Serializable with Logging { private var numIterations = 1 private var seed = Utils.random.nextLong() private var minCount = 5 + private var maxSentenceLength = 1000 + + /** + * Sets the maximum length of each sentence in the input data. + * Any sentence longer than this threshold will be divided into chunks of + * up to `maxSentenceLength` size (default: 1000) + */ + @Since("1.6.1") --- End diff -- @ygcao actually could we target `2.0.0` instead of `1.6.1` in this PR, since it's against master? We'll consider backporting after it is merged. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r48324725 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -77,6 +77,18 @@ class Word2Vec extends Serializable with Logging { private var numIterations = 1 private var seed = Utils.random.nextLong() private var minCount = 5 + private var maxSentenceLength = 1000 + + /** + * Sets the maximum length of each sentence in the input data. + * Any sentence longer than this threshold will be divided into chunks of + * up to `maxSentenceLength` size (default: 1000) + */ + @Since("1.6.1") --- End diff -- Done! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r48324747

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -77,6 +77,18 @@ class Word2Vec extends Serializable with Logging {
   private var numIterations = 1
   private var seed = Utils.random.nextLong()
   private var minCount = 5
+  private var maxSentenceLength = 1000
+
+  /**
+   * Sets the maximum length of each sentence in the input data.
+   * Any sentence longer than this threshold will be divided into chunks of
+   * up to `maxSentenceLength` size (default: 1000)
+   */
+  @Since("1.6.1")
--- End diff --

Done!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-166544099 Modified the comment accordingly.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-166215753 Pinging @jkbradley @mengxr @MechCoder again for a final review - could you give this a look and confirm you're in agreement with my comments above? Also, thoughts on whether this should target `1.6.1`, since it is actually a fairly major yet subtle bug in the implementation - or whether it should even be backported to `1.5.3`?
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r48119953

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -77,6 +77,20 @@ class Word2Vec extends Serializable with Logging {
   private var numIterations = 1
   private var seed = Utils.random.nextLong()
   private var minCount = 5
+  private var maxSentenceLength = 1000
+
+  /**
+   * sets the maxSentenceLength, maxSentenceLength is used as the threshold for cutting sentence
--- End diff --

One final thing - can you address the comment above? And I think we can actually remove the `@param` and `@return` to match the comments for the other setters in this class.
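For comparison, the existing setters in this class carry a single descriptive Scaladoc sentence with no `@param`/`@return` tags; a representative sketch of that style (paraphrased, not quoted from this diff) is:

```scala
/**
 * Sets minCount, the minimum number of times a token must appear
 * to be included in the word2vec model's vocabulary (default: 5).
 */
def setMinCount(minCount: Int): this.type = {
  this.minCount = minCount
  this
}
```

The new `setMaxSentenceLength` comment would be trimmed to the same shape.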
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-166215472 @ygcao just one final comment on the `setMaxSentenceLength` setter comment to address, thanks!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-165959134 Thanks @MLnick! Added the @Since annotation in the code accordingly, and updated the issue to change its type to bug and keep it consistent with what we finalized for the change.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r47875988

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -281,17 +295,28 @@ class Word2Vec extends Serializable with Logging {
     val expTable = sc.broadcast(createExpTable())
     val bcVocab = sc.broadcast(vocab)
     val bcVocabHash = sc.broadcast(vocabHash)
-
-    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
+    // each partition is a collection of sentences, will be translated into arrays of Index integer
+    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
       new Iterator[Array[Int]] {
-        def hasNext: Boolean = iter.hasNext
+        var wordIter: Iterator[String] = null
+
+        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)

         def next(): Array[Int] = {
           val sentence = ArrayBuilder.make[Int]
           var sentenceLength = 0
-          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
-            val word = bcVocabHash.value.get(iter.next())
-            word match {
+          // do translation of each word into its index in the vocabulary,
+          // do cutting only when the sentence is larger than maxSentenceLength
+          if ((wordIter == null || !wordIter.hasNext) && sentenceIter.hasNext) {
+            do {
--- End diff --

ok, fair enough
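As a rough, self-contained illustration of the behavior this diff is after - sentence boundaries are respected, and only sentences longer than the threshold are cut - the chunking can be sketched as below. The names (`vocabHash`, `maxSentenceLength`) mirror the diff, but this is a simplified stand-in, not the PR's stateful-iterator implementation:

```scala
// Translate each sentence into vocabulary indices, dropping out-of-vocabulary
// words, then split any sentence longer than maxSentenceLength into chunks.
def toIndexChunks(
    sentences: Iterator[Iterable[String]],
    vocabHash: Map[String, Int],
    maxSentenceLength: Int): Iterator[Array[Int]] = {
  sentences.flatMap { sentence =>
    val indices = sentence.iterator.flatMap(word => vocabHash.get(word)).toArray
    indices.grouped(maxSentenceLength) // cuts only when the sentence exceeds the threshold
  }
}
```

The key difference from the old code is that a chunk never spans two input sentences, so word2vec context windows no longer cross sentence boundaries.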
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r47876214

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -77,6 +77,20 @@ class Word2Vec extends Serializable with Logging {
   private var numIterations = 1
   private var seed = Utils.random.nextLong()
   private var minCount = 5
+  private var maxSentenceLength = 1000
+
+  /**
+   * sets the maxSentenceLength, maxSentenceLength is used as the threshold for cutting sentence
+   * into chunks when it is too long. (default: 1000)
+   * @param maxSentenceLength the maxSentenceLength allowed.
+   *                          recommend to set it large enough to respect reasonable long sentences
+   *                          while not overflow memory
+   * @return this object
+   */
+  def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
--- End diff --

I think we should target `1.6.1` here: `@Since("1.6.1")` - overall I'd view this PR as a bugfix (though adding the parameter is a minor extra feature). I think we'd want to include this in branch-1.6, and possibly even think about backporting the core changes to branch-1.5
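For context on how a caller would use the new knob, a minimal usage sketch is below; the tokenized-sentence RDD and the chosen parameter values are hypothetical, and only `setMaxSentenceLength` is the setter added by this PR:

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

// `sentences` is a hypothetical RDD of already-tokenized sentences.
def train(sentences: RDD[Seq[String]]): Word2VecModel = {
  new Word2Vec()
    .setVectorSize(100)
    .setMinCount(5)
    .setMaxSentenceLength(2000) // sentences longer than 2000 words get split into chunks
    .fit(sentences)
}
```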