[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-187313458

Thanks everybody for the review and help! Cheers!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-187119717

@srowen thanks for merging it. @ygcao thanks for the PR!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10152
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-187098446

Merged to master
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-186546462

At long last, I think this is ready to go. @mengxr any more comments? Or @MLnick?
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185685961

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51483/
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185685958

Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185685501

**[Test build #51483 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51483/consoleFull)** for PR 10152 at commit [`a4abd40`](https://github.com/apache/spark/commit/a4abd40f5a553fa95795e9da5fff0b4ac0256188).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185669428

**[Test build #51483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51483/consoleFull)** for PR 10152 at commit [`a4abd40`](https://github.com/apache/spark/commit/a4abd40f5a553fa95795e9da5fff0b4ac0256188).
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185659555

Jenkins, retest this please
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-185036851

Done! Sorry for missing the comments.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r53122840

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala — the hunk replaces the old hand-rolled `Iterator[Array[Int]]` built inside `words.mapPartitions` with the following construction (quoted up to the line the comment is anchored on):

```scala
// each partition is a collection of sentences,
// will be translated into arrays of Index integer
val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
  // Each sentence will map to 0 or more Array[Int]
  sentenceIter.flatMap { sentence => {
    // Sentence of words, some of which map to a word index
    val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    if (wordIndexes.nonEmpty) {
```

--- End diff --

You guys are right; I didn't quite get @mengxr's previous explanation. The splitting function was doing something unexpected to me: splitting an empty String results in a non-empty array. FYI, I verified the logic with the following, more explicit test, which confirms your assertions.

```
scala> sentences
res4: List[List[String]] = List(List(a, b, c), List(b), List(c, d))

scala> dict
res5: scala.collection.immutable.Map[String,Int] = Map(a -> 1, c -> 2)

scala> sentences.flatMap(sen => { val indexes = sen.flatMap(dict.get); indexes.grouped(2).map(_.toArray) })
res6: List[Array[Int]] = List(Array(1, 2), Array(2))

scala> "".split(" ")
res7: Array[String] = Array("")

scala> "".split(" ").size
res8: Int = 1
```
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52975344

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

@ygcao you have kept the if statement here, which I believe both @mengxr and @srowen have shown is not necessary.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-184528879

Addressed the 'final' comment, and checked lint and the test cases. Shall we do the merge then? Thanks!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52867732

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

Agreed - @ygcao if we can make this final change I think we should be good to go.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52867596

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `sentenceIter.flatMap { sentence => {`):

Final (hopefully) style change - the second brace is not required (i.e. it should be `sentenceIter.flatMap { sentence =>`).
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52843218

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

@ygcao I think you're mixing up an empty seq/iterator with a seq/iterator over one "empty" element (like an empty string). We are talking about the former, but your example above is the latter.
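For illustration, a small, self-contained Scala sketch of that distinction; the `dict` vocabulary and all values are made up for this example:

```scala
// Toy sketch: an empty sequence of words vs. a sequence holding one "empty" word.
object EmptyVsEmptyElement extends App {
  val dict = Map("a" -> 1, "c" -> 2) // hypothetical vocabulary

  val noWords: Array[String] = Array.empty // an empty seq of words
  val oneEmptyWord = "".split(" ")          // Array(""): one empty-string element

  println(noWords.grouped(2).size)      // 0 -> grouped over nothing emits no chunks
  println(oneEmptyWord.grouped(2).size) // 1 -> still emits one chunk, Array("")

  // After the vocabulary lookup both collapse to an empty result,
  // so grouped emits nothing and an outer flatMap would drop them entirely.
  println(noWords.flatMap(dict.get).grouped(2).size)      // 0
  println(oneEmptyWord.flatMap(dict.get).grouped(2).size) // 0
}
```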
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52835896

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

They should be equivalent Scala code. We don't need to mix in their meanings. Let us try providing an example of `sentenceIter: Iterator[Array[String]]` and `vocabHash: Map[String, Int]` such that

~~~scala
sentenceIter.flatMap { sentence =>
  val wordIndexes = sentence.flatMap(vocabHash.get)
  wordIndexes.grouped(maxSentenceLength).map(_.toArray)
}
~~~

returns a different result from your code:

~~~scala
sentenceIter.flatMap { sentence =>
  val wordIndexes = sentence.flatMap(vocabHash.get)
  if (wordIndexes.nonEmpty) {
    val sentenceSplit = wordIndexes.grouped(maxSentenceLength)
    sentenceSplit.map(_.toArray)
  } else {
    None
  }
}
~~~

Essentially we are comparing the behavior when `wordIndexes` is an empty `Iterator[Int]`. In this case, no matter what value `maxSentenceLength` takes, `wordIndexes.grouped(maxSentenceLength).map(_.toArray)` returns an empty iterator, which will be skipped by `sentenceIter.flatMap`. So it is the same as `None` being skipped by `sentenceIter.flatMap` in the current implementation.
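To make the equivalence concrete, a minimal sketch along the lines of the example above; `vocabHash`, `maxSentenceLength`, and the sentences are toy stand-ins, not the PR's actual data:

```scala
// Sentences whose words all miss the vocabulary yield an empty wordIndexes;
// grouped then produces an empty iterator, which flatMap silently drops --
// the same effect as returning None under the explicit guard.
object GroupedOnEmpty extends App {
  val vocabHash = Map("a" -> 1, "c" -> 2)
  val maxSentenceLength = 1000

  val sentenceIter = Iterator(Array("a", "b", "c"), Array("x", "y"), Array("b"))

  val sentences = sentenceIter.flatMap { sentence =>
    val wordIndexes = sentence.flatMap(vocabHash.get)
    wordIndexes.grouped(maxSentenceLength).map(_.toArray)
  }

  println(sentences.map(_.toList).toList) // List(List(1, 2)) -- no empty entries
}
```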
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52722338

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

That is because `"".split(" ") = Array("")`, which has nothing to do with `grouped`.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52722467

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `sentenceSplit.map(_.toArray)`):

`sentenceSplit` should be an `Iterator[Array[Int]]`. So this line might be unnecessary.
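Whether that `.map(_.toArray)` is redundant depends on what collection `wordIndexes` actually is, since `grouped` preserves the source collection type; a small sketch with toy values (outside Spark) of the two possibilities:

```scala
// grouped keeps the element collection type: only an Array input already
// yields Iterator[Array[Int]]; a List input yields Iterator[List[Int]].
object GroupedResultType extends App {
  val fromArray: Iterator[Array[Int]] = Array(1, 2, 3, 4, 5).grouped(2)
  val fromList: Iterator[List[Int]] = List(1, 2, 3, 4, 5).grouped(2)

  println(fromArray.map(_.toList).toList) // List(List(1, 2), List(3, 4), List(5))
  println(fromList.toList)                // List(List(1, 2), List(3, 4), List(5))
}
```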
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52773573

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

The purpose of the if statement is that an empty sentence (after lookup) should not result in an empty element. Whether we use grouped or not won't change the fact that it could generate an empty element, which could be harmful and wasteful for later steps. My test shows the if statement is needed to get rid of the empty element.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52773740

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `sentenceSplit.map(_.toArray)`):

Good catch! I'll double check and change.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52823895

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `sentenceSplit.map(_.toArray)`):

Sorry again, we can't do that change. The compiler will complain: the Iterator can't be used for flatMap there, it expects a GenTraversableOnce.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52576421

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `fit` doc line `* @param dataset a RDD of sentences,`):

we should use "an" here.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52575653

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `fit` doc line `* @param dataset a RDD of sentences,`):

That's right, though RDD effectively starts with a vowel sound: arr-dee-dee. A native speaker would certainly say "an RDD", like "an hour". In a similar way, people disagree over "a SQL database" vs "an SQL database", but it's really a disagreement over whether you say "a _sequel_ database" or "an _ess-cyoo-ell_ database".
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52574236

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `fit` doc line `* @param dataset a RDD of sentences,`):

This is an interesting topic: 'r' is not a vowel, and it doesn't seem to sound like a vowel either, so why 'an'? I found this on the web: you use the article "a" before words that start with a consonant sound and "an" before words that start with a vowel sound.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52574384

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala — the `findSynonyms` change that divides each cosine score by `vecNorm` when it is non-zero (quoted up to the line the comment is anchored on):

```scala
var topResults = wordList.zip(cosVec)
  .toSeq
  .sortBy(-_._2)
  .take(num + 1)
  .tail
if (vecNorm != 0.0f) {
  topResults = topResults.map { case (word, cosVec) =>
```

--- End diff --

Good point!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52573698

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

Will an empty iterator make flatMap skip it, just like skipping None?
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52575825

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

Yes, `flatMap` would flatten an empty iterator to nothing.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-183197942

Addressed the new comments, but still kept the if statement, as I explained with the sample code. Reran the tests and lint. Jenkins should still be happy :fireworks:
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52708705

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

Sorry, still not quite sure about this. I did a test, and it turns out I am right :grinning:

```
scala> val sentences = List("test sen 1", "", "testsen 2")
sentences: List[String] = List(test sen 1, "", testsen 2)

scala> val rdd = sc.parallelize(sentences)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> val results = rdd.flatMap(sen => sen.split(" ").grouped(1))
results: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[1] at flatMap at <console>:25

scala> results.collect
res0: Array[Array[String]] = Array(Array(test), Array(sen), Array(1), Array(""), Array(testsen), Array(2))
```

If we don't have the if statement, we'll end up with empty elements like the `Array("")` above, which could cause trouble for the following steps. I'd like to be on the safe side; the if statement is cheap enough.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52590418

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala — the new `maxSentenceLength` member and its setter:

```scala
private var maxSentenceLength = 1000

/**
 * Sets the maximum length of each sentence in the input data.
 * Any sentence longer than this threshold will be divided into chunks of
 * up to `maxSentenceLength` size (default: 1000)
 */
@Since("2.0.0")
def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
```

--- End diff --

The param name comes from the original Google implementation. Either option (or both) works, but I guess I'd be marginally more in favour of amending the first line of doc to read `... maximum length (in words) of each ...`, or something similar.
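For context, a hedged usage sketch of the new setter as described in this thread; the corpus path is hypothetical and the other setters shown are pre-existing Word2Vec options:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

object Word2VecMaxSentenceLengthExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("w2v-example").setMaster("local[2]"))

    // Hypothetical corpus: one sentence per line, whitespace-tokenized.
    val sentences = sc.textFile("data/corpus.txt").map(_.split(" ").toSeq)

    val model = new Word2Vec()
      .setVectorSize(100)
      .setMinCount(5)
      .setMaxSentenceLength(500) // limit is in words; longer sentences are chunked
      .fit(sentences)

    // Assumes "spark" made it into the learned vocabulary.
    model.findSynonyms("spark", 5).foreach(println)
    sc.stop()
  }
}
```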
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52638919

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `setMaxSentenceLength` setter quoted above):

Sounds good.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-182591988

@srowen Thanks! I will make a quick pass.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-182595810

Made one pass, with only minor comments.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52530851

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `setMaxSentenceLength` setter quoted above):

It is not clear from the doc what "sentence length" means, number of words or number of characters. We can either update the doc or change the param name to `maxWordsPerSentence` to make this clear from the name.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52530869

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (same `sentences` hunk as quoted above; comment on `if (wordIndexes.nonEmpty) {`):

This `if ... else` is not necessary. `Seq.empty.grouped` returns an empty iterator.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52530874

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `findSynonyms` hunk quoted above; comment on `topResults = topResults.map { case (word, cosVec) =>`):

`cosVec` shadows the `cosVec` outside. We can rename it to `cos` since it is not a vector.
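A small, self-contained sketch of what the suggested rename would look like; the values are toy stand-ins for `topResults` and `vecNorm`, not the merged code:

```scala
object RenameSketch extends App {
  // Toy stand-ins for the values inside findSynonyms.
  val vecNorm: Float = 2.0f
  val topResults: Seq[(String, Double)] = Seq(("king", 1.8), ("queen", 1.6))

  // Using `cos` for the per-word score avoids shadowing the outer cosine array.
  val normalized =
    if (vecNorm != 0.0f) topResults.map { case (word, cos) => (word, cos / vecNorm) }
    else topResults

  println(normalized) // List((king,0.9), (queen,0.8))
}
```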
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52530860

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (the `fit` doc line `* @param dataset a RDD of sentences,`):

`a RDD` -> `an RDD`
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-182317367

Agree, I'm ready to merge this. I'll CC @mengxr or @jkbradley in case they want a final comment today.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-182237525

It's getting down to personal taste now~~, but I adopted the suggestion anyway. Personally, I would always let the machine do the formatting and enforce length limits (even adding braces for the if statement, if we want to make that a rule). If we don't like the machine's default way, we can create a template for the Spark project so the machine does what the majority of the Spark community wants (supporting Eclipse is enough; IntelliJ and others can adopt the Eclipse formatter's template). The key is that the machine should be the one doing the 'stupid' and repetitive work for us ;) It seems we could create an issue for Spark about this: create an IDE formatter template for Spark contributors.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181271212

Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52141782

Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (an earlier revision of the `sentences` hunk, where the `sentenceIter =>` closure parameter sits on its own line below the comment):

Generally I prefer the style:

```
dataset.mapPartitions { sentenceIter =>
  sentenceIter.flatMap { sentence =>
    ...
```

I don't think it's a hard rule for Spark but most other code follows this convention.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52141876 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -551,12 +553,17 @@ class Word2VecModel private[spark] ( } ind += 1 } -wordList.zip(cosVec) +var topResults = wordList.zip(cosVec) .toSeq - .sortBy(- _._2) + .sortBy(-_._2) .take(num + 1) .tail - .toArray +if (vecNorm != 0.0f) { + topResults = topResults.map { +case (word: String, cosVec: Double) => (word, cosVec / vecNorm) --- End diff -- same here for style, prefer ``` topResults = topResults.map { case (word: String, cosVec: Double) => ... ``` (as long as it fits within line length restriction) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52142152 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -551,12 +553,17 @@ class Word2VecModel private[spark] ( } ind += 1 } -wordList.zip(cosVec) +var topResults = wordList.zip(cosVec) .toSeq - .sortBy(- _._2) + .sortBy(-_._2) .take(num + 1) .tail - .toArray +if (vecNorm != 0.0f) { + topResults = topResults.map { +case (word: String, cosVec: Double) => (word, cosVec / vecNorm) --- End diff -- I also generally write that way; you don't need types here on the case match statement --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
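For reference, the same map from the reviewed diff written the way both reviewers suggest, with the redundant type annotations dropped from the case pattern. This is only a sketch of the snippet under review, assuming `topResults` is the `Seq[(String, Double)]` built just above it:

```
// Same map as in the diff, minus the type annotations in the pattern
if (vecNorm != 0.0f) {
  topResults = topResults.map { case (word, cosVec) => (word, cosVec / vecNorm) }
}
```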
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181253115 Jenkins, retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181257924 **[Test build #50915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50915/consoleFull)** for PR 10152 at commit [`84a0bc4`](https://github.com/apache/spark/commit/84a0bc4c73ca0915f6e6291442c8e60f6d80d879). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r52142120 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,26 +301,24 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => - new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext - -def next(): Array[Int] = { - val sentence = ArrayBuilder.make[Int] - var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { - case Some(w) => -sentence += w -sentenceLength += 1 - case None => +// each partition is a collection of sentences, +// will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { + // Each sentence will map to 0 or more Array[Int] + sentenceIter => --- End diff -- This could even be `dataset.mapPartitions { _.flatMap { sentence =>`, which I kind of like, but I don't know how much it matters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
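To make the compact form above concrete, here is a rough sketch of how the sentence-to-index translation could look with the underscore shorthand. Names like `dataset`, `bcVocabHash`, and `maxSentenceLength` are taken from the surrounding diff; this illustrates the style being discussed rather than the exact code that was merged:

```
// Sketch only: translate each sentence into word indexes, splitting long
// sentences into chunks of at most maxSentenceLength
val sentences: RDD[Array[Int]] = dataset.mapPartitions { _.flatMap { sentence =>
  // keep only words that are in the vocabulary, mapping each to its index
  val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
  // break long sentences into chunks of at most maxSentenceLength indexes
  wordIndexes.grouped(maxSentenceLength).map(_.toArray)
}}
```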
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181270729 **[Test build #50915 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50915/consoleFull)** for PR 10152 at commit [`84a0bc4`](https://github.com/apache/spark/commit/84a0bc4c73ca0915f6e6291442c8e60f6d80d879). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181271217 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50915/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-181240271 Added braces to make lint happy. Jenkins should be happy now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180820174 **[Test build #50873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50873/consoleFull)** for PR 10152 at commit [`443ec06`](https://github.com/apache/spark/commit/443ec06d8e6381a9df8f69b89fcd4955095e6b6c). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180820176 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180817180 Jenkins, retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180820036 **[Test build #50873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50873/consoleFull)** for PR 10152 at commit [`443ec06`](https://github.com/apache/spark/commit/443ec06d8e6381a9df8f69b89fcd4955095e6b6c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180820177 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50873/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-179858977 @ygcao sorry for the delay. I'm trying to run a few `spark-perf` tests and try larger scale if possible. Will revert ASAP. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-180188955 cool! Thanks. It will be helpful to see the large-scale test result. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r51391185 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff --
I finally made up my mind to do a hacky, simple perf test, just as a proof of concept: the perf difference between the different implementations over 5x runs is negligible, since it falls within the run-to-run variance of a single version.
Some details: I prepared a 32k document from two arbitrarily picked Wikipedia pages ("machine learning" and "Adversarial machine learning", excluding the reference sections), which contains 341 lines and can be split into 442 sentences simply by using dot+space as the sentence boundary. I injected the following test case into the Word2VecSuite class and ran it against three different implementations of the fit function in mllib.feature.Word2Vec: the old one in the master branch, and my final two versions from this PR, before and after adopting Sean's suggestion.

    test("testSpeed") {
      val lines = sc.parallelize(Source.fromFile(new File("/home/ygcao/machinelearning.txt")).getLines().toSeq)
      val sentences = lines.flatMap(_.split("\\. ")).map(line => line.split(" ").toSeq)
      println("read file into rdd, lines=", sentences.count())
      var builtModel: org.apache.spark.mllib.feature.Word2VecModel = null
      var duration = 0L
      for (i <- 1 to 5) {
        val start = System.currentTimeMillis()
        val model = new org.apache.spark.mllib.feature.Word2Vec().setVectorSize(3).setSeed(42L)
        builtModel = model.fit(sentences)
        duration += (System.currentTimeMillis() - start)
      }
      println(s"builtModel take ${duration},vocabulary size:${builtModel.getVectors.size}, learning's synonyms:${builtModel.findSynonyms("learning", 4).mkString("\n")}")
    }

The vocabulary size from the model is 155. Here are the timings (ms) for three runs of each version, plus the average of the final two runs. As you can see from the code, each run actually builds the model 5 times to magnify any potential diff.

|                      | masterVersion | PR-useIter | PR-useCollection |
|----------------------|---------------|------------|------------------|
| run1                 | 2232          | 2107       | 1933             |
| run2                 | 2085          | 1986       | 1987             |
| run3                 | 2005          | 2123       | 2004             |
| average(run2, run3)  | 2045          | 2054.5     | 1995.5           |

BTW: the following is not relevant to the perf test, just FYI. The two versions in this pull request produce exactly the same result, which supports the correctness of both. The result is interesting as well, although the dataset is quite tiny: the new versions (the un-merged ones) look better than the old version (the one in the master branch right now). Of course, you can screw the new version up with a bad sentence splitter (we could make a hard-cut splitter that does exactly the same thing as the old version). The simple splitter used for the test case can't deal with abbreviations; that's why I removed the references section from the text. Here are the top synonyms of "learning" using the tiny dataset. Please keep in mind it's just for fun, not solid proof of which is definitely better, since the dataset is tiny.
New versions, learning's synonyms:
(network,0.9990321742854605)
(related,0.9966140511173031)
(sparse,0.9965729586431097)
(algorithms,0.99376379497485)
Old version, learning's synonyms:
(against,0.9895162633562077)
(Support,0.9547255372896342)
(Association,0.9499811242788365)
(Attacks,0.9321700815006693)
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
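As a side note on the measurement itself: if this micro-benchmark were rerun, a warm-up pass and `System.nanoTime` would make the numbers a bit less noisy. The sketch below only illustrates that idea; the `fitOnce` callback is a hypothetical stand-in for building the model and is not part of the PR:

```
// Minimal sketch of a slightly more careful timing loop (assumes a fitOnce: () => Unit callback)
def timeRuns(runs: Int, fitOnce: () => Unit): Seq[Double] = {
  fitOnce() // warm-up run, excluded from the measurements
  (1 to runs).map { _ =>
    val start = System.nanoTime()
    fitOnce()
    (System.nanoTime() - start) / 1e6 // elapsed milliseconds
  }
}
```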
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-177855098 FYI: did a small-scale perf test and also checked logic correctness. Please see my comment on the review diff for details of the experiment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-177429666 Adopted all of Sean's suggestions, with minor edits to address a missed edge case and make the compiler happy. I think these suggestions are good ones, thanks! Please trigger the QA build to see whether everything goes fine, and potentially do a perf test to see whether there are big differences. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r51361374 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- Oops I keep imagining there is a flatMapPartitions. mapPartitions and flatMap. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r51356725 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- Sorry, I can't do much about the spark-perf thing; I still haven't had time to figure out what's blocking me from running the whole test suite locally. I (and my compiler) am also not aware of a flatMapPartitions function, but I did make a version that should do whatever Sean suggested. Please review and help run the perf test for the two latest versions. BTW: I don't worry much about the perf difference since it shouldn't be much; even minutes (not very likely) of difference per partition would only mean tens of minutes of overall penalty, which shouldn't matter much for a job that runs for hours. Of course, seeing the data is more convincing. My point is that if the diff is minor, we'd better optimize for readability. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r51088952 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- Agreed, that is more succinct. @ygcao can you look into making that change? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r51088866 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -556,6 +571,7 @@ class Word2VecModel private[spark] ( .sortBy(- _._2) .take(num + 1) .tail + .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm))) --- End diff -- @ygcao please change this line to `.map { case (word, cosVec) => (word, if (vecNorm == 0.0) 0.0 else (cosVec / vecNorm)) }` as per discussion above. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50961556 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- Good point; I'm not sure if it was for efficiency or simply to copy the C++ implementation. Hm, I don't suppose we have benchmarks here. I'm OK with keeping the longer more efficient implementation. but now that input sentence boundaries matter, this can still be simplified. This part can still be used: ``` // Each input sentence will produce 1 or more Array[Int], so flatMapPartitions dataset.flatMapPartitions { sentenceIter => // Each sentence will map to 1 or more Array[Int], so map sentenceIter.map { sentence => ... } } ``` That is you can still `flatMapPartitions` over `Iterator`s for each sentence to join them without trying to manage both the `Iterator` over sentences and words. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-174446659 @ygcao this has not been merged. You can see the PR is open and there is no message about merging into master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50671686 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -556,6 +571,7 @@ class Word2VecModel private[spark] ( .sortBy(- _._2) .take(num + 1) .tail + .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm))) --- End diff -- @srowen if either of the vector norms are 0, the cosine similarity should be 0. Though by definition I think here `v._2` will be 0 if the norm is 0, what do you think about making this `if vecNorm == 0) 0 ...` to be explicit? (follows what you did in https://github.com/apache/spark/commit/94b39fecff3794727c186bd681fa4c6af4fd#diff-88f4b62c382b26ef8e856b23f5167ccdR542) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
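For context, the convention being discussed (cosine similarity defined as 0 when either vector has zero norm) can be written as a tiny helper. This is only a sketch of the definition on plain arrays, not code from the PR:

```
// Cosine similarity, defined as 0.0 when either vector has zero norm
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}
```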
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-174452391 As Sean says this hasn't been merged yet, I was waiting for the latest test build to pass before making a final pass over this. Ideally I'd like to just get one of @srowen @jkbradley @mengxr to give this a pass through too before merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50676720 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -556,6 +571,7 @@ class Word2VecModel private[spark] ( .sortBy(- _._2) .take(num + 1) .tail + .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm))) --- End diff -- I agree that we should define cosine similarity with a zero vector to be 0. In this case the results are pretty meaningless anyway, since the dot product was already 0 for everything, and so the top N are random. I'd say:
```
.map { case (word, cosVec) => (word, if (vecNorm == 0.0) 0.0 else (cosVec / vecNorm)) }
```
For more efficiency we could apply this map only outside the corner case that `vecNorm` is 0.0; not sure if it's simpler but it's a little less work:
```
var result = wordList...tail
if (vecNorm != 0.0) {
  result = result.map { case (word, cosVec) => (word, cosVec / vecNorm) }
}
result.toArray
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50675972 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- I understand that this part of the change intends to respect the implied sentence boundaries in the input. I think it can be simpler? One input sentence maps to 1 or more arrays, and the result should be flattened. Something like?
```
// Each input sentence will produce 1 or more Array[Int], so flatMapPartitions
dataset.flatMapPartitions { sentenceIter =>
  // Each sentence will map to 1 or more Array[Int], so map
  sentenceIter.map { sentence =>
    // Sentence of words, some of which map to a hash, so flatMap
    val hashes = sentence.flatMap(bcVocabHash.value.get)
    // break into sequence of at most maxSentenceLength
    hashes.grouped(maxSentenceLength).map(_.toArray)
  }
}
```
I haven't tested it but does that seem like the intent? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
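The `grouped` call is doing the chunking work in the sketch above. As a quick illustration of its semantics from the Scala standard library (not PR code):

```
// grouped(n) yields chunks of at most n elements; only the last chunk may be shorter
val chunks = List(1, 2, 3, 4, 5, 6, 7).grouped(3).toList
// chunks == List(List(1, 2, 3), List(4, 5, 6), List(7))
```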
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50683565 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging { val expTable = sc.broadcast(createExpTable()) val bcVocab = sc.broadcast(vocab) val bcVocabHash = sc.broadcast(vocabHash) - -val sentences: RDD[Array[Int]] = words.mapPartitions { iter => +// each partition is a collection of sentences, will be translated into arrays of Index integer +val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter => new Iterator[Array[Int]] { -def hasNext: Boolean = iter.hasNext +var wordIter: Iterator[String] = null + +def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext) def next(): Array[Int] = { val sentence = ArrayBuilder.make[Int] var sentenceLength = 0 - while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) { -val word = bcVocabHash.value.get(iter.next()) -word match { + // do translation of each word into its index in the vocabulary, --- End diff -- I also looked at a roughly similar approach. I assumed that the initial formulation was done for efficiency reasons (while loop etc) - hence why I didn't suggest it or explore it much further - but we'd have to get the original author to weigh in on that. I agree your approach is simpler and more succinct. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r50684006 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -556,6 +571,7 @@ class Word2VecModel private[spark] ( .sortBy(- _._2) .take(num + 1) .tail + .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm))) --- End diff -- Strictly speaking we could throw an error or warning if passing in a zero vector as it's meaningless as you say. Still, I think in real usage that's unlikely, so I'd just go for the first option. I don't think there are real world efficiency gains to be had with the second option. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-174413771 Thanks for the merge! It looks successful, but it seems the change is still not appearing in master. What's the rest of the workflow for it to appear in the master branch of Spark? Is any follow-up needed from me? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-173607998 **[Test build #2431 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2431/consoleFull)** for PR 10152 at commit [`e938208`](https://github.com/apache/spark/commit/e938208d9c85515f62b41635a8445b8ab31f55f2). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-173621659 **[Test build #2431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2431/consoleFull)** for PR 10152 at commit [`e938208`](https://github.com/apache/spark/commit/e938208d9c85515f62b41635a8445b8ab31f55f2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172787971 **[Test build #2407 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2407/consoleFull)** for PR 10152 at commit [`141d7a2`](https://github.com/apache/spark/commit/141d7a2bff8c8e462a9a2aff33b1f16bebc30b25). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172777566 PySpark's test case fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172778827 **[Test build #2407 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2407/consoleFull)** for PR 10152 at commit [`141d7a2`](https://github.com/apache/spark/commit/141d7a2bff8c8e462a9a2aff33b1f16bebc30b25). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-173094302 Adjusted the Python doctest format. Did the heavy job of running the PySpark tests locally; it should be OK now. Finished test(python): pyspark.ml.feature (35s) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172468729 Ouch, we finally decided to make a backward-incompatible change for synonyms~~ and that caused a test case failure. I adjusted the expected values according to our new logic: always return normalized similarity values from the findSynonyms function. I also added defensive code for a potential divide-by-zero, following @srowen's hot fix for a similar situation. I have 90% confidence that these changes are enough to pass the tests. The reason I can't be 100% sure is that Spark SQL can't compile and the run_tests script is not working in my local env. I hard-coded the expected values, but I verified them against the output both before and after the change.
FYI: the compile issue on the master branch is in Spark SQL, not relevant to our changes, but it is a blocker somebody should look into.
[error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala:311: reference to Rollup is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.sql.catalyst.plans.logical._
[error] and import org.apache.spark.sql.catalyst.expressions._
[error] Seq(Rollup(children.map(nodeToExpr))),
[error] ^
[error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala:319: reference to Cube is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.sql.catalyst.plans.logical._
[error] and import org.apache.spark.sql.catalyst.expressions._
[error] Seq(Cube(children.map(nodeToExpr))),
[error] ^
[error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:191: reference to Rollup is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.sql.catalyst.plans.logical._
[error] and import org.apache.spark.sql.catalyst.expressions._
[error] def bitmasks(r: Rollup): Seq[Int] = {
[error] ^
[error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:204: reference to Cube is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.sql.catalyst.plans.logical._
[error] and import org.apache.spark.sql.catalyst.expressions._
[error] def bitmasks(c: Cube): Seq[Int] = {
[error] ^
[error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:210: reference to Cube is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.sql.catalyst.plans.logical._
[error] and import org.apache.spark.sql.catalyst.expressions._
[error] case Aggregate(Seq(c @ Cube(groupByExprs)), aggregateExpressions, child) =>
[error]
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
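Not related to this PR, but for anyone hitting the same compile error: one common way to resolve that kind of ambiguity is to hide the clashing names on one of the wildcard imports. This is only a generic sketch of the technique, not a claim about how the Spark SQL issue was actually fixed:

```
// Hiding the clashing names on one wildcard import removes the ambiguity
import org.apache.spark.sql.catalyst.expressions.{Cube => _, Rollup => _, _}
import org.apache.spark.sql.catalyst.plans.logical._
```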
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172480884 **[Test build #2394 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2394/consoleFull)** for PR 10152 at commit [`76e8266`](https://github.com/apache/spark/commit/76e82667bd04bc359a336c7538d2595aade8384e). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-172489168 **[Test build #2394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2394/consoleFull)** for PR 10152 at commit [`76e8266`](https://github.com/apache/spark/commit/76e82667bd04bc359a336c7538d2595aade8384e). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-170854032 @ygcao sorry for the delay, I was on vacation and not checking email regularly. I'm pretty much happy with this, but I'd like to just get one of @srowen @jkbradley @mengxr to give it a final review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-170927597 @ygcao the changes have impacted the tests - could you take a look at the failure? I think we may need to update the test suite. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-170885940 **[Test build #2372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2372/consoleFull)** for PR 10152 at commit [`214d0d9`](https://github.com/apache/spark/commit/214d0d9d3350a624f98f36aff2393b7b61fd6176). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-170878810 **[Test build #2372 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2372/consoleFull)** for PR 10152 at commit [`214d0d9`](https://github.com/apache/spark/commit/214d0d9d3350a624f98f36aff2393b7b61fd6176). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-168907161 @MLnick Happy new year. I think I've addressed all your comments last year; could you help do the merge this year? If anyone still has other concerns, please let me know. An unfinished pull request without a good reason would be a huge waste. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r48229609 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -77,6 +77,18 @@ class Word2Vec extends Serializable with Logging { private var numIterations = 1 private var seed = Utils.random.nextLong() private var minCount = 5 + private var maxSentenceLength = 1000 + + /** + * Sets the maximum length of each sentence in the input data. + * Any sentence longer than this threshold will be divided into chunks of + * up to `maxSentenceLength` size (default: 1000) + */ + @Since("1.6.1") --- End diff -- @ygcao actually could we target `2.0.0` instead of `1.6.1` in this PR, since it's against master? We'll consider backporting after it is merged. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r48324725 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -77,6 +77,18 @@ class Word2Vec extends Serializable with Logging { private var numIterations = 1 private var seed = Utils.random.nextLong() private var minCount = 5 + private var maxSentenceLength = 1000 + + /** + * Sets the maximum length of each sentence in the input data. + * Any sentence longer than this threshold will be divided into chunks of + * up to `maxSentenceLength` size (default: 1000) + */ + @Since("1.6.1") --- End diff -- Done! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r48324747

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -77,6 +77,18 @@ class Word2Vec extends Serializable with Logging {
   private var numIterations = 1
   private var seed = Utils.random.nextLong()
   private var minCount = 5
+  private var maxSentenceLength = 1000
+
+  /**
+   * Sets the maximum length of each sentence in the input data.
+   * Any sentence longer than this threshold will be divided into chunks of
+   * up to `maxSentenceLength` size (default: 1000)
+   */
+  @Since("1.6.1")
--- End diff --

Done!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-166544099 Modified the comment accordingly.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-166215753 Pinging @jkbradley @mengxr @MechCoder again for a final review - could you give this a look and confirm you're in agreement with my comments above? Also, thoughts on whether this should target `1.6.1`, since it is actually a fairly major yet subtle bug in the implementation - or whether it should even be backported to `1.5.3`?
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r48119953

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -77,6 +77,20 @@ class Word2Vec extends Serializable with Logging {
   private var numIterations = 1
   private var seed = Utils.random.nextLong()
   private var minCount = 5
+  private var maxSentenceLength = 1000
+
+  /**
+   * sets the maxSentenceLength, maxSentenceLength is used as the threshold for cutting sentence
--- End diff --

One final thing - can you address the comment above? And I think we can actually remove the `@param` and `@return` to match the comments for the other setters in this class.
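For comparison, the existing setters in this class carry a single descriptive Scaladoc sentence with no `@param`/`@return` tags; a representative sketch of that style (paraphrased, not quoted from this diff) is:

```scala
/**
 * Sets minCount, the minimum number of times a token must appear
 * to be included in the word2vec model's vocabulary (default: 5).
 */
def setMinCount(minCount: Int): this.type = {
  this.minCount = minCount
  this
}
```

The new `setMaxSentenceLength` comment would be trimmed to the same shape.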
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-166215472 @ygcao just one final comment on the `setMaxSentenceLength` setter comment to address, thanks!
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-165959134 Thanks @MLnick! Added the @Since annotation in the code accordingly, and updated the issue to change its type to bug and keep it consistent with what we finalized for the change.
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r47875988

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -281,17 +295,28 @@ class Word2Vec extends Serializable with Logging {
     val expTable = sc.broadcast(createExpTable())
     val bcVocab = sc.broadcast(vocab)
     val bcVocabHash = sc.broadcast(vocabHash)
-
-    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
+    // each partition is a collection of sentences, will be translated into arrays of Index integer
+    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
       new Iterator[Array[Int]] {
-        def hasNext: Boolean = iter.hasNext
+        var wordIter: Iterator[String] = null
+
+        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)

         def next(): Array[Int] = {
           val sentence = ArrayBuilder.make[Int]
           var sentenceLength = 0
-          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
-            val word = bcVocabHash.value.get(iter.next())
-            word match {
+          // do translation of each word into its index in the vocabulary,
+          // do cutting only when the sentence is larger than maxSentenceLength
+          if ((wordIter == null || !wordIter.hasNext) && sentenceIter.hasNext) {
+            do {
--- End diff --

ok, fair enough
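As a rough, self-contained illustration of the behavior this diff is after - sentence boundaries are respected, and only sentences longer than the threshold are cut - the chunking can be sketched as below. The names (`vocabHash`, `maxSentenceLength`) mirror the diff, but this is a simplified stand-in, not the PR's stateful-iterator implementation:

```scala
// Translate each sentence into vocabulary indices, dropping out-of-vocabulary
// words, then split any sentence longer than maxSentenceLength into chunks.
def toIndexChunks(
    sentences: Iterator[Iterable[String]],
    vocabHash: Map[String, Int],
    maxSentenceLength: Int): Iterator[Array[Int]] = {
  sentences.flatMap { sentence =>
    val indices = sentence.iterator.flatMap(word => vocabHash.get(word)).toArray
    indices.grouped(maxSentenceLength) // cuts only when the sentence exceeds the threshold
  }
}
```

The key difference from the old code is that a chunk never spans two input sentences, so word2vec context windows no longer cross sentence boundaries.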
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/10152#discussion_r47876214

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -77,6 +77,20 @@ class Word2Vec extends Serializable with Logging {
   private var numIterations = 1
   private var seed = Utils.random.nextLong()
   private var minCount = 5
+  private var maxSentenceLength = 1000
+
+  /**
+   * sets the maxSentenceLength, maxSentenceLength is used as the threshold for cutting sentence
+   * into chunks when it is too long. (default: 1000)
+   * @param maxSentenceLength the maxSentenceLength allowed.
+   *                          recommend to set it large enough to respect reasonable long sentences
+   *                          while not overflow memory
+   * @return this object
+   */
+  def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
--- End diff --

I think we should target `1.6.1` here: `@Since("1.6.1")` - overall I'd view this PR as a bugfix (though adding the parameter is a minor extra feature). I think we'd want to include this in branch-1.6, and possibly even think about backporting the core changes to branch-1.5
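For context on how a caller would use the new knob, a minimal usage sketch is below; the tokenized-sentence RDD and the chosen parameter values are hypothetical, and only `setMaxSentenceLength` is the setter added by this PR:

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

// `sentences` is a hypothetical RDD of already-tokenized sentences.
def train(sentences: RDD[Seq[String]]): Word2VecModel = {
  new Word2Vec()
    .setVectorSize(100)
    .setMinCount(5)
    .setMaxSentenceLength(2000) // sentences longer than 2000 words get split into chunks
    .fit(sentences)
}
```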