Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46922797

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,16 +280,17 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences; each sentence is translated into an array of vocabulary indices
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { iter =>
           new Iterator[Array[Int]] {
             def hasNext: Boolean = iter.hasNext

             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    +          // translate each word into its index in the vocabulary; no longer constrained by a fixed sentence length
    +          for (wd <- iter.next()) {
    --- End diff --

    I think my change is better understood now. The problem is not so much having a maxSentenceLength; the problem with the original version is that it discards natural sentences: it flatMaps the words and then cuts the document into chunks of maxSentenceLength (except the last one, which may be shorter), so no sentence boundaries are respected as a result. This change is designed to use the sentence structure of the input. As for the max sentence length limit, I don't really see a need for it. The worst case is that an entire document is one sentence, and it will still work, since each document is small enough to fit in memory and the skip-gram model is built around a size-constrained sliding window rather than a sentence or document. Respecting sentences avoids the sliding window taking context words for a target word from other adjacent sentences, which could be irrelevant.
    Agreed, people may use word2vec in very different scenarios, so we can offer both logics as choices. I'll make it an option rather than arguing over which one is absolutely correct, although I don't see a good justification for using fixed-size chunks instead of the sentence structure.
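    To make the distinction concrete, here is a minimal sketch (in Python rather than Scala, with a toy vocabulary, so the names `vocab_hash`, `sentences`, and `MAX_SENTENCE_LENGTH` are illustrative assumptions, not the actual Spark code) contrasting the original flatMap-then-chunk behavior with the proposed per-sentence translation:

    ```python
    # Toy vocabulary: word -> index (hypothetical, for illustration only)
    vocab_hash = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

    # Input preserves natural sentence boundaries
    sentences = [["the", "cat", "sat"], ["on", "the", "mat"]]

    # Original-style approach: flatten all words, then cut into
    # fixed-size chunks of MAX_SENTENCE_LENGTH. Chunk boundaries
    # fall at arbitrary positions, not at sentence boundaries.
    MAX_SENTENCE_LENGTH = 4
    flat = [vocab_hash[w] for s in sentences for w in s if w in vocab_hash]
    chunked = [flat[i:i + MAX_SENTENCE_LENGTH]
               for i in range(0, len(flat), MAX_SENTENCE_LENGTH)]
    # chunked == [[0, 1, 2, 3], [0, 4]]
    # "on" (index 3) from the second sentence lands in the first chunk,
    # so the sliding window would treat it as context for "sat".

    # Proposed approach: translate each sentence separately,
    # so the sliding window never crosses a sentence boundary.
    by_sentence = [[vocab_hash[w] for w in s if w in vocab_hash]
                   for s in sentences]
    # by_sentence == [[0, 1, 2], [3, 0, 4]]
    ```

    The chunked version mixes words from adjacent sentences into one array, which is exactly the irrelevant-context problem described above; the per-sentence version keeps each array aligned with a natural sentence.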