Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46922797
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,16 +280,17 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
     +    // Each partition is a collection of sentences; each sentence will be translated into an array of vocabulary indices.
     +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { iter =>
           new Iterator[Array[Int]] {
             def hasNext: Boolean = iter.hasNext
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
     +          // Translate each word into its index in the vocabulary; no longer constrained by a fixed sentence length.
     +            for (wd <- iter.next()) {
    --- End diff --
    
    I think my change is getting some understanding now. The problem is not so much having a maxSentenceLength; the problem with the original version is that it discards natural sentences by flatMapping them and then cutting the document into chunks of maxSentenceLength (except for the last chunk, which may be shorter), so no sentence boundaries are respected as a result.
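    To make the boundary loss concrete, here is a minimal local sketch using plain Scala collections (not the actual RDD code; `vocabHash`, `maxSentenceLength`, and the toy data are all illustrative):

    ```scala
    // Toy vocabulary and input: two natural sentences.
    val vocabHash = Map("the" -> 0, "cat" -> 1, "sat" -> 2, "dog" -> 3, "ran" -> 4)
    val maxSentenceLength = 4
    val docs = Seq(Seq("the", "cat", "sat"), Seq("the", "dog", "ran"))

    // Original behaviour: flatten everything into one word stream, then cut it
    // into fixed-size chunks. The boundary between the two sentences is lost:
    //   chunked == Seq(Array(0, 1, 2, 0), Array(3, 4))
    val chunked: Seq[Array[Int]] =
      docs.flatten
        .flatMap(vocabHash.get)       // word -> vocabulary index
        .grouped(maxSentenceLength)   // fixed-size chunks, boundaries ignored
        .map(_.toArray)
        .toSeq

    // This change: translate each input sentence into its own index array,
    // so the natural boundaries survive:
    //   bySentence == Seq(Array(0, 1, 2), Array(0, 3, 4))
    val bySentence: Seq[Array[Int]] =
      docs.map(_.flatMap(vocabHash.get).toArray)
    ```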
    This change is designed to use the sentence structure of the input. As for a maximum sentence length limit, I don't really see a need for it: the worst case is that an entire document is one sentence, and that still works, since each document is small enough to fit in memory and the skip-gram model is built around a size-constrained sliding window rather than around sentences or documents. Respecting sentences is about preventing the sliding window from taking context words for a target word from adjacent sentences, which could be irrelevant.
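    The sliding-window point is easy to see in isolation. Below is a minimal sketch of skip-gram context extraction (not Spark's implementation, which also shrinks the window randomly per position; `window` here is a fixed size). Because the window indexes into a single sentence array, a word's context can never leak in from a neighbouring sentence, no matter how long or short the sentence is:

    ```scala
    // For each position in a sentence, collect the surrounding word indices
    // within `window` positions as its context. The bounds are clamped to the
    // sentence array, so context never crosses a sentence boundary.
    def skipGramContexts(sentence: Array[Int], window: Int): Seq[(Int, Seq[Int])] =
      sentence.indices.map { pos =>
        val lo  = math.max(0, pos - window)
        val hi  = math.min(sentence.length - 1, pos + window)
        val ctx = (lo to hi).filter(_ != pos).map(sentence(_))
        (sentence(pos), ctx)                 // (target word, context words)
      }
    ```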
    Agreed that people may use word2vec in very different scenarios, so we can offer both behaviours as choices. I'll make it an option rather than arguing over which one is absolutely correct, although I don't see a good justification for using fixed-size chunks instead of the natural structure.
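    For the option, something along these lines would work (the class name, setter, and default are hypothetical, not a final API):

    ```scala
    // Hypothetical sketch of the proposed switch; names are illustrative only.
    class Word2VecOptions {
      private var useSentenceBoundary: Boolean = false

      /** If true, treat each input Seq[String] as one natural sentence;
        * if false, keep the old behaviour of fixed-size chunks. */
      def setUseSentenceBoundary(value: Boolean): this.type = {
        useSentenceBoundary = value
        this
      }
    }
    ```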

