Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-162800260
  
    @jkbradley as far as I can see:
    
    The input to `fit` is `dataset: RDD[Iterable[String]]`. However, in [L273](https://github.com/apache/spark/pull/10152/files#diff-88f4b62c382b26ef8e856b23f5167ccdL273), we create `val words = dataset.flatMap(x => x)`, flattening the RDD of sentences into an RDD of words, i.e. `RDD[Iterable[String]] -> RDD[String]`.
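
    To make that concrete, a toy sketch (not the PR's code; `sc` is an assumed `SparkContext`):

    ```scala
    import org.apache.spark.rdd.RDD

    // Two input "sentences".
    val dataset: RDD[Iterable[String]] = sc.parallelize(Seq(
      Seq("the", "quick", "fox"),
      Seq("jumps", "over")
    ))

    // RDD[Iterable[String]] -> RDD[String]: the sentence boundaries are gone.
    val words: RDD[String] = dataset.flatMap(x => x)
    ```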
    
    Then in [L285](https://github.com/apache/spark/pull/10152/files#diff-88f4b62c382b26ef8e856b23f5167ccdL285), `words` is used in `mapPartitions` (as opposed to `dataset`). In the `mapPartitions` block, `next` advances through the iterator (which is now the flatMapped stream of words, *not* a stream of sentences) and applies a hard cut-off, emitting each sentence `Array[Int]` once `MAX_SENTENCE_LENGTH` words have been consumed.
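
    Roughly, the shape of that block (a simplified paraphrase from my reading, not the exact diff; `vocabHash: Map[String, Int]` stands in for the broadcast vocabulary lookup):

    ```scala
    import scala.collection.mutable.ArrayBuilder

    val MAX_SENTENCE_LENGTH = 1000  // the hard cap used in the file

    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
      new Iterator[Array[Int]] {
        def hasNext: Boolean = iter.hasNext

        def next(): Array[Int] = {
          val sentence = ArrayBuilder.make[Int]
          var sentenceLength = 0
          // Consume words from the *flattened* stream until the hard cap is
          // hit -- nothing here ever looks at the original sentence boundaries.
          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
            vocabHash.get(iter.next()) match {
              case Some(w) =>
                sentence += w
                sentenceLength += 1
              case None => // out-of-vocabulary word: skipped
            }
          }
          sentence.result()
        }
      }
    }
    ```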
    
    So the way I read the current code, it does indeed just treat the input as a stream of words, discarding sentence boundaries and applying a hard 1000-word limit on "sentences". Let me know if I missed something here.
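
    The practical consequence, approximated with plain Scala collections (`grouped` plays the role of the hard cut-off; the real code also filters out-of-vocab words):

    ```scala
    // Two 600-word "sentences" become chunks of 1000 and 200 words:
    // the first chunk mixes 600 "a"s with 400 "b"s, so the boundary is lost.
    val twoSentences: Seq[Seq[String]] = Seq(Seq.fill(600)("a"), Seq.fill(600)("b"))
    val flattened = twoSentences.flatten        // 1200 words, one stream
    val chunks = flattened.grouped(1000).toSeq  // sizes: 1000, 200
    assert(chunks.head.count(_ == "a") == 600 && chunks.head.count(_ == "b") == 400)
    ```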
    
    This in fact matches what the Google implementation does (from my quick look through [the C code](http://word2vec.googlecode.com/svn/trunk/word2vec.c), e.g. L373-405 and L70 in `ReadWord`).
    
    So, purely technically, the current code is "correct" in that it matches the original, but I'm not sure whether using `words` in the `mapPartitions` block was intentional. To me it's an open question whether this approach or keeping sentence structure / boundaries is better.

