Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/10152#issuecomment-162800260

@jkbradley as far as I can see: the input to `fit` is `dataset: RDD[Iterable[String]]`. However, in [L273](https://github.com/apache/spark/pull/10152/files#diff-88f4b62c382b26ef8e856b23f5167ccdL273), we create `val words = dataset.flatMap(x => x)`, flattening the RDD of sentences into an RDD of words, i.e. `RDD[Iterable[String]] -> RDD[String]`. Then in [L285](https://github.com/apache/spark/pull/10152/files#diff-88f4b62c382b26ef8e856b23f5167ccdL285), `words` (rather than `dataset`) is used in `mapPartitions`. Inside the `mapPartitions` block, each call to `next` advances through the iterator (which is now the flat-mapped stream of words, *not* a stream of sentences), with a hard cutoff that emits each sentence `Array[Int]` once `MAX_SENTENCE_LENGTH` is reached.

So as I read the current code, it does indeed treat the input as one continuous stream of words, discards sentence boundaries, and imposes a hard 1000-word limit on "sentences". Let me know if I missed something here.

This actually matches what the Google implementation does (from my quick look through [the C code](http://word2vec.googlecode.com/svn/trunk/word2vec.c), e.g. L373-405 and L70 in `ReadWord`). So, purely technically, the current code is "correct" in that it matches the original, but I'm not sure whether using `words` in the `mapPartitions` block was intentional. To me it remains an open question whether this approach, or preserving sentence structure/boundaries, is better.
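To make the behaviour concrete, here is a minimal plain-Python sketch (not the actual Spark/Scala code) of what the flatten-then-cut logic effectively does: sentence boundaries are lost, and the word stream is re-cut into fixed-size blocks. The function name `chunk_word_stream` and the small `max_len` used in the demo are my own, for illustration only; the Spark code uses `MAX_SENTENCE_LENGTH = 1000`.

```python
from itertools import chain, islice

MAX_SENTENCE_LENGTH = 1000  # the hard cutoff used in the Spark code

def chunk_word_stream(sentences, max_len=MAX_SENTENCE_LENGTH):
    """Flatten sentences into one word stream, then cut the stream into
    "sentences" of at most max_len words, ignoring the original sentence
    boundaries -- mirroring `dataset.flatMap(x => x)` followed by the
    mapPartitions loop over `words`."""
    words = chain.from_iterable(sentences)  # RDD[Iterable[String]] -> RDD[String]
    while True:
        block = list(islice(words, max_len))
        if not block:
            return
        yield block

# Two input sentences of 3 words each; with max_len=4 the boundary
# between them is lost and the stream is re-cut after 4 words.
sentences = [["a", "b", "c"], ["d", "e", "f"]]
print(list(chunk_word_stream(sentences, max_len=4)))
# -> [['a', 'b', 'c', 'd'], ['e', 'f']]
```

With sentence boundaries preserved instead, the output would have been the two original 3-word sentences; this is exactly the difference at stake in the discussion above.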