Github user ygcao commented on the pull request:

https://github.com/apache/spark/pull/10152#issuecomment-163512458

I have to say that word2vec/skip-gram is absolutely affected by its training data, just like any other ML algorithm. I can also tell you that I observed big differences when applying different preprocessing to the data at the scale of millions to billions of sentences. Researchers often simplify engineering details; one obvious example is that phrase2vec was kept highly simplified to show the algorithm's effectiveness rather than relying on a complex entity-recognition engine. That doesn't mean we should not do more advanced phrase construction for training.

The benefit of making use of sentence boundaries is quite intuitive once you think in terms of skip-gram's expected output and how back-propagation works. At training time, the beginning words of the next sentence are garbage input when used as context words for the word at the tail of the current sentence. From the other side, in the original version, the words around a document's fixed-size cutting points lose semantically meaningful context words. Words at sentence endings or around cutting points are a minority, so you may not notice a huge impact in the hard-cut version, but that's no reason not to improve it further.

Again, model building is absolutely sensitive to input data; it's just that we humans aren't sensitive to issues caused by a minority of cases without a deep dive. The good thing is that we still have theory to reason with: what harm could be brought by respecting sentence boundaries?
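To make the boundary argument concrete, below is a minimal Scala sketch (not the PR's actual code) of skip-gram context-pair generation. The window size of 2, the toy sentences, and all names are illustrative assumptions; it only shows which (center, context) pairs exist solely because a sentence boundary was ignored.

    // Minimal sketch, assuming a fixed symmetric window; not Spark MLlib's implementation.
    object SkipGramWindows {
      // Emit (center, context) pairs within a fixed window over a token stream.
      def pairs(tokens: Seq[String], window: Int): Seq[(String, String)] =
        tokens.indices.flatMap { i =>
          val lo = math.max(0, i - window)
          val hi = math.min(tokens.length - 1, i + window)
          (lo to hi).filter(_ != i).map(j => (tokens(i), tokens(j)))
        }

      def main(args: Array[String]): Unit = {
        val s1 = Seq("the", "cat", "sat")          // hypothetical sentence 1
        val s2 = Seq("stocks", "fell", "today")    // hypothetical sentence 2

        // Hard-cut stream: the tail of s1 picks up the head of s2 as "context".
        val concatenated = pairs(s1 ++ s2, window = 2)
        // Boundary-respecting: each sentence is windowed independently.
        val perSentence = pairs(s1, window = 2) ++ pairs(s2, window = 2)

        // Pairs that exist only because the boundary was ignored,
        // e.g. (sat, stocks) and (sat, fell).
        val garbage = concatenated.toSet -- perSentence.toSet
        println(garbage)
      }
    }

Running this prints exactly the cross-boundary pairs, which are the "garbage input" described above: the boundary-respecting version trains on a strict subset of the hard-cut version's pairs, dropping only the semantically meaningless ones.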