Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-163512458
  
    I have to say that word2vec, or skip-gram, can absolutely be affected by its 
training data, just like any other ML algorithm. I have also observed big 
differences when applying different preprocessing to the data, at the scale of 
millions to billions of sentences.
    Researchers often simplify engineering details. One obvious example is 
phrase2vec, which is highly simplified to show the algorithm's effectiveness 
instead of relying on a complex entity recognition engine. That doesn't mean we 
should not do more advanced phrase construction for training.
    The benefit of making use of sentence boundaries is quite intuitive when you 
think in terms of the expected output of skip-gram and how backpropagation 
works. At training time, the first words of the next sentence are garbage input 
as context words for the last word of the current sentence. On the other hand, 
in the original version, the words around the fixed-size cutting points of a 
document lose semantically meaningful context words. Words at sentence endings 
or around cutting points are a minority, so you may not notice a huge impact 
from the hard-cut version, but that is no reason not to improve it further. 
Again, model building is absolutely sensitive to input data; we humans just 
aren't sensitive to issues caused by a minority of cases without a deep dive. 
The good thing is that we still have theory to reason with. What harm can 
respecting sentence boundaries bring?
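
    To make the two effects concrete, here is a minimal Python sketch. It is 
not the code in this pull request. It assumes the hard-cut version flattens all 
sentences into one token stream and cuts it into fixed-size chunks; the function 
names, the toy sentences, and the `chunk_size` and `window` parameters are 
hypothetical and chosen only for illustration.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def pairs_ignoring_boundaries(sentences, window, chunk_size):
    """Assumed hard-cut scheme: flatten everything, then cut fixed-size chunks."""
    flat = [word for sentence in sentences for word in sentence]
    pairs = []
    for start in range(0, len(flat), chunk_size):
        pairs.extend(skipgram_pairs(flat[start:start + chunk_size], window))
    return pairs


def pairs_respecting_boundaries(sentences, window):
    """Build context windows inside each sentence only."""
    pairs = []
    for sentence in sentences:
        pairs.extend(skipgram_pairs(sentence, window))
    return pairs


# Two unrelated toy sentences.
sentences = [["the", "cat", "sat"], ["stocks", "fell", "sharply"]]

hard_cut = pairs_ignoring_boundaries(sentences, window=1, chunk_size=4)
bounded = pairs_respecting_boundaries(sentences, window=1)

# Pairs produced only by the hard cut (context leaking across sentences).
leaked = [p for p in hard_cut if p not in bounded]
# Pairs the hard cut drops because a chunk boundary split a sentence.
lost = [p for p in bounded if p not in hard_cut]
```

    With these toy inputs, `leaked` holds the pairs linking "sat" and "stocks", 
which come from different sentences, and `lost` holds the pairs linking "stocks" 
and "fell", which the chunk boundary separated.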



