YongGang Cao created SPARK-12153:
------------------------------------

             Summary: Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible
                 Key: SPARK-12153
                 URL: https://issues.apache.org/jira/browse/SPARK-12153
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.5.2
            Reporter: YongGang Cao
            Priority: Minor

Sentence boundaries matter for the sliding window: the model should not be trained on a window that spans two sentences. The current hard split of the input into 100-word "sentences" does not match real text. In addition, the cosineSimilarity function is private, so callers cannot use it. Callers may also need access to the vocabulary and the word index table; both need getters.

I have made changes to address the issues above and will send a pull request for review.
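
The sentence-boundary concern can be illustrated with a small sketch. It is plain Python, not Spark's Word2Vec code: the helper names `fixed_chunks` and `context_pairs` are hypothetical, and a chunk length of 4 stands in for the 100-word split so the effect is visible on a tiny input. It compares the (center, context) pairs produced when a flat word stream is cut at a fixed length with the pairs produced when the window is confined to each real sentence.

```python
from typing import Iterable, List, Tuple


def fixed_chunks(words: List[str], max_len: int) -> List[List[str]]:
    """Cut a flat word stream into chunks of at most max_len words,
    ignoring where sentences actually end (hypothetical helper)."""
    return [words[i:i + max_len] for i in range(0, len(words), max_len)]


def context_pairs(sentences: Iterable[List[str]], window: int) -> List[Tuple[str, str]]:
    """Collect (center, context) pairs; the window never leaves the
    list it is currently scanning (hypothetical helper)."""
    pairs = []
    for sentence in sentences:
        for i, center in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, sentence[j]))
    return pairs


# Two real sentences (toy data, assumed for illustration).
sentences = [["spark", "trains", "models"], ["cats", "sleep"]]
flat = [w for s in sentences for w in s]

# Fixed-length split: the chunk boundary ignores the sentence boundary.
chunked = fixed_chunks(flat, max_len=4)

crossing = context_pairs(chunked, window=1)    # windows follow the chunks
bounded = context_pairs(sentences, window=1)   # windows follow the sentences

print(("models", "cats") in crossing)  # pair spanning two sentences
print(("models", "cats") in bounded)
```

With the fixed split, "models" and "cats" become neighbours even though they belong to different sentences, while "cats" and "sleep" are separated by the chunk boundary. Confining the window to each sentence avoids both effects.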