Sean Owen created SPARK-28081:
---------------------------------

             Summary: word2vec 'large' count value too low for very large corpora
                 Key: SPARK-28081
                 URL: https://issues.apache.org/jira/browse/SPARK-28081
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.4.3
            Reporter: Sean Owen
            Assignee: Sean Owen

The word2vec implementation operates on word counts and uses a hard-coded value of 1e9 to mean "a very large count, larger than any actual count". However, this causes the logic to fail if a large corpus in fact contains words that occur more than this many times. We can probably improve the implementation to better handle very large counts in general.
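
To illustrate the failure mode, here is a minimal sketch in Python. It is not Spark's code. It assumes the placeholder is used the way the original word2vec C tool builds its Huffman tree: leaf counts are sorted in descending order, the slots for not-yet-built internal nodes are pre-filled with the placeholder, and two pointers repeatedly pick the two smallest entries. The function name `merge_order` and its `sentinel` parameter are hypothetical.

```python
def merge_order(counts, sentinel=None):
    """Return the (min1, min2) index pairs merged at each Huffman step.

    `counts` must be sorted in descending order. Indices below len(counts)
    are leaves (words); higher indices are internal nodes. `sentinel` is the
    placeholder stored in internal-node slots that have not been built yet.
    If omitted, it is derived from the data, so it always exceeds any count.
    """
    n = len(counts)
    if sentinel is None:
        # Assumption: one more than the total is a safe "larger than anything".
        sentinel = sum(counts) + 1
    count = list(counts) + [sentinel] * (n + 1)

    pos1, pos2 = n - 1, n  # pos1 walks leaves downward, pos2 walks internal nodes upward
    merges = []
    for a in range(n - 1):
        picked = []
        for _ in range(2):
            # Take the smaller of the next leaf and the next internal node.
            if pos1 >= 0 and count[pos1] < count[pos2]:
                picked.append(pos1)
                pos1 -= 1
            else:
                picked.append(pos2)
                pos2 += 1
        count[n + a] = count[picked[0]] + count[picked[1]]
        merges.append((picked[0], picked[1]))
    return merges


big = [3_000_000_000, 2_000_000_000]  # two words, each seen more than 1e9 times

# Fixed placeholder: the unbuilt slots (indices 2 and 3) look "smaller" than
# the real words, so the words are never merged into the tree.
print(merge_order(big, sentinel=int(1e9)))

# Data-derived placeholder: the two leaves (indices 1 and 0) are merged.
print(merge_order(big))
```

With the fixed placeholder the first call merges two empty slots instead of the two words, which is the kind of breakage described above. Deriving the placeholder from the data is only one possible remedy. On the JVM the counts would presumably also need a type wide enough to hold such values.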