[ https://issues.apache.org/jira/browse/SPARK-15720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311647#comment-15311647 ]
yuhao yang commented on SPARK-15720:
------------------------------------

This can only happen when creating a Word2VecModel from pre-trained vectors, as training in MLlib's Word2Vec currently cannot reach that scale. The current upper limit is vocab * vectorSize < the maximum JVM array size (approximately Int.Max - 8, depending on the platform). Extending that scope would be a fundamental change to Word2Vec. [~rohangpatil] What are the target vocabSize and vectorLength for your application? I'm not sure whether word2vec at a larger scale is a popular requirement; I would appreciate it if you could provide some supporting examples.

> MLLIB Word2Vec loading large number of vectors in the model results in
> java.lang.NegativeArraySizeException
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15720
>                 URL: https://issues.apache.org/jira/browse/SPARK-15720
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.1
>       Environment: Linux
>          Reporter: Rohan G Patil
>
> Loading a large number of pre-trained vectors into Spark MLlib's Word2Vec
> model results in java.lang.NegativeArraySizeException.
>
> Code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L597
>
> Test with more than 16777215 vectors, each of size 128 or more.
> There is an integer overflow happening here. Should be an easy fix.
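For reference, a minimal sketch of the overflow described above, using only the numbers from the report (16777215 + 1 vectors of size 128). The demo object below is hypothetical illustration, not Spark code:

object Word2VecOverflowDemo {
  def main(args: Array[String]): Unit = {
    val numVectors = 16777216 // 2^24, one past the reported 16777215 threshold
    val vectorSize = 128      // 2^7

    // Int multiplication: 2^24 * 2^7 = 2^31, which wraps to Int.MinValue.
    val flatSize: Int = numVectors * vectorSize
    println(flatSize) // prints -2147483648

    // Allocating with the overflowed size is what throws the reported exception:
    // new Array[Float](flatSize) => java.lang.NegativeArraySizeException

    // Widening one operand to Long gives the true element count, which also
    // exceeds the JVM's practical array limit of roughly Int.MaxValue - 8.
    val needed: Long = numVectors.toLong * vectorSize
    println(needed) // prints 2147483648
  }
}

Checking the Long product against the array limit before allocating (and failing with a clear message) would at least turn the silent overflow into an actionable error, even though a single flat array would still cap the model size as described in the comment above.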