[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248262#comment-14248262 ]
Joseph Tang commented on SPARK-4846:
------------------------------------

Hi Sean and Joseph. You are right. I ran another test with 'lazy' and found I had misunderstood object serialization on the JVM: the 'lazy' modifier does not help, so I'll remove it later. What helped was probably the automatic increase of the partition number in my code.

Joseph K. Bradley's suggestion of an automatic minCount should work, and it would use less memory. One potential issue, though, is that in the worst case we may have to try many values before finding the best minCount if we adjust it step by step. A quick fix could be to increase the number of partitions automatically, but that lowers accuracy. Alternatively, if a Spark user sets one of minCount and numPartitions, we could adjust the other automatically; if the user sets neither, we adjust numPartitions by default. What do you think?

> When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError:
> Requested array size exceeds VM limit"
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4846
>                 URL: https://issues.apache.org/jira/browse/SPARK-4846
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.1.0
>        Environment: Use Word2Vec to process a corpus (sized 3.5 GB) with one partition.
> The corpus contains about 300 million words and its vocabulary size is about 10 million.
>            Reporter: Joseph Tang
>            Priority: Critical
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
>         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>         at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>         at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>         at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>         at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
>         at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
>         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>         at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
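The automatic-minCount idea discussed above could be sketched as follows. This is a minimal standalone illustration, not Spark code; `MinCountSketch`, `chooseMinCount`, and `maxVocabSize` are hypothetical names, and it assumes per-word frequencies have already been counted. Sorting the frequencies once and reading off the cutoff avoids stepping through candidate minCount values one by one (the worst case mentioned above):

```scala
// Hypothetical sketch: pick the smallest minCount that keeps the
// vocabulary size at or below a memory-motivated cap.
object MinCountSketch {
  def chooseMinCount(wordCounts: Map[String, Long], maxVocabSize: Int): Long = {
    // Sort frequencies in descending order, once.
    val freqsDesc = wordCounts.values.toSeq.sorted(Ordering[Long].reverse)
    if (freqsDesc.size <= maxVocabSize) 1L
    // Threshold just above the frequency of the first word that must be cut,
    // so only the top maxVocabSize frequencies (or fewer, with ties) survive.
    else freqsDesc(maxVocabSize) + 1
  }

  def main(args: Array[String]): Unit = {
    val counts = Map("a" -> 100L, "b" -> 50L, "c" -> 2L, "d" -> 1L)
    // With a cap of 2 words, only "a" and "b" survive, so minCount = 3.
    println(chooseMinCount(counts, 2))
  }
}
```

With ties at the cutoff frequency the resulting vocabulary can end up smaller than the cap, which errs on the safe side for memory.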