[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248262#comment-14248262 ]

Joseph Tang commented on SPARK-4846:
------------------------------------

Hi Sean and Joseph.

You are right. I ran another test with the 'lazy' modifier and found that I had 
misunderstood how object serialization works in the JVM: the 'lazy' does not 
help, so I'll remove it later. It was probably the automatic increase of the 
partition number in my code that helped.

Joseph K. Bradley's suggestion of choosing minCount automatically should work, 
and it would also use less memory. One potential issue is that, in the worst 
case, we may have to try many values before finding the best minCount if we 
adjust it step by step.
A quicker fix could be to automatically increase the partition number, but that 
comes at the cost of lower accuracy.
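
To make the worst case concrete, here is a rough sketch (not the MLlib code) of 
what a step-by-step search for minCount could look like. The corpus path, the 
vocabulary budget, and the search loop are all illustrative assumptions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits, needed on Spark 1.1

object MinCountSearch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MinCountSearch"))
    // Hypothetical corpus location; one word-frequency pass, cached and reused.
    val words = sc.textFile("hdfs:///corpus.txt").flatMap(_.split(" "))
    val freq = words.map(w => (w, 1L)).reduceByKey(_ + _).cache()

    val vocabBudget = 5000000L  // assumed upper bound on the vocabulary size
    var minCount = 5            // Word2Vec's hard-coded cutoff in Spark 1.1
    // Worst case: one full pass over freq for every candidate minCount.
    while (freq.filter(_._2 >= minCount).count() > vocabBudget) {
      minCount += 1
    }
    println(s"Smallest minCount that fits the budget: $minCount")
    sc.stop()
  }
}

Stepping minCount by one is the pathological case; searching over the sorted 
frequencies would reduce the number of passes, but the cost of the search is 
still the concern described above.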

Alternatively, if a Spark user sets one of minCount and numPartitions, we could 
automatically adjust the other; if the user sets neither, we simply auto-adjust 
numPartitions by default. What do you think?
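
To illustrate that rule, here is a small sketch; the object, the case class, 
and the heuristics for deriving one parameter from the other are made-up 
placeholders, not an existing API.

object Word2VecAutoTune {
  case class Word2VecMemoryConfig(minCount: Int, numPartitions: Int)

  def resolveConfig(userMinCount: Option[Int],
                    userNumPartitions: Option[Int],
                    vocabSize: Long): Word2VecMemoryConfig = {
    // Purely illustrative heuristics for the value we derive automatically.
    def partitionsFor(vocab: Long): Int = math.max(1, (vocab / 1000000L).toInt)
    def minCountFor(parts: Int): Int = math.max(5, 50 / parts)

    (userMinCount, userNumPartitions) match {
      case (Some(mc), Some(np)) => Word2VecMemoryConfig(mc, np)  // user fixed both
      case (Some(mc), None)     => Word2VecMemoryConfig(mc, partitionsFor(vocabSize))
      case (None, Some(np))     => Word2VecMemoryConfig(minCountFor(np), np)
      case (None, None)         => Word2VecMemoryConfig(5, partitionsFor(vocabSize))  // default: raise partitions only
    }
  }
}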




> When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: 
> Requested array size exceeds VM limit"
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4846
>                 URL: https://issues.apache.org/jira/browse/SPARK-4846
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.1.0
>         Environment: Use Word2Vec to process a corpus (about 3.5 GB) with one 
> partition.
> The corpus contains about 300 million words and its vocabulary size is about 
> 10 million.
>            Reporter: Joseph Tang
>            Priority: Critical
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
>     at java.util.Arrays.copyOf(Arrays.java:2271)
>     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>     at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>     at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
>     at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
>     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
>     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>     at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>     at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>     at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>     at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
>     at 
> org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
>     at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>     at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)


