Please decrease spark.serializer.objectStreamReset for your queries. The default value is 100.
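In case it helps, here is a minimal PySpark sketch of lowering that setting when building the context (the app name and the value 10 are only illustrative, not a tuned recommendation):

    from pyspark import SparkConf, SparkContext

    # Ask the Java serializer to reset its object cache more often than the
    # default of every 100 objects; the exact value here is only illustrative.
    conf = (SparkConf()
            .setAppName("objectstreamreset-example")
            .set("spark.serializer.objectStreamReset", "10"))
    sc = SparkContext(conf=conf)

The same setting can also be passed at submit time, e.g. spark-submit --conf spark.serializer.objectStreamReset=10, if you would rather not change the application code.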
I logged SPARK-10787 for improvement. Cheers

On Wed, Sep 23, 2015 at 6:59 PM, jluan <jaylu...@gmail.com> wrote:
> I have been stuck on this problem for the last few days:
>
> I am attempting to run random forest from MLLIB. It gets through most of it,
> but breaks when doing a mapPartitions operation. The following stack trace
> is shown:
>
> : An error occurred while calling o94.trainRandomForestModel.
> : java.lang.OutOfMemoryError
>         at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
>         at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
>         at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>         at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>         at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>         at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>         at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
>         at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625)
>         at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
>         at org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291)
>         at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:742)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>         at py4j.Gateway.invoke(Gateway.java:259)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>         at java.lang.Thread.run(Thread.java:745)
>
> It seems to me that it's trying to serialize the mapPartitions closure, but
> runs out of space doing so. However, I don't understand how it could run out
> of space when I gave the driver ~190GB for a file that's 45MB.
>
> I have a cluster set up on AWS such that my master is an r3.8xlarge along
> with two r3.4xlarge workers.
> I have the following configurations:
>
> spark version: 1.5.0
> -----------------------------------
> spark.executor.memory         32000m
> spark.driver.memory           230000m
> spark.driver.cores            10
> spark.executor.cores          5
> spark.executor.instances      17
> spark.driver.maxResultSize    0
> spark.storage.safetyFraction  1
> spark.storage.memoryFraction  0.9
> spark.storage.shuffleFraction 0.05
> spark.default.parallelism     128
>
> The master machine has approximately 240GB of RAM and each worker has about
> 120GB of RAM.
>
> I load in a relatively tiny RDD of MLLIB LabeledPoint objects, each holding
> a sparse vector. This RDD has a total size of roughly 45MB. Each sparse
> vector has a total length of ~15 million, while only about 3000 or so
> entries are non-zero.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ClosureCleaner-or-java-serializer-OOM-when-trying-to-grow-tp24796.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
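For anyone trying to reproduce this, below is a rough PySpark sketch of the kind of job described above. The synthetic data, feature counts, and training parameters are my assumptions, not the original poster's actual code:

    import random
    from pyspark import SparkContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest

    sc = SparkContext(appName="sparse-randomforest-sketch")

    NUM_FEATURES = 15000000  # ~15 million total features, as described above
    NNZ = 3000               # roughly 3000 non-zero entries per vector

    def synthetic_point(seed):
        # Stand-in for the real data: with only ~3000 non-zeros out of ~15M
        # slots, each SparseVector stays tiny, which is why the whole RDD
        # comes to only ~45MB.
        rng = random.Random(seed)
        idx = set()
        while len(idx) < NNZ:
            idx.add(rng.randrange(NUM_FEATURES))
        indices = sorted(idx)
        values = [rng.random() for _ in indices]
        return LabeledPoint(seed % 2, Vectors.sparse(NUM_FEATURES, indices, values))

    data = sc.parallelize(range(1000), 128).map(synthetic_point)

    model = RandomForest.trainClassifier(
        data,
        numClasses=2,                  # assumed binary classification
        categoricalFeaturesInfo={},
        numTrees=100,                  # illustrative; the post does not give this
        featureSubsetStrategy="auto",
        impurity="gini",
        maxDepth=4,
        maxBins=32)

Note that, per the stack trace, the failure happens on the driver while the mapPartitions closure is being serialized (ClosureCleaner.ensureSerializable), before any tasks run, so it is not a matter of executor memory.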