Splitting the Maps up into separate objects did not help. However, I was able to work around the problem by reimplementing the lookups with RDD joins; a rough sketch follows.
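Roughly, the change looked like this (a minimal sketch with hypothetical names and toy data; the real job in abc.scala is not shown in this thread):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the join-based workaround. Instead of capturing a large
// driver-side Map in the closure passed to rdd.map -- which the
// ClosureCleaner then tries to serialize, per the stack trace below --
// ship the lookup data as an RDD and join on the key.
object JoinWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("join-workaround").setMaster("local[30]"))

    // Hypothetical inputs: records keyed by String, plus a lookup table
    // that previously lived in a plain Map[String, Int] on the driver.
    val records = sc.parallelize(Seq(("a", 1.0), ("b", 2.0)))
    val lookup  = sc.parallelize(Seq(("a", 10), ("b", 20)))

    // Before (OOM during closure serialization):
    //   records.map { case (k, v) => (k, v * bigMap(k)) }
    // After: the lookup data is partitioned and shuffled instead of
    // being serialized whole into every task closure.
    val joined = records.join(lookup).mapValues { case (v, w) => v * w }

    joined.collect().foreach(println)
    sc.stop()
  }
}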
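For comparison, the broadcast approach asked about in the quoted thread below would look something like this (again just a sketch, reusing sc and records from above). The Map is serialized once and cached on each executor, and the task closure captures only the small Broadcast handle rather than the Map itself:

// Sketch of the broadcast alternative (same hypothetical names as above).
val bigMap = Map("a" -> 10, "b" -> 20)
val bigMapBc = sc.broadcast(bigMap)

val result = records.map { case (k, v) =>
  // Look up through the broadcast handle inside the task.
  (k, v * bigMapBc.value.getOrElse(k, 0))
}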
On Aug 18, 2016 5:16 PM, "Arun Luthra" <arun.lut...@gmail.com> wrote:

> This might be caused by a few large Map objects that Spark is trying to
> serialize. These are not broadcast variables or anything; they're just
> regular objects.
>
> Would it help if I further indexed these maps into a two-level Map, i.e.
> Map[String, Map[String, Int]]? Or would this still count against me?
>
> What if I manually split them up into numerous Map variables?
>
> On Mon, Aug 15, 2016 at 2:12 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
>
>> I got this OOM error in Spark local mode. The error seems to have occurred
>> at the start of a stage (all of the stages on the UI showed as complete;
>> there were more stages to do, but they had not shown up on the UI yet).
>>
>> There appears to be ~100G of free memory at the time of the error.
>>
>> Spark 2.0.0
>> 200G driver memory
>> local[30]
>> 8 /mntX/tmp directories for spark.local.dir
>> "spark.sql.shuffle.partitions", "500"
>> "spark.driver.maxResultSize", "500"
>> "spark.default.parallelism", "1000"
>>
>> The line number in the error points at an RDD map operation where some
>> potentially large Map objects are going to be accessed by each record.
>> Does it matter whether they are broadcast variables or not? I imagine not,
>> because in local mode they should be available in memory to every
>> executor/core.
>>
>> Possibly related:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ClosureCleaner-or-java-serializer-OOM-when-trying-to-grow-td24796.html
>>
>> Exception in thread "main" java.lang.OutOfMemoryError
>>   at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
>>   at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
>>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
>>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>>   at org.apache.spark.rdd.RDD.map(RDD.scala:365)
>>   at abc.Abc$.main(abc.scala:395)
>>   at abc.Abc.main(abc.scala)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:498)
>>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
>>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
>>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)