Also for the record, turning on Kryo did not help.
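(For context: "turning on Kryo" presumably means the standard spark.serializer setting, roughly as in the sketch below; the exact configuration used in the job is not shown in the thread. The ClosureCleaner path in the stack trace further down goes through JavaSerializer, which Spark 2.x uses for closure serialization regardless of spark.serializer, which would explain why Kryo made no difference here.)

    import org.apache.spark.{SparkConf, SparkContext}

    // Assumed configuration only; mirrors the local[30] setup described below in the thread.
    object KryoConfSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("Abc")
          .setMaster("local[30]")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        val sc = new SparkContext(conf)
        // ... job code would go here ...
        sc.stop()
      }
    }
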
On Tue, Aug 23, 2016 at 12:58 PM, Arun Luthra <arun.lut...@gmail.com> wrote:

> Splitting up the Maps into separate objects did not help.
>
> However, I was able to work around the problem by reimplementing it with
> RDD joins.
>
> On Aug 18, 2016 5:16 PM, "Arun Luthra" <arun.lut...@gmail.com> wrote:
>
>> This might be caused by a few large Map objects that Spark is trying to
>> serialize. These are not broadcast variables or anything; they're just
>> regular objects.
>>
>> Would it help if I further indexed these maps into a two-level Map, i.e.
>> Map[String, Map[String, Int]]? Or would this still count against me?
>>
>> What if I manually split them up into numerous Map variables?
>>
>> On Mon, Aug 15, 2016 at 2:12 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
>>
>>> I got this OOM error in Spark local mode. The error seems to have
>>> occurred at the start of a stage (all of the stages shown on the UI were
>>> complete; there were more stages to do, but they had not appeared on the
>>> UI yet).
>>>
>>> There appeared to be ~100G of free memory at the time of the error.
>>>
>>> Spark 2.0.0
>>> 200G driver memory
>>> local[30]
>>> 8 /mntX/tmp directories for spark.local.dir
>>> "spark.sql.shuffle.partitions", "500"
>>> "spark.driver.maxResultSize", "500"
>>> "spark.default.parallelism", "1000"
>>>
>>> The line number in the error points to an RDD map operation where some
>>> potentially large Map objects are accessed by each record. Does it
>>> matter whether they are broadcast variables or not? I imagine not,
>>> because in local mode they should be available in memory to every
>>> executor/core.
>>>
>>> Possibly related:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ClosureCleaner-or-java-serializer-OOM-when-trying-to-grow-td24796.html
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError
>>>   at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>>>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>>>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>>>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>>>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>>>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>>>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>>>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>>>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>>>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
>>>   at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>>>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>>>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
>>>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
>>>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
>>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>>>   at org.apache.spark.rdd.RDD.map(RDD.scala:365)
>>>   at abc.Abc$.main(abc.scala:395)
>>>   at abc.Abc.main(abc.scala)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>   at java.lang.reflect.Method.invoke(Method.java:498)
>>>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
>>>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
>>>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>>>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>>>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
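
For anyone who finds this thread later: the trace above shows SparkContext.clean() Java-serializing the closure passed to RDD.map, so any large Map captured by that closure gets written into the growing ByteArrayOutputStream. Below is a minimal sketch, with hypothetical names, of the captured-Map shape and of the RDD-join rewrite described above as the eventual workaround; the actual code at abc.scala:395 is not shown in the thread. A broadcast-variable variant is included as a comment, since the thread raises that question.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddJoinWorkaroundSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("Abc").setMaster("local[30]"))

        // Hypothetical stand-ins for the real data.
        val records   = sc.parallelize(Seq(("k1", "row1"), ("k2", "row2")))
        val bigLookup = Map("k1" -> 1, "k2" -> 2)   // imagine this is very large

        // Problematic shape: bigLookup is captured by the closure, so
        // SparkContext.clean() -> ClosureCleaner -> JavaSerializer tries to
        // serialize the whole Map into a byte array (the OOM in the trace).
        // val out = records.map { case (k, v) => (k, v, bigLookup.getOrElse(k, 0)) }

        // Workaround described in the thread: express the lookup as an RDD join,
        // so the Map contents travel as ordinary RDD data instead of closure state.
        val lookupRdd = sc.parallelize(bigLookup.toSeq)   // RDD[(String, Int)]
        val joined    = records.join(lookupRdd)           // RDD[(String, (String, Int))]
        joined.take(10).foreach(println)

        // Alternative the thread asks about: a broadcast variable; only the small
        // broadcast handle is captured, and bcast.value is read where the task runs.
        // val bcast = sc.broadcast(bigLookup)
        // val out2  = records.map { case (k, v) => (k, v, bcast.value.getOrElse(k, 0)) }

        sc.stop()
      }
    }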