Splitting the Maps up into separate objects did not help. However, I was able to work around the problem by reimplementing the lookups with RDD joins; a rough sketch follows.
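Roughly, the change looked like this (a minimal sketch with hypothetical names and toy data; the real job in abc.scala is not shown in this thread):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the join-based workaround. Instead of capturing a large
// driver-side Map in the closure passed to rdd.map -- which the
// ClosureCleaner then tries to serialize, per the stack trace below --
// ship the lookup data as an RDD and join on the key.
object JoinWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("join-workaround").setMaster("local[30]"))

    // Hypothetical inputs: records keyed by String, plus a lookup table
    // that previously lived in a plain Map[String, Int] on the driver.
    val records = sc.parallelize(Seq(("a", 1.0), ("b", 2.0)))
    val lookup  = sc.parallelize(Seq(("a", 10), ("b", 20)))

    // Before (OOM during closure serialization):
    //   records.map { case (k, v) => (k, v * bigMap(k)) }
    // After: the lookup data is partitioned and shuffled instead of
    // being serialized whole into every task closure.
    val joined = records.join(lookup).mapValues { case (v, w) => v * w }

    joined.collect().foreach(println)
    sc.stop()
  }
}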
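For comparison, the broadcast approach asked about in the quoted thread below would look something like this (again just a sketch, reusing sc and records from above). The Map is serialized once and cached on each executor, and the task closure captures only the small Broadcast handle rather than the Map itself:

// Sketch of the broadcast alternative (same hypothetical names as above).
val bigMap = Map("a" -> 10, "b" -> 20)
val bigMapBc = sc.broadcast(bigMap)

val result = records.map { case (k, v) =>
  // Look up through the broadcast handle inside the task.
  (k, v * bigMapBc.value.getOrElse(k, 0))
}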
On Aug 18, 2016 5:16 PM, "Arun Luthra" <arun.lut...@gmail.com> wrote:

> This might be caused by a few large Map objects that Spark is trying to
> serialize. These are not broadcast variables or anything; they're just
> regular objects.
>
> Would it help if I further indexed these maps into a two-level Map, i.e.
> Map[String, Map[String, Int]]? Or would this still count against me?
>
> What if I manually split them up into numerous Map variables?
>
> On Mon, Aug 15, 2016 at 2:12 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
>
>> I got this OOM error in Spark local mode. The error seems to have occurred
>> at the start of a stage (all of the stages on the UI showed as complete;
>> there were more stages to do, but they had not shown up on the UI yet).
>>
>> There appears to be ~100G of free memory at the time of the error.
>>
>> Spark 2.0.0
>> 200G driver memory
>> local[30]
>> 8 /mntX/tmp directories for spark.local.dir
>> "spark.sql.shuffle.partitions", "500"
>> "spark.driver.maxResultSize", "500"
>> "spark.default.parallelism", "1000"
>>
>> The line number in the error points at an RDD map operation where some
>> potentially large Map objects are going to be accessed by each record.
>> Does it matter whether they are broadcast variables or not? I imagine not,
>> because in local mode they should be available in memory to every
>> executor/core.
>>
>> Possibly related:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ClosureCleaner-or-java-serializer-OOM-when-trying-to-grow-td24796.html
>>
>> Exception in thread "main" java.lang.OutOfMemoryError
>>   at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
>>   at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
>>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
>>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>>   at org.apache.spark.rdd.RDD.map(RDD.scala:365)
>>   at abc.Abc$.main(abc.scala:395)
>>   at abc.Abc.main(abc.scala)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:498)
>>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
>>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
>>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)