Hi,

We have an application that submits several thousand jobs within the same SparkContext, using a thread pool to run about 50 of them in parallel. We're running on YARN with Spark 1.4.1, and we're seeing a problem where our driver is killed by YARN for running beyond physical memory limits (with no Java OOM stack trace, though).
Plugging in YourKit, I can see that the application is in fact running low on heap. The suspicious thing is that the old generation is filling up with dead objects, which don't seem to be fully removed during the stop-the-world sweeps we see happening later in the run. With allocation tracking enabled, I can see that maybe 80%+ of that dead heap space consists of byte arrays, which appear to contain snappy-compressed Hadoop configuration data. Many of them are 4 MB each, others hundreds of KB. Allocation tracking shows they were originally allocated in calls to sparkContext.hadoopFile() (from AvroRelation in spark-avro). It seems this data was broadcast to the executors as a result of that call? I'm not clear on the implementation details, but I can imagine that might be necessary.

The application is essentially a batch job that takes many Avro files and merges them into larger Parquet files. It builds a DataFrame from the Avro files, then for each DataFrame starts a job using .coalesce(N).write().parquet() on a fixed-size thread pool. For each of those calls, another chunk of heap space disappears into one of these byte arrays and is never reclaimed.

I understand that broadcast variables remain in memory on the driver in their serialized form, and that at least appears to be consistent with what I'm seeing here. The question is, what can we do about this? Is there a way to reclaim this memory? Should those arrays be GC'ed when jobs finish?

Any guidance greatly appreciated.

Many thanks,
James
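P.S. For concreteness, the submission pattern is roughly the following. This is a minimal plain-Java sketch of the thread-pool structure only; the real per-job body (reading the Avro files via spark-avro and calling .coalesce(N).write().parquet()) is replaced by a placeholder, and the names, pool size, and job count here are illustrative, not our exact code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class MergeJobs {
    // Placeholder for the real per-DataFrame work, which in our app is
    // roughly: build a DataFrame from Avro input (this is where
    // sparkContext.hadoopFile() is hit via AvroRelation), then
    // df.coalesce(N).write().parquet(outPath). Here we just count calls.
    static void mergeAvroToParquet(String inputPath, AtomicInteger done) {
        done.incrementAndGet();
    }

    public static void main(String[] args) throws Exception {
        // Fixed-size pool: ~50 jobs running concurrently on one SparkContext.
        ExecutorService pool = Executors.newFixedThreadPool(50);
        AtomicInteger done = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();

        // Several thousand jobs submitted over the application's lifetime.
        for (int i = 0; i < 5000; i++) {
            final String path = "input-" + i + ".avro"; // illustrative path
            futures.add(pool.submit(() -> mergeAvroToParquet(path, done)));
        }

        for (Future<?> f : futures) {
            f.get(); // wait for every job to finish
        }
        pool.shutdown();
        System.out.println("completed=" + done.get());
    }
}
```

Each of those per-job calls is where we observe another unreclaimed byte array appearing on the driver heap.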