Hi,

We have an application that submits several thousand jobs within the same
SparkContext, using a thread pool to run about 50 of them in parallel. We're
running on YARN with Spark 1.4.1 and seeing a problem where our driver is
killed by YARN for running beyond physical memory limits (though there is
no Java OOM stack trace).
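
To make the setup concrete, the submission pattern looks roughly like the
sketch below (class, path and method names here are illustrative, not our
real code): one SparkContext shared by a fixed-size thread pool, with each
pool task triggering one Spark job.

import java.util.concurrent.{Executors, TimeUnit}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MergeDriver {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-to-parquet-merge"))
    val sqlContext = new SQLContext(sc)

    // Illustrative input list; in reality it is several thousand directories.
    val inputDirs: Seq[String] = Seq("/data/in/a", "/data/in/b")

    val pool = Executors.newFixedThreadPool(50)  // ~50 jobs in flight at once
    inputDirs.foreach { dir =>
      pool.submit(new Runnable {
        override def run(): Unit = mergeOneDir(sqlContext, dir)
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.DAYS)
    sc.stop()
  }

  // One read-coalesce-write job per directory; the body is sketched later in this mail.
  def mergeOneDir(sqlContext: SQLContext, dir: String): Unit = ()
}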

Plugging in YourKit, I can see that the application is in fact running low
on heap. The suspicious thing we're seeing is that the old generation is
filling up with dead objects, which don't seem to be fully removed during
the stop-the-world sweeps we see happening later in the application's run.

With allocation tracking enabled, I can see that maybe 80%+ of that dead
heap space consists of byte arrays, which appear to contain
snappy-compressed Hadoop configuration data. Many of them are 4 MB each,
others hundreds of KB. The allocation tracking reveals that they were
originally allocated in calls to sparkContext.hadoopFile() (from
AvroRelation in spark-avro). It seems that this data was broadcast to the
executors as a result of that call? I'm not clear on the implementation
details, but I can imagine that might be necessary?

This application is essentially a batch job that takes many Avro files and
merges them into larger Parquet files. What it does is build a DataFrame
from each set of Avro files, then, for each DataFrame, start a job that
runs .coalesce(N).write().parquet() on a fixed-size thread pool.
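
For reference, each of those per-DataFrame jobs is roughly the following (a
sketch with illustrative names, assuming the spark-avro reader we use; the
.load() is the call that ends up in sparkContext.hadoopFile() via
AvroRelation):

import org.apache.spark.sql.SQLContext

object MergeJob {
  // Parameter names are illustrative.
  def mergeOneDir(sqlContext: SQLContext, inputDir: String,
                  outputDir: String, numOutputFiles: Int): Unit = {
    val df = sqlContext.read
      .format("com.databricks.spark.avro")  // resolves to AvroRelation, which calls sc.hadoopFile(...)
      .load(inputDir)

    df.coalesce(numOutputFiles)             // collapse many small Avro inputs into a few larger files
      .write
      .parquet(outputDir)
  }
}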

It seems that with each of those calls, another chunk of heap space
disappears into one of these byte arrays and is never reclaimed. I
understand that broadcast variables remain in memory on the driver in
their serialized form, and that at least appears to be consistent with what
I'm seeing here. The question is: what can we do about this? Is there a way
to reclaim this memory? Should those arrays be GC'ed when jobs finish?
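
To make that question concrete: for a broadcast we create ourselves, I know
the memory can be released explicitly, e.g. in spark-shell (where sc is
provided) something like the sketch below. But the configuration broadcasts
created inside hadoopFile() don't seem to be exposed to user code, so I
can't see how to do the equivalent for them.

// Explicitly created broadcast: the driver keeps a serialized copy until it is released.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
sc.parallelize(Seq("a", "b")).map(k => lookup.value(k)).count()

lookup.unpersist()   // drop the cached copies on the executors
lookup.destroy()     // also release driver-side data; the broadcast is unusable afterwards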

Any guidance greatly appreciated.


Many thanks,

James.
