Hi Spark Users,

I am running some Spark jobs that run every hour. After running for 12 hours,
the master gets killed with the following exception:

*java.lang.OutOfMemoryError: GC overhead limit exceeded*

It looks like there is a memory issue in the Spark master.

I have noticed the same kind of issue with the Spark history server.

In my job I have to monitor whether each job completed successfully, and for
that I hit the history server with curl to get the job status. But since the
number of applications grew past about 80, the history server has started
responding with a long delay; it can take more than 5 minutes to return the
status of a job.
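
For reference, the status check is essentially a call like the one below
(assuming the history server's REST API described in the Spark 1.4 monitoring
docs, on its default port 18080; the host and app id are placeholders):

  curl http://<history-server-host>:18080/api/v1/applications/<app-id>/jobs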

I am running Spark 1.4.1 in standalone mode on a 5-machine cluster.

Kindly suggest a solution for this memory issue; it is a blocker for me.

Thanks,
Saurav Sinha

On Fri, Sep 25, 2015 at 5:01 PM, James Aley <james.a...@swiftkey.com> wrote:

> Hi,
>
> We have an application that submits several thousand jobs within the same
> SparkContext, using a thread pool to run about 50 in parallel. We're
> running on YARN using Spark 1.4.1 and seeing a problem where our driver is
> killed by YARN due to running beyond physical memory limits (no Java OOM
> stack trace though).
>
> Plugging in YourKit, I can see that in fact the application is running low
> on heap. The suspicious thing we're seeing is that the old generation is
> filling up with dead objects, which don't seem to be fully removed during
> the stop-the-world sweeps we see happening later in the running of the
> application.
>
> With allocation tracking enabled, I can see that maybe 80%+ of that dead
> heap space consists of byte arrays, which appear to contain some
> snappy-compressed Hadoop configuration data. Many of them are 4MB each,
> others hundreds of KB. The allocation tracking reveals that they were
> originally allocated in calls to sparkContext.hadoopFile() (from
> AvroRelation in spark-avro). It seems that this data was broadcast to the
> executors as a result of that call? I'm not clear on the implementation
> details, but I can imagine that might be necessary?
>
> This application is essentially a batch job that takes many Avro files and
> merges them into larger Parquet files. It builds a DataFrame for each set of
> Avro files, then for each DataFrame starts a job using
> .coalesce(N).write().parquet() on a fixed-size thread pool.
>
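> For concreteness, the driver logic is roughly equivalent to the simplified
> sketch below (the object name, paths and the coalesce factor are
> placeholders; the pool of 50 threads matches the ~50 we actually run):
>
>   import java.util.concurrent.Executors
>   import scala.concurrent.duration.Duration
>   import scala.concurrent.{Await, ExecutionContext, Future}
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SQLContext
>
>   object AvroToParquetMerge {
>     def main(args: Array[String]): Unit = {
>       val sc = new SparkContext(new SparkConf().setAppName("avro-to-parquet"))
>       val sqlContext = new SQLContext(sc)
>
>       // Fixed-size pool, so roughly 50 write jobs run in parallel.
>       implicit val ec = ExecutionContext.fromExecutorService(
>         Executors.newFixedThreadPool(50))
>
>       val jobs = args.toSeq.map { inputDir =>
>         Future {
>           // spark-avro's AvroRelation calls sparkContext.hadoopFile() in here,
>           // which is where the suspicious byte arrays are allocated.
>           val df = sqlContext.read.format("com.databricks.spark.avro").load(inputDir)
>           df.coalesce(16).write.parquet(inputDir + "-parquet") // 16 = example N
>         }
>       }
>       jobs.foreach(Await.ready(_, Duration.Inf))
>       sc.stop()
>     }
>   }
>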
> It seems that for each of those calls, another chunk of heap space
> disappears into one of these byte arrays and is never reclaimed. I understand
> that broadcast variables remain in memory on the driver application in
> their serialized form, and that at least appears to be consistent with what
> I'm seeing here. Question is, what can we do about this? Is there a way to
> reclaim this memory? Should those arrays be GC'ed when jobs finish?
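>
> (To make "reclaim" concrete: for a broadcast we create ourselves, given our
> SparkContext sc, we can at least drop the executor-side copies explicitly, as
> in the sketch below with a stand-in payload. But these arrays come from the
> hadoopFile() call internally, so we never get a Broadcast handle to do even
> that, and I'm not sure what, if anything, frees the driver-side serialized
> copy.)
>
>   // Sketch only: the pattern for an explicitly created broadcast variable.
>   val standIn: Array[Byte] = new Array[Byte](4 * 1024 * 1024) // ~4MB stand-in blob
>   val bc = sc.broadcast(standIn)
>   try {
>     // ... run jobs whose tasks read bc.value on the executors ...
>   } finally {
>     // Drops the cached copies on the executors; the serialized copy kept by
>     // the driver is the part that appears to linger in our heap dumps.
>     bc.unpersist(blocking = true)
>   }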
>
> Any guidance greatly appreciated.
>
>
> Many thanks,
>
> James.
>



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062
