Hi Spark Users,

I am running some Spark jobs that run every hour. After running for 12 hours, the master gets killed with the following exception:

*java.lang.OutOfMemoryError: GC overhead limit exceeded*

It looks like there is some memory issue in the Spark master. I have noticed the same kind of issue with the Spark history server. In my job I have to monitor whether each application completed successfully, and for that I hit the history server with curl to get the status; but once the number of applications has grown past about 80, the history server starts responding with a delay, sometimes taking more than 5 minutes to return the status of a job.
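The checks are plain curl calls against the history server, roughly along these lines (a simplified sketch: the /api/v1 paths are from the Spark 1.4 monitoring REST API, and the host, port and application ID are placeholders):

    # list the applications the history server knows about
    curl http://<history-server-host>:18080/api/v1/applications

    # details for a single application (the JSON includes a completed flag per attempt)
    curl http://<history-server-host>:18080/api/v1/applications/<app-id>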
I am running Spark 1.4.1 in standalone mode on a 5-machine cluster. Kindly suggest a solution for this memory issue; it is a blocker for us.

Thanks,
Saurav Sinha

On Fri, Sep 25, 2015 at 5:01 PM, James Aley <james.a...@swiftkey.com> wrote:

> Hi,
>
> We have an application that submits several thousand jobs within the same
> SparkContext, using a thread pool to run about 50 in parallel. We're
> running on YARN using Spark 1.4.1 and seeing a problem where our driver is
> killed by YARN due to running beyond physical memory limits (no Java OOM
> stack trace, though).
>
> Plugging in YourKit, I can see that the application is in fact running low
> on heap. The suspicious thing we're seeing is that the old generation is
> filling up with dead objects, which don't seem to be fully removed during
> the stop-the-world sweeps we see happening later in the running of the
> application.
>
> With allocation tracking enabled, I can see that maybe 80%+ of that dead
> heap space consists of byte arrays, which appear to contain some
> snappy-compressed Hadoop configuration data. Many of them are 4MB each,
> others hundreds of KB. The allocation tracking reveals that they were
> originally allocated in calls to sparkContext.hadoopFile() (from
> AvroRelation in spark-avro). It seems that this data was broadcast to the
> executors as a result of that call? I'm not clear on the implementation
> details, but I can imagine that might be necessary?
>
> This application is essentially a batch job that takes many Avro files and
> merges them into larger Parquet files. It builds DataFrames of Avro files,
> then for each DataFrame starts a job using .coalesce(N).write().parquet()
> on a fixed-size thread pool.
>
> It seems that for each of those calls, another chunk of heap space
> disappears into one of these byte arrays and is never reclaimed. I
> understand that broadcast variables remain in memory on the driver
> application in their serialized form, and that at least appears to be
> consistent with what I'm seeing here. The question is: what can we do
> about this? Is there a way to reclaim this memory? Should those arrays be
> GC'ed when jobs finish?
>
> Any guidance greatly appreciated.
>
>
> Many thanks,
>
> James.

--
Thanks and Regards,
Saurav Sinha
Contact: 9742879062