Just wondering, is it possible that the memory usage keeps going up because of the web UI content?


Yong


________________________________
From: Alexis Seigneurin <aseigneu...@ipponusa.com>
Sent: Friday, November 18, 2016 10:17 AM
To: Nathan Lande
Cc: Keith Bourgoin; Irina Truong; u...@spark.incubator.apache.org
Subject: Re: Long-running job OOMs driver process

+1 for using S3A.

It would also depend on what format you're using. I agree with Steve that Parquet, for instance, is a good option. If you're using plain text files, some people use gzipped files, but those cannot be split, which puts a lot of pressure on the driver. It doesn't look like this is the issue you're running into, though, because it would not cause a progressive slowdown, but please provide as much detail as possible about your app.
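
As a rough illustration of that advice (not your actual job: the bucket, paths, and parsing step below are invented), reading plain text through the s3a:// connector and writing Parquet in PySpark looks roughly like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-to-parquet").getOrCreate()

# Hypothetical s3a:// locations; s3a is the Hadoop S3A connector mentioned above.
lines = spark.read.text("s3a://my-bucket/incoming/*.txt")

# Whatever parsing the real job does would go here; we just keep the raw line.
df = lines.withColumnRenamed("value", "raw_line")

# Parquet is splittable and columnar, unlike a single gzipped text file.
df.write.mode("append").parquet("s3a://my-bucket/parquet/")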

The cache could be an issue, but the OOM would come from an executor, not from the driver.

From what you're saying, Keith, it indeed looks like some memory is not being freed. Seeing the code would help. If you can, also send all the logs (with Spark logging at least at the INFO level).
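
(If it helps for collecting those logs, one way to make sure the driver logs at INFO, assuming you don't want to edit conf/log4j.properties on the cluster, is shown below.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Sets the log4j root logger level for this application; adjusting
# conf/log4j.properties on the cluster is the more usual approach.
spark.sparkContext.setLogLevel("INFO")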

Alexis

On Fri, Nov 18, 2016 at 10:08 AM, Nathan Lande <nathanla...@gmail.com> wrote:

+1 to not threading.

What does your load look like? If you are loading many files and caching them in N RDDs rather than 1 RDD, this could be an issue.

If the above two things don't fix your OOM issue, then without knowing anything else about your job, I would focus on your caching strategy as a potential culprit. Try running without any caching to isolate the issue; a bad caching strategy is the source of OOM issues for me most of the time.
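
To make that concrete, here is a minimal sketch of the single-DataFrame approach (the paths and the processing are made up, and this is only one possible caching strategy): load all the files into one DataFrame, cache it once, and unpersist it when you're done, instead of caching one object per file.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

paths = ["s3a://my-bucket/in/a.txt", "s3a://my-bucket/in/b.txt"]  # invented paths

# One DataFrame for all the files instead of N separately cached ones.
df = spark.read.text(paths)
df.cache()

df.count()                                # first action materializes the cache
df.write.parquet("s3a://my-bucket/out/")  # reuses the cached data

df.unpersist()                            # release the cached blocks when done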

On Nov 18, 2016 6:31 AM, "Keith Bourgoin" <ke...@parsely.com> wrote:
Hi Alexis,

Thanks for the response. I've been working with Irina on trying to sort this 
issue out.

We thread the file processing to amortize the cost of things like getting files from S3. It's a pattern we've seen recommended in many places, but I don't have any of those links handy. The problem isn't the threading, per se, but clearly some sort of memory leak in the driver itself. Each file is a self-contained unit of work, so once it's done, all memory related to it should be freed. Nothing in the script itself grows over time, so if it can process 10 files concurrently, it should be able to run like that forever.

I've hit this same issue working on another Spark app that wasn't threaded but produced tens of thousands of jobs. Eventually, the Spark UI would get slow, then unresponsive, and the application would then be killed due to OOM.

I'll try to cook up some examples of this today, threaded and not. We were hoping that someone had seen this before and it rang a bell. Maybe there's a setting to clean up info from old jobs that we can adjust.
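
(Guessing at what such a setting might look like: the driver does retain UI/status data for completed jobs and stages, and the amount kept can be lowered. The property names below are real Spark settings, but the values are arbitrary examples and I haven't verified that this fixes the leak described here.)

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set("spark.ui.retainedJobs", "200")             # default is 1000
        .set("spark.ui.retainedStages", "200")           # default is 1000
        .set("spark.sql.ui.retainedExecutions", "200"))  # default is 1000

spark = SparkSession.builder.config(conf=conf).getOrCreate()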

Cheers,

Keith.

On Thu, Nov 17, 2016 at 9:50 PM Alexis Seigneurin <aseigneu...@ipponusa.com> wrote:
Hi Irina,

I would question the use of multiple threads in your application. Since Spark 
is going to run the processing of each DataFrame on all the cores of your 
cluster, the processes will be competing for resources. In fact, they would not 
only compete for CPU cores but also for memory.

Spark is designed to run your processes in sequence, with each process executed in a distributed manner (multiple threads on multiple instances). I would suggest following this principle.
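
A minimal sketch of that sequential approach (the file list and the per-file work are invented for illustration): process one file at a time and let Spark distribute each job across the cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sequential-files").getOrCreate()

input_paths = ["s3a://my-bucket/in/file-001.txt",
               "s3a://my-bucket/in/file-002.txt"]   # hypothetical paths

for path in input_paths:
    # Each iteration is one Spark job; the work inside it is still spread
    # across all executors, so no driver-side threads are needed.
    df = spark.read.text(path)
    df.write.mode("append").parquet("s3a://my-bucket/out/")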

Feel free to share the code if you can. It's always helpful so that we can give better advice.

Alexis

On Thu, Nov 17, 2016 at 8:51 PM, Irina Truong <ir...@parsely.com> wrote:

We have an application that reads text files, converts them to DataFrames, and saves them in Parquet format. The application runs fine when processing a few files, but several thousand are produced every day. When running the job for all files, spark-submit gets killed with an OOM:


#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 27226"...


The job is written in Python. We're running it on Amazon EMR 5.0 (Spark 2.0.0) with spark-submit. We're using a cluster with one c3.2xlarge master instance (8 cores and 15 GB of RAM) and 3 c3.4xlarge core instances (16 cores and 30 GB of RAM each). The Spark config settings are as follows:


('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
('spark.executors.instances', '3'),
('spark.yarn.executor.memoryOverhead', '9g'),
('spark.executor.cores', '15'),
('spark.executor.memory', '12g'),
('spark.scheduler.mode', 'FIFO'),
('spark.cleaner.ttl', '1800'),


The job processes each file in a thread, and we have 10 threads running 
concurrently. The process will OOM after about 4 hours, at which point Spark 
has processed over 20,000 jobs.
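
For reference, that pattern sounds roughly like the sketch below. This is a guess, not the actual application: the paths and the per-file work are invented, and it assumes Python 3 (or the futures backport on Python 2).

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("threaded-files").getOrCreate()

def process_file(path):
    # Each driver-side thread submits its own Spark jobs for one file.
    df = spark.read.text(path)
    df.write.mode("append").parquet("s3a://my-bucket/out/")

paths = ["s3a://my-bucket/in/file-%05d.txt" % i for i in range(20000)]  # invented

with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(process_file, paths))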

It seems like the driver is running out of memory, but each individual job is quite small. Are there any known memory leaks for long-running Spark applications on YARN?



--
Alexis Seigneurin
Managing Consultant
(202) 459-1591 - LinkedIn<http://www.linkedin.com/in/alexisseigneurin>


