Hi Alexis,

Thanks for the response. I've been working with Irina on trying to sort
this issue out.

We thread the file processing to amortize the cost of things like
fetching files from S3. It's a pattern we've seen recommended in many
places, though I don't have those links handy. The problem isn't the
threading per se, but clearly some sort of memory leak in the driver
itself. Each file is a self-contained unit of work, so once it's done,
all memory related to it should be freed. Nothing in the script itself
grows over time, so if it can process 10 files concurrently, it should
be able to keep running like that indefinitely.
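
For reference, the driver side looks roughly like the sketch below.
This is simplified: the bucket names, paths, and the body of
process_file are placeholders rather than our actual code, but the
threading structure is the same.

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-converter").getOrCreate()

def process_file(path):
    # One self-contained unit of work: read the text file, write Parquet.
    df = spark.read.text(path)
    df.write.mode("overwrite").parquet(
        path.replace("s3://raw-bucket/", "s3://parquet-bucket/"))

# Placeholder listing; the real job enumerates the day's files from S3.
paths = ["s3://raw-bucket/input-%05d.txt" % i for i in range(20000)]

# 10 files in flight at a time, all submitted against the one SparkSession.
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(process_file, paths))

Nothing created inside process_file outlives the call, which is why
we'd expect driver memory to stay flat.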

I've hit this same issue working on another Spark app that wasn't
threaded but produced tens of thousands of jobs. Over time the Spark UI
would get slow, then unresponsive, and eventually the driver was killed
due to OOM.

I'll try to cook up some examples of this today, threaded and not. We
were hoping someone had seen this before and it would ring a bell.
Maybe there's a setting to clean up info from old jobs that we can
adjust.
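
For example, the UI retention settings look like candidates. The values
below are guesses from a quick look at the config docs, not something
we've tested, written in the same conf-tuple style as the job's
existing settings:

('spark.ui.retainedJobs', '200'),
('spark.ui.retainedStages', '200'),
('spark.sql.ui.retainedExecutions', '200'),

Or, as a blunt check of whether UI state is what's leaking, we could
try disabling the UI entirely with ('spark.ui.enabled', 'false').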

Cheers,

Keith.

On Thu, Nov 17, 2016 at 9:50 PM Alexis Seigneurin <aseigneu...@ipponusa.com>
wrote:

> Hi Irina,
>
> I would question the use of multiple threads in your application. Since
> Spark is going to run the processing of each DataFrame on all the cores of
> your cluster, the processes will be competing for resources. In fact, they
> would not only compete for CPU cores but also for memory.
>
> Spark is designed to run your processes in a sequence, and each process
> will be run in a distributed manner (multiple threads on multiple
> instances). I would suggest following this principle.
>
> Feel free to share the code if you can. It's always helpful so that we can
> give better advice.
>
> Alexis
>
> On Thu, Nov 17, 2016 at 8:51 PM, Irina Truong <ir...@parsely.com> wrote:
>
> We have an application that reads text files, converts them to dataframes,
> and saves them in Parquet format. The application runs fine when processing
> a few files, but we have several thousand produced every day. When running
> the job for all files, spark-submit gets killed with an OOM:
>
> #
> # java.lang.OutOfMemoryError: Java heap space
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 27226"...
>
> The job is written in Python. We’re running it in Amazon EMR 5.0 (Spark
> 2.0.0) with spark-submit. We’re using a cluster with a master c3.2xlarge
> instance (8 cores and 15g of RAM) and 3 core c3.4xlarge instances (16 cores
> and 30g of RAM each). Spark config settings are as follows:
>
> ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
> ('spark.executors.instances', '3'),
> ('spark.yarn.executor.memoryOverhead', '9g'),
> ('spark.executor.cores', '15'),
> ('spark.executor.memory', '12g'),
> ('spark.scheduler.mode', 'FIFO'),
> ('spark.cleaner.ttl', '1800'),
>
> The job processes each file in a thread, and we have 10 threads running
> concurrently. The process will OOM after about 4 hours, at which point
> Spark has processed over 20,000 jobs.
>
> It seems like the driver is running out of memory, but each individual job
> is quite small. Are there any known memory leaks for long-running Spark
> applications on YARN?
>
