We have an application that reads text files, converts them to DataFrames,
and saves them in Parquet format. The application runs fine when processing
a few files, but several thousand files are produced every day, and when we
run the job over all of them the spark-submit process is killed with an OOM:

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 27226"...

The job is written in Python and runs on Amazon EMR 5.0 (Spark 2.0.0) via
spark-submit. The cluster has one c3.2xlarge master instance (8 cores,
15 GB RAM) and three c3.4xlarge core instances (16 cores, 30 GB RAM each).
The Spark config settings are as follows:

('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
('spark.executors.instances', '3'),
('spark.yarn.executor.memoryOverhead', '9g'),
('spark.executor.cores', '15'),
('spark.executor.memory', '12g'),
('spark.scheduler.mode', 'FIFO'),
('spark.cleaner.ttl', '1800'),
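
For completeness, this is roughly how the settings above are applied when the
session is built (a simplified sketch, not our exact code; the app name is a
placeholder):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Build the session from the config tuples listed above.
conf = SparkConf().setAll([
    ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
    ('spark.executors.instances', '3'),
    ('spark.yarn.executor.memoryOverhead', '9g'),
    ('spark.executor.cores', '15'),
    ('spark.executor.memory', '12g'),
    ('spark.scheduler.mode', 'FIFO'),
    ('spark.cleaner.ttl', '1800'),
])

spark = (SparkSession.builder
         .appName('text-to-parquet')   # placeholder name
         .config(conf=conf)
         .getOrCreate())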

The job processes each file in its own thread, with 10 threads running
concurrently. The process OOMs after about 4 hours, by which point Spark has
completed more than 20,000 jobs.
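
To illustrate the processing pattern, here is a simplified sketch (not our
actual code; the paths, the parsing step, and the process_file helper are
made up for the example):

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # the shared session from the config above

def process_file(path):
    # Read one text file, turn it into a DataFrame, and write it as Parquet.
    # The real job does more parsing here; this only shows the read -> write shape.
    df = spark.read.text(path)
    out = 's3://output-bucket/parquet/' + path.rsplit('/', 1)[-1]  # placeholder output path
    df.write.mode('overwrite').parquet(out)

# Placeholder list standing in for the several thousand files produced each day.
file_paths = ['s3://input-bucket/file-%05d.txt' % i for i in range(5000)]

# 10 threads, each driving its own stream of small Spark jobs on the one driver JVM.
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(process_file, p) for p in file_paths]
    for f in futures:
        f.result()  # re-raise any per-file failure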

It seems like the driver is running out of memory, but each individual job
is quite small. Are there any known memory leaks for long-running Spark
applications on YARN?
