Hi,

In Spark on YARN, there is a property "spark.yarn.max.executor.failures" which controls the maximum number of executor failures an application will survive.
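For reference, a minimal sketch of how the property can be set when building the SparkConf is below; the app name and the value 10 are just illustrative assumptions, not recommendations. The same key can also be passed on the command line via spark-submit's --conf option.

import org.apache.spark.{SparkConf, SparkContext}

object MaxExecutorFailuresExample {
  def main(args: Array[String]): Unit = {
    // The limit counts total executor failures over the whole application
    // lifetime; there is no notion of a time window.
    val conf = new SparkConf()
      .setAppName("long-running-job")                 // hypothetical app name
      .set("spark.yarn.max.executor.failures", "10")  // illustrative value
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}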
If the number of executor failures (due to any reason such as OOM, machine failure, etc.) exceeds this value, the application quits. For short-duration Spark jobs this looks fine, but for long-running jobs, since the count does not take duration into account, it can lead to the same treatment for two very different scenarios:

1. Executors failing within 5 minutes.
2. Executors failing sparsely over a long period, so that at some point even a single executor failure (which the application could otherwise have survived) makes the application quit.

Sending this to the community to hear what kind of behaviour / strategy you think would be suitable for long-running Spark jobs or Spark Streaming jobs.

Thanks and Regards,
Twinkle