Hi,

In Spark on YARN, there is a property "spark.yarn.max.executor.failures" which controls the maximum number of executor failures an application will survive.
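For reference, a minimal sketch of how the property can be set when building the SparkConf is below; the app name and the value 10 are just illustrative assumptions, not recommendations. The same key can also be passed on the command line via spark-submit's --conf option.

import org.apache.spark.{SparkConf, SparkContext}

object MaxExecutorFailuresExample {
  def main(args: Array[String]): Unit = {
    // The limit counts total executor failures over the whole application
    // lifetime; there is no notion of a time window.
    val conf = new SparkConf()
      .setAppName("long-running-job")                 // hypothetical app name
      .set("spark.yarn.max.executor.failures", "10")  // illustrative value
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}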
If the number of executor failures (due to any reason such as OOM, machine failure, etc.) exceeds this value, the application quits. For short-duration Spark jobs this looks fine, but for long-running jobs, since the count does not take duration into account, it can lead to the same treatment for two very different scenarios:

1. Executors failing within 5 minutes.
2. Executors failing sparsely over a long period, so that at some point even a single executor failure (which the application could otherwise have survived) makes the application quit.

Sending this to the community to hear what kind of behaviour / strategy you think would be suitable for long-running Spark jobs or Spark Streaming jobs.

Thanks and Regards,
Twinkle