That's a good question, Twinkle. One solution could be to allow a maximum number of failures within any given time span, e.g. a max-failures-per-hour property.
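Roughly, I'm imagining something like the sketch below (illustrative Scala only, not existing Spark/YARN code; the class name, the per-window policy, and the example threshold/window values are all made up): keep a sliding window of failure timestamps and only give up when too many failures land inside that window.

import scala.collection.mutable

// Illustrative sketch only, not actual Spark code. Names and the
// "failures per window" policy are hypothetical.
class ExecutorFailureTracker(maxFailuresPerWindow: Int, windowMs: Long) {
  // Timestamps (ms) of recent executor failures, oldest first.
  private val failureTimes = mutable.Queue[Long]()

  // Record one failure; returns true if the application should give up,
  // i.e. more than maxFailuresPerWindow failures fell inside the window.
  def recordFailure(nowMs: Long = System.currentTimeMillis()): Boolean = {
    failureTimes.enqueue(nowMs)
    // Evict failures that are older than the sliding window.
    while (failureTimes.nonEmpty && failureTimes.head < nowMs - windowMs) {
      failureTimes.dequeue()
    }
    failureTimes.size > maxFailuresPerWindow
  }
}

object Example extends App {
  // e.g. tolerate at most 3 executor failures in any one-hour window
  val tracker = new ExecutorFailureTracker(maxFailuresPerWindow = 3, windowMs = 60L * 60 * 1000)
  println(s"Abort application? ${tracker.recordFailure()}")
}

Expressed as a rate rather than a flat count, the existing cap (spark.yarn.max.executor.failures) would still protect short jobs, while long-running or streaming jobs would only die when failures actually cluster in time, rather than when sparse failures eventually accumulate past the threshold.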
-Sandy

On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva <twinkle.sachd...@gmail.com> wrote:
> Hi,
>
> In Spark over YARN, there is a property "spark.yarn.max.executor.failures"
> which controls the maximum number of executor failures an application will
> survive.
>
> If the number of executor failures (due to any reason, like OOM or machine
> failure) exceeds this value, the application quits.
>
> For a short-duration Spark job this looks fine, but for long-running jobs
> it does not take the duration into account, which can lead to the same
> treatment for two different scenarios (described below):
> 1. Executors failing within 5 minutes.
> 2. Executors failing sparsely, so that at some point even a single executor
> failure (which the application could have survived) makes the application
> quit.
>
> Sending this to the community to hear what kind of behaviour / strategy
> they think would be suitable for long-running Spark jobs or Spark Streaming
> jobs.
>
> Thanks and Regards,
> Twinkle