What's the advantage of killing an application for lack of resources? I think the rationale behind killing an app based on executor failures is that, if we see a lot of them in a short span of time, it probably means something is going wrong in the app or on the cluster.
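(For reference, the property discussed below is set like any other Spark configuration. A minimal sketch in Scala; the app name and the value 16 are purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.yarn.max.executor.failures caps how many executor failures
    // the application tolerates over its whole lifetime before YARN
    // marks it as failed. The value here is arbitrary.
    val conf = new SparkConf()
      .setAppName("long-running-streaming-job")
      .set("spark.yarn.max.executor.failures", "16")
    val sc = new SparkContext(conf)
)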
On Wed, Apr 1, 2015 at 7:08 PM, twinkle sachdeva <twinkle.sachd...@gmail.com> wrote:

> Hi,
>
> Thanks, Sandy.
>
> Another way to look at this: would we like our long-running application
> to die?
>
> Say we create a window of around 10 batches and use incremental
> operations inside our application, so that a restart is relatively
> costly. Should the criterion for failing the application be a maximum
> number of executor failures, or should we instead have parameters around
> a minimum number of executors being available for some time x?
>
> That is, if the application is not able to hold a minimum of n executors
> within a period of time x, then we should fail the application.
>
> Adding a time factor here will give Spark a window to get more executors
> allocated if some of them fail.
>
> Thoughts, please.
>
> Thanks,
> Twinkle
>
>
> On Wed, Apr 1, 2015 at 10:19 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> That's a good question, Twinkle.
>>
>> One solution could be to allow a maximum number of failures within any
>> given time span, e.g. a max-failures-per-hour property.
>>
>> -Sandy
>>
>> On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva <twinkle.sachd...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In Spark over YARN, there is a property "spark.yarn.max.executor.failures"
>>> which controls the maximum number of executor failures an application
>>> will survive.
>>>
>>> If the number of executor failures (due to any reason, like OOM or
>>> machine failure) exceeds this value, the application quits.
>>>
>>> For a short-duration Spark job this looks fine, but for long-running
>>> jobs, since it does not take duration into account, it can lead to the
>>> same treatment for two very different scenarios:
>>> 1. Executors failing within 5 minutes.
>>> 2. Executors failing sparsely, so that at some point even a single
>>> executor failure (which the application could otherwise have survived)
>>> makes the application quit.
>>>
>>> Sending this to the community to hear what kind of behaviour/strategy
>>> they think would be suitable for long-running Spark jobs or Spark
>>> Streaming jobs.
>>>
>>> Thanks and Regards,
>>> Twinkle
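(To make the max-failures-per-time-span idea from the thread above concrete, here is a minimal sketch in Scala; the class and method names are hypothetical, not an existing Spark API:

    import scala.collection.mutable

    // Hypothetical sketch of the proposal: only failures that occurred
    // inside the trailing window count against the limit, so sparse
    // failures over a long run no longer accumulate toward app death.
    class WindowedFailureTracker(maxFailures: Int, windowMillis: Long) {
      private val failureTimes = mutable.Queue[Long]()

      /** Record a failure; return true if the app should be failed. */
      def recordFailure(now: Long = System.currentTimeMillis()): Boolean = {
        failureTimes.enqueue(now)
        // Drop failures that have aged out of the window.
        while (failureTimes.nonEmpty && failureTimes.head < now - windowMillis) {
          failureTimes.dequeue()
        }
        failureTimes.size > maxFailures
      }
    }

    // Usage: tolerate at most 10 executor failures per hour.
    val tracker = new WindowedFailureTracker(10, 60 * 60 * 1000L)

Twinkle's variant, failing only if the app cannot hold n executors for some period x, could be expressed with the same shape, tracking the number of live executors over time instead of failure events.)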