That's a good question, Twinkle.

One solution could be to allow a maximum number of failures within any
given time span.  E.g. a max failures per hour property.
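
To make that concrete, here is a rough sketch (purely illustrative; the
class and property names below are made up and not an existing Spark API)
of how a sliding-window failure check might look in Scala:

import scala.collection.mutable

// Hypothetical tracker: the application gives up only if more than
// `maxFailuresPerHour` executor failures land inside a one-hour window.
class ExecutorFailureTracker(maxFailuresPerHour: Int) {
  private val failureTimestamps = mutable.Queue[Long]()
  private val windowMs = 60L * 60L * 1000L // one hour

  // Record a failure; returns true if the threshold has been exceeded.
  def recordFailure(nowMs: Long = System.currentTimeMillis()): Boolean = {
    failureTimestamps.enqueue(nowMs)
    // Expire failures that fell out of the one-hour window.
    while (failureTimestamps.nonEmpty && nowMs - failureTimestamps.head > windowMs) {
      failureTimestamps.dequeue()
    }
    failureTimestamps.size > maxFailuresPerHour
  }
}

Failures older than an hour would no longer count against the limit, which
addresses the long-running-job case you describe below.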

-Sandy

On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva <
twinkle.sachd...@gmail.com> wrote:

> Hi,
>
> In Spark over YARN, there is a property "spark.yarn.max.executor.failures"
> which controls the maximum number of executor failures an application will
> survive.
>
> If the number of executor failures (due to any reason, like OOM or machine
> failure) exceeds this value, then the application quits.
>
> For a short-duration Spark job this looks fine, but for long-running jobs,
> because it does not take duration into account, it can lead to the same
> treatment for two different scenarios (mentioned below):
> 1. executors failing within 5 minutes.
> 2. executors failing sparsely, but at some point even a single executor
> failure (which the application could have survived) can make the
> application quit.
>
> Sending this to the community to hear what kind of behaviour / strategy
> they think would be suitable for long-running Spark jobs or Spark
> Streaming jobs.
>
> Thanks and Regards,
> Twinkle
>
