What's the advantage of killing an application for lack of resources? I think the rationale behind killing an app based on executor failures is that, if we see a lot of them in a short span of time, it probably means something is going wrong in the app or on the cluster.
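(For reference, the property discussed below is set like any other Spark configuration. A minimal sketch in Scala; the app name and the value 16 are purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.yarn.max.executor.failures caps how many executor failures
    // the application tolerates over its whole lifetime before YARN
    // marks it as failed. The value here is arbitrary.
    val conf = new SparkConf()
      .setAppName("long-running-streaming-job")
      .set("spark.yarn.max.executor.failures", "16")
    val sc = new SparkContext(conf)
)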
On Wed, Apr 1, 2015 at 7:08 PM, twinkle sachdeva <twinkle.sachd...@gmail.com> wrote:

> Hi,
>
> Thanks, Sandy.
>
> Another way to look at this: would we like our long-running application
> to die?
>
> Say we create a window of around 10 batches and use incremental
> operations inside our application, so that a restart is relatively
> costly. Should the criterion for failing the application be a maximum
> number of executor failures, or should we instead have parameters around
> a minimum number of executors being available for some time x?
>
> That is, if the application is not able to hold a minimum of n executors
> within a period of time x, then we should fail the application.
>
> Adding a time factor here will give Spark a window to get more executors
> allocated if some of them fail.
>
> Thoughts, please.
>
> Thanks,
> Twinkle
>
>
> On Wed, Apr 1, 2015 at 10:19 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> That's a good question, Twinkle.
>>
>> One solution could be to allow a maximum number of failures within any
>> given time span, e.g. a max-failures-per-hour property.
>>
>> -Sandy
>>
>> On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva <twinkle.sachd...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In Spark over YARN, there is a property "spark.yarn.max.executor.failures"
>>> which controls the maximum number of executor failures an application
>>> will survive.
>>>
>>> If the number of executor failures (due to any reason, like OOM or
>>> machine failure) exceeds this value, the application quits.
>>>
>>> For a short-duration Spark job this looks fine, but for long-running
>>> jobs, since it does not take duration into account, it can lead to the
>>> same treatment for two very different scenarios:
>>> 1. Executors failing within 5 minutes.
>>> 2. Executors failing sparsely, so that at some point even a single
>>> executor failure (which the application could otherwise have survived)
>>> makes the application quit.
>>>
>>> Sending this to the community to hear what kind of behaviour/strategy
>>> they think would be suitable for long-running Spark jobs or Spark
>>> Streaming jobs.
>>>
>>> Thanks and Regards,
>>> Twinkle
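(To make the max-failures-per-time-span idea from the thread above concrete, here is a minimal sketch in Scala; the class and method names are hypothetical, not an existing Spark API:

    import scala.collection.mutable

    // Hypothetical sketch of the proposal: only failures that occurred
    // inside the trailing window count against the limit, so sparse
    // failures over a long run no longer accumulate toward app death.
    class WindowedFailureTracker(maxFailures: Int, windowMillis: Long) {
      private val failureTimes = mutable.Queue[Long]()

      /** Record a failure; return true if the app should be failed. */
      def recordFailure(now: Long = System.currentTimeMillis()): Boolean = {
        failureTimes.enqueue(now)
        // Drop failures that have aged out of the window.
        while (failureTimes.nonEmpty && failureTimes.head < now - windowMillis) {
          failureTimes.dequeue()
        }
        failureTimes.size > maxFailures
      }
    }

    // Usage: tolerate at most 10 executor failures per hour.
    val tracker = new WindowedFailureTracker(10, 60 * 60 * 1000L)

Twinkle's variant, failing only if the app cannot hold n executors for some period x, could be expressed with the same shape, tracking the number of live executors over time instead of failure events.)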