I'd be interested to know the answer too.

On Wed, Aug 26, 2015 at 11:45 AM, Sadhan Sood <sadhan.s...@gmail.com> wrote:

> Interestingly, if nothing is running in the dev spark-shell, it recovers
> successfully and regains the lost executors. Attaching the log for that.
> Note the "Registering block manager ..." statements at the very end, after
> all the executors were lost.
>
> On Wed, Aug 26, 2015 at 11:27 AM, Sadhan Sood <sadhan.s...@gmail.com>
> wrote:
>
>> Attaching the log from when the dev job gets stuck (once all its executors
>> are lost due to preemption). This is a spark-shell job running in
>> yarn-client mode.
>>
>> On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood <sadhan.s...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> We've set up our Spark cluster on AWS, running on YARN (Hadoop 2.3) with
>>> fair scheduling and preemption turned on. The cluster is shared between
>>> prod and dev work: prod runs with a higher fair share and can preempt dev
>>> jobs if there are not enough resources available for it.
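>>>
>>> For context, the scheduler setup looks roughly like this (queue names and
>>> numbers are illustrative, not our exact config; preemption itself is
>>> enabled via yarn.scheduler.fair.preemption=true in yarn-site.xml):
>>>
>>>     <?xml version="1.0"?>
>>>     <!-- fair-scheduler.xml (sketch) -->
>>>     <allocations>
>>>       <queue name="prod">
>>>         <weight>3.0</weight>  <!-- larger fair share for prod -->
>>>       </queue>
>>>       <queue name="dev">
>>>         <weight>1.0</weight>
>>>       </queue>
>>>       <!-- seconds a queue may sit below its fair share before the
>>>            scheduler starts preempting containers from other queues -->
>>>       <fairSharePreemptionTimeout>30</fairSharePreemptionTimeout>
>>>     </allocations>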
>>> It appears that dev jobs which get preempted often become unstable after
>>> losing some executors: the whole job gets stuck (without making any
>>> progress) or ends up getting restarted (and hence loses all the work
>>> done). Has anyone encountered this before? Is the solution just to set
>>> spark.task.maxFailures to a really high value to recover from task
>>> failures in such scenarios, as sketched below? Are there other approaches
>>> that people have taken for Spark multi-tenancy that work better in such
>>> scenarios?
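>>>
>>> For concreteness, that would mean launching the shell with something like
>>> this (the value 100 is arbitrary; the default is 4):
>>>
>>>     spark-shell --master yarn-client \
>>>       --conf spark.task.maxFailures=100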
>>>
>>> Thanks,
>>> Sadhan
>>>
>>
>>
>
>
