Would be interested to know the answer too.

On Wed, Aug 26, 2015 at 11:45 AM, Sadhan Sood <sadhan.s...@gmail.com> wrote:
> Interestingly, if there is nothing running on the dev spark-shell, it
> recovers successfully and regains the lost executors. Attaching the log
> for that. Notice the "Registering block manager .." statements at the
> very end, after all executors were lost.
>
> On Wed, Aug 26, 2015 at 11:27 AM, Sadhan Sood <sadhan.s...@gmail.com>
> wrote:
>
>> Attaching the log for when the dev job gets stuck (once all its
>> executors are lost due to preemption). This is a spark-shell job
>> running in yarn-client mode.
>>
>> On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood <sadhan.s...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> We've set up our Spark cluster on AWS running on YARN (Hadoop 2.3)
>>> with fair scheduling and preemption turned on. The cluster is shared
>>> between prod and dev work, where prod runs with a higher fair share
>>> and can preempt dev jobs if there are not enough resources available
>>> for it. Dev jobs that get preempted often become unstable after losing
>>> some executors: the whole job either gets stuck (without making any
>>> progress) or ends up getting restarted (and hence loses all the work
>>> done). Has anyone encountered this before? Is the solution just to set
>>> spark.task.maxFailures to a really high value to recover from task
>>> failures in such scenarios? Are there other approaches people have
>>> taken for Spark multi-tenancy that work better in this scenario?
>>>
>>> Thanks,
>>> Sadhan
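
For anyone else hitting this, a minimal sketch of the kind of settings the
question mentions, assuming Spark 1.x on YARN. spark.task.maxFailures and
spark.yarn.max.executor.failures are documented Spark configuration
properties, but the values below are illustrative, not tested
recommendations. If preempted containers count toward the executor-failure
limit (as they did in some Spark versions of this era), raising only
spark.task.maxFailures may not be enough to keep the application from being
killed and restarted:

    # Illustrative values only; tune to your cluster and preemption rate.
    spark-shell --master yarn-client \
      --conf spark.task.maxFailures=32 \
      --conf spark.yarn.max.executor.failures=64

As for other multi-tenancy approaches: one commonly used option is dynamic
allocation (spark.dynamicAllocation.enabled=true together with
spark.shuffle.service.enabled=true and the YARN external shuffle service),
so idle dev shells release resources back to the cluster instead of holding
them until they get preempted.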