Do you mean this setup? https://spark.apache.org/docs/1.5.2/job-scheduling.html#dynamic-resource-allocation
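For reference, here is a minimal sketch of the application-side settings as I understand them from that page. The helper name `shuffle_service_flags` is made up for illustration; only the two property names come from Spark. It only builds `spark-submit` flags. The NodeManager side (registering `spark_shuffle` under `yarn.nodemanager.aux-services` in yarn-site.xml and restarting the NodeManagers) has to be done separately.

```python
def shuffle_service_flags(dynamic_allocation=False):
    """Return spark-submit --conf flags for the external shuffle service.

    Hypothetical helper for illustration; only the property names
    are Spark's, everything else is an assumption.
    """
    # The shuffle service alone keeps shuffle output readable after an
    # executor dies; dynamic allocation is optional and builds on it.
    conf = {"spark.shuffle.service.enabled": "true"}
    if dynamic_allocation:
        conf["spark.dynamicAllocation.enabled"] = "true"

    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", "{}={}".format(key, value)]
    return flags


if __name__ == "__main__":
    # Print the flags so they can be pasted after `spark-submit`.
    print(" ".join(shuffle_service_flags(dynamic_allocation=True)))
```

If you only want what Marcelo suggested, leaving `dynamic_allocation` off should be enough; the link above covers the case where both are turned on.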

On Wed, Feb 3, 2016 at 11:50 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Without the exact error from the driver that caused the job to restart,
> it's hard to tell. But a simple way to improve things is to install the
> Spark shuffle service on the YARN nodes, so that even if an executor
> crashes, its shuffle output is still available to other executors.
>
> On Wed, Feb 3, 2016 at 11:46 AM, Nirav Patel <npa...@xactlycorp.com>
> wrote:
>
>> Hi,
>>
>> I have a Spark job running in yarn-client mode. At some point during the
>> join stage, an executor (container) runs out of memory and YARN kills it.
>> Because of this, the entire job restarts, and it keeps doing so on every
>> failure.
>>
>> What is the best way to checkpoint? I see there's a checkpoint API, and
>> another option might be to persist before the join stage. Would that
>> prevent a retry of the entire job? How about retrying only the task that
>> was distributed to the faulty executor?
>>
>> Thanks
>
> --
> Marcelo