Hi Nirav
There is a difference between dynamic resource allocation and the shuffle service.
With dynamic allocation enabled, Spark adjusts the number of executors to match the 
workload: it requests additional executors while there is a backlog of pending tasks 
and releases executors that have sat idle, so a light job gets few executors and a 
heavy job gets more.
The external shuffle service is different: it serves shuffle output from a process 
that runs outside the executors (on each YARN NodeManager in your case). Because the 
intermediate shuffle files are no longer tied to the executor that produced them, 
they remain available even if that executor dies, and the other active executors can 
keep fetching them and continue the job instead of recomputing that output.
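
For reference, here is a minimal sketch of what enabling both looks like from the 
application side on YARN (the app name and executor bounds below are placeholders, 
not values from this thread):

    // Sketch only: assumes Spark on YARN with the external shuffle service
    // already installed on every NodeManager (the spark_shuffle aux-service).
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("join-job")                             // placeholder name
      // Serve shuffle files from the NodeManager-side service rather than the
      // executor itself, so shuffle output survives executor loss.
      .set("spark.shuffle.service.enabled", "true")
      // Optional: scale the executor count with the pending-task backlog.
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")   // placeholder bounds
      .set("spark.dynamicAllocation.maxExecutors", "50")

    val sc = new SparkContext(conf)

Note that dynamic allocation requires the external shuffle service, but the shuffle 
service can be enabled on its own, which is what Marcelo suggested below.
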
> On Feb 3, 2016, at 1:02 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> 
> Yes, but you don't necessarily need to use dynamic allocation (just enable 
> the external shuffle service).
> 
> On Wed, Feb 3, 2016 at 11:53 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
> Do you mean this setup?
> https://spark.apache.org/docs/1.5.2/job-scheduling.html#dynamic-resource-allocation
> 
> 
> 
> On Wed, Feb 3, 2016 at 11:50 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
> Without the exact error from the driver that caused the job to restart, it's 
> hard to tell. But a simple way to improve things is to install the Spark 
> shuffle service on the YARN nodes, so that even if an executor crashes, its 
> shuffle output is still available to other executors.
> 
> On Wed, Feb 3, 2016 at 11:46 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
> Hi,
> 
> I have a Spark job running in yarn-client mode. At some point during a join 
> stage, an executor (container) runs out of memory and YARN kills it. Because of 
> this, the entire job restarts, and it keeps restarting on every failure.
> 
> What is the best way to checkpoint? I see there is a checkpoint API, and another 
> option might be to persist before the join stage. Would that prevent the whole 
> job from being retried? How about retrying only the tasks that ran on the 
> faulty executor?
> 
> Thanks
> 
> -- 
> Marcelo
> 
> -- 
> Marcelo
