[ 
https://issues.apache.org/jira/browse/SPARK-36446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394931#comment-17394931
 ] 

Thomas Graves commented on SPARK-36446:
---------------------------------------

Is this with the yarn nodemangar recovery enabled?  ie yarn stores the 
necessarily information in a database which on restart it loads back up, if 
that is not configured it will not remember blocks.  

> YARN shuffle server restart crashes all dynamic allocation jobs that have 
> deallocated an executor
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36446
>                 URL: https://issues.apache.org/jira/browse/SPARK-36446
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 2.4.8, 3.1.2
>            Reporter: Adam Kennedy
>            Priority: Critical
>
> When dynamic allocation is enabled, executors that deallocate rely on the 
> shuffle server to hold blocks and supply them to remaining executors.
> When YARN Shuffle Server restarts (either intentionally or due to a crash), 
> it loses block information and relies on being able to contact Executors (the 
> locations of which it durably stores) to refetch the list of blocks.
> This mutual dependency on the other to hold block information fails fatally 
> under some common scenarios.
> For example, if a Spark application is running under dynamic allocation, some 
> amount of executors will almost always shut down.
> If, after this has occurred, any shuffle server crashes, or is restarted 
> (either directly when running as a standalone service, or as part of a YARN 
> node manager restart) then there is no way to restore block data and it is 
> permanently lost.
> Worse, when Executors try to fetch blocks from the shuffle server, the 
> shuffle server cannot location the exeutor, decides it doesn't exist, treats 
> it as a fatal exception, and causes the application to terminate and crash.
> Thus, in a real world scenario that we observe on a 1000+ node multi-tenant 
> cluster  where dynamic allocation is on by default, a rolling restart of the 
> YARN node managers will cause ALL jobs that have deallocated any executor and 
> have shuffles or transferred blocks to the shuffle server in order to shut 
> down, to crash.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to