[
https://issues.apache.org/jira/browse/YARN-11466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051964#comment-18051964
]
yanbin.zhang commented on YARN-11466:
-------------------------------------
[~prabhujoseph] Hello, have you ever implemented this idea?
> Graceful Decommission for Shuffle Services
> ------------------------------------------
>
> Key: YARN-11466
> URL: https://issues.apache.org/jira/browse/YARN-11466
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Major
>
> Currently, YARN Graceful Decommission waits for the completion of both
> running containers and the running applications
> (https://issues.apache.org/jira/browse/YARN-9608) of those containers
> launched on the node under decommission. This adds a large, unnecessary cost
> for users on cloud deployments, since most nodes under decommission sit idle
> waiting for the running applications to complete.
> This feature aims to improve the Graceful Decommission logic by waiting for
> the actual shuffle data to be consumed by dependent tasks rather than the
> entire application. Below is the high-level design I have in mind.
> Add a new interface (say AuxiliaryShuffleService extends AuxiliaryService)
> through which the workloads (Spark, Tez, MapReduce) ShuffleHandler exposes
> shuffle data metrics (like shuffle data being present or not). NodeManager
> periodically collects the shuffle data metrics from the configured
> AuxiliaryShuffleServices and sends them along with the heartbeat to the
> ResourceManager. The graceful decommission logic runs inside ResourceManager
> waits until the shuffle data is consumed, with a maximum wait time up to the
> configured graceful decommission timeout.
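A minimal sketch of the proposed idea could look like the following. Note this is an illustration only: the names AuxiliaryShuffleService, getPendingShuffleBytes, and canCompleteDecommission are hypothetical (not part of any released Hadoop API), and the real interface would extend org.apache.hadoop.yarn.server.api.AuxiliaryService rather than stand alone as here.

```java
import java.util.ArrayList;
import java.util.List;

public class ShuffleDecommissionSketch {

    // Hypothetical contract a workload's ShuffleHandler would implement so
    // the NodeManager can report shuffle-data metrics in its heartbeat.
    interface AuxiliaryShuffleService {
        /** Bytes of shuffle output on this node not yet fetched by dependent tasks. */
        long getPendingShuffleBytes();
    }

    /**
     * Simplified stand-in for the ResourceManager-side check: a node may
     * complete graceful decommission once no containers are running AND no
     * configured shuffle service still holds unfetched data, or once the
     * graceful decommission timeout has elapsed.
     */
    static boolean canCompleteDecommission(int runningContainers,
                                           List<AuxiliaryShuffleService> services,
                                           long elapsedMs,
                                           long timeoutMs) {
        if (elapsedMs >= timeoutMs) {
            return true; // hard cap: decommission regardless of shuffle state
        }
        if (runningContainers > 0) {
            return false; // existing behavior: wait for running containers
        }
        for (AuxiliaryShuffleService s : services) {
            if (s.getPendingShuffleBytes() > 0) {
                return false; // dependent tasks still need this node's shuffle data
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<AuxiliaryShuffleService> services = new ArrayList<>();
        services.add(() -> 0L); // shuffle data fully consumed
        System.out.println(canCompleteDecommission(0, services, 1_000, 60_000));

        services.add(() -> 4_096L); // another service still holds data
        System.out.println(canCompleteDecommission(0, services, 1_000, 60_000));
        System.out.println(canCompleteDecommission(0, services, 60_000, 60_000));
    }
}
```

The point of the sketch is that the node no longer waits on application completion (YARN-9608 behavior) but only on the shuffle-data metric reaching zero, bounded by the existing decommission timeout.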
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]