[ https://issues.apache.org/jira/browse/SPARK-32199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Holden Karau resolved SPARK-32199.
----------------------------------
    Fix Version/s: 3.1.0
         Assignee: Devesh Agrawal
       Resolution: Fixed

> Clear shuffle state when decommissioned nodes/executors are finally lost
> ------------------------------------------------------------------------
>
>                 Key: SPARK-32199
>                 URL: https://issues.apache.org/jira/browse/SPARK-32199
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.1.0
>            Reporter: Devesh Agrawal
>            Assignee: Devesh Agrawal
>            Priority: Major
>             Fix For: 3.1.0
>
>
> While every effort is made to migrate the cached and shuffle blocks off a
> decommissioned node, it's still possible that lingering references remain
> to some blocks on that node. These references result in fetch failures,
> which not only take time to detect but can also cause job failure.
> Deciding when to clear the shuffle state is a bit tricky. Ideally you want
> to clear it the millisecond before the shuffle service on the node dies (or
> the executor dies, when there is no external shuffle service): clearing too
> soon could lead to some wastage, and clearing too late would lead to fetch
> failures.
> There are very few cases where we know precisely when the shuffle data
> will start being unavailable, perhaps during a cloud spot kill that gives
> some advance warning. The next best thing is to clear this state lazily at
> the first sign, i.e., when the first fetch failure is observed on a
> decommissioned entity (node or executor). We take that as a hint that the
> entity has finally gone away.
> What we care about here is whether the shuffle data is going away, i.e.,
> whether there is an (external) shuffle service resident on the node being
> decommissioned, or the shuffle service is embedded inside an executor and
> that executor is being decommissioned.
> This clearing need not be done if the shuffle data is truly remote, as in
> certain disaggregated environments.

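
For readers who want the lazy-clearing rule from the description in concrete form, here is a minimal sketch in Python. It is not Spark's implementation: the class ShuffleStateTracker, its methods, and its flags are hypothetical names invented for illustration, and the real scheduler logic is considerably more involved. The sketch only models the decision described above: on a fetch failure, clear shuffle state for the whole decommissioned entity (the host when an external shuffle service is in use, otherwise the executor), and skip clearing when shuffle data is stored remotely.

```python
class ShuffleStateTracker:
    """Toy driver-side registry of shuffle output locations (hypothetical)."""

    def __init__(self, external_shuffle_service, remote_shuffle_storage=False):
        # True: shuffle files are served by a per-host service and outlive executors.
        self.external_shuffle_service = external_shuffle_service
        # True: shuffle data lives off-node (disaggregated), so nothing is lost.
        self.remote_shuffle_storage = remote_shuffle_storage
        self.outputs = {}  # (host, executor_id) -> set of map output ids
        self.decommissioned_hosts = set()
        self.decommissioned_executors = set()

    def register_output(self, host, executor_id, map_id):
        self.outputs.setdefault((host, executor_id), set()).add(map_id)

    def decommission_host(self, host):
        self.decommissioned_hosts.add(host)

    def decommission_executor(self, executor_id):
        self.decommissioned_executors.add(executor_id)

    def on_fetch_failure(self, host, executor_id):
        """Lazily clear state for a decommissioned entity; return cleared map ids.

        Ordinary fetch-failure handling for entities that are not
        decommissioned is out of scope here, so those cases clear nothing.
        """
        if self.remote_shuffle_storage:
            return set()  # shuffle data is not going away with the node
        if self.external_shuffle_service:
            # Shuffle data goes away with the host's shuffle service.
            if host not in self.decommissioned_hosts:
                return set()
            keys = [k for k in self.outputs if k[0] == host]
        else:
            # Shuffle data goes away with the executor itself.
            if executor_id not in self.decommissioned_executors:
                return set()
            keys = [k for k in self.outputs if k[1] == executor_id]
        cleared = set()
        for key in keys:
            cleared |= self.outputs.pop(key)
        return cleared
```

With an external shuffle service, a fetch failure against one executor on a decommissioned host clears the outputs of every executor on that host in this sketch, since they all depend on the same shuffle service. A fetch failure on a host that has not been decommissioned clears nothing here.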