[ https://issues.apache.org/jira/browse/SPARK-32199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-32199:
------------------------------------
Assignee: (was: Apache Spark)
> Clear shuffle state when decommissioned nodes/executors are finally lost
> ------------------------------------------------------------------------
>
> Key: SPARK-32199
> URL: https://issues.apache.org/jira/browse/SPARK-32199
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.1.0
> Reporter: Devesh Agrawal
> Priority: Major
>
> Although every effort is made to migrate the cached and shuffle blocks out
> of a decommissioned node, it's still possible that lingering references
> remain to some blocks on that node. These references will result in fetch
> failures, which not only take time to detect but can also cause job
> failure.
> Deciding when to clear the shuffle state is a bit tricky. Ideally, you
> want to clear it the millisecond before the shuffle service on the node
> dies (or before the executor dies, when there is no external shuffle
> service). Clearing too soon could waste shuffle data that is still
> available, and clearing too late would lead to fetch failures.
> There are very few cases where we know precisely when the shuffle data
> will become unavailable, perhaps a cloud spot kill that gives some
> advance warning. The next best thing is to clear this state lazily at the
> first sign of trouble, i.e., when the first fetch failure is observed on a
> decommissioned entity (node or executor). We take that as a hint that the
> entity has finally gone away.
> What we care about here is whether the shuffle data is going away, i.e.,
> whether an (external) shuffle service is resident on the node being
> decommissioned, or the shuffle service is embedded inside an executor
> that is being decommissioned.
> This clearing need not be done if the shuffle data is truly remote, as in
> certain disaggregated environments.
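
The lazy clearing rule described above can be sketched roughly as follows. This is a toy model and not Spark's implementation: ShuffleStateTracker, Location, register_output and on_fetch_failure are hypothetical names invented for illustration, and the normal handling of fetch failures on entities that are not decommissioned is left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Location:
    """Where a map output lives (hypothetical stand-in for a block manager id)."""
    host: str
    executor_id: str


@dataclass
class ShuffleStateTracker:
    """Toy model of driver-side shuffle bookkeeping; all names are assumptions."""
    external_shuffle_service: bool
    shuffle_data_is_remote: bool = False
    # (shuffle_id, map_id) -> Location of the registered map output
    outputs: dict = field(default_factory=dict)
    decommissioned_hosts: set = field(default_factory=set)
    decommissioned_executors: set = field(default_factory=set)

    def register_output(self, shuffle_id, map_id, loc):
        self.outputs[(shuffle_id, map_id)] = loc

    def on_fetch_failure(self, failed):
        """Lazily clear shuffle state on the first fetch failure seen on a
        decommissioned entity. Returns the (shuffle_id, map_id) keys cleared."""
        # Truly remote (disaggregated) shuffle data does not go away with the node.
        if self.shuffle_data_is_remote:
            return []
        if self.external_shuffle_service:
            # The shuffle data is served per node, so the whole host is the unit.
            if failed.host not in self.decommissioned_hosts:
                return []
            is_lost = lambda loc: loc.host == failed.host
        else:
            # The shuffle data is served by the executor itself.
            if failed.executor_id not in self.decommissioned_executors:
                return []
            is_lost = lambda loc: loc.executor_id == failed.executor_id
        cleared = sorted(k for k, loc in self.outputs.items() if is_lost(loc))
        for key in cleared:
            del self.outputs[key]
        return cleared
```

With an external shuffle service, a single fetch failure on a decommissioned host clears every output registered on that host, including outputs written by other executors there. Without one, only the outputs of the failing executor are cleared.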
--
This message was sent by Atlassian Jira
(v8.3.4#803005)