[ 
https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-40979:
-------------------------------------------

    Assignee: Zhongwei Zhu

> Keep removed executor info in decommission state
> ------------------------------------------------
>
>                 Key: SPARK-40979
>                 URL: https://issues.apache.org/jira/browse/SPARK-40979
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: Zhongwei Zhu
>            Assignee: Zhongwei Zhu
>            Priority: Major
>
> Executors removed due to decommission should be kept in a separate set. To
> avoid OOM, the set size will be limited to 1K or 10K entries.
> FetchFailed caused by a decommissioned executor falls into 2 categories:
>  # When FetchFailed reaches DAGScheduler, the executor is still alive, or it
> is lost but the loss info hasn't reached TaskSchedulerImpl yet. This is
> already handled in SPARK-40979
>  # FetchFailed is caused by the loss of a decommissioned executor, so the
> decommission info has already been removed from TaskSchedulerImpl. Keeping
> such info for a short period is good enough. Even if we limit the number of
> removed executors to 10K, it would use at most about 10MB of memory. In
> practice, it's rare to have a cluster of over 10K executors, and the chance
> that all of them are decommissioned and lost at the same time is small.
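
The bounded set described above could look roughly like the sketch below. It is an illustration only, not Spark's implementation: the class name, method names, and the oldest-first eviction policy are all assumptions, and it is written in Python for brevity rather than in Scala. The 10MB figure in the description works out to roughly 1KB per entry at a 10K cap.

```python
from collections import OrderedDict
from typing import Optional


class BoundedRemovedExecutors:
    """Hypothetical bounded record of executors removed after decommission."""

    def __init__(self, max_size: int = 10_000) -> None:
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        self._max_size = max_size
        # Insertion-ordered map: executor id -> decommission reason.
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def add(self, executor_id: str, reason: str) -> None:
        # Re-adding an id moves it to the newest position.
        self._entries.pop(executor_id, None)
        self._entries[executor_id] = reason
        # Evict the oldest entries once the cap is exceeded (assumed policy).
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)

    def reason_for(self, executor_id: str) -> Optional[str]:
        # Returns None when the executor was never recorded or was evicted.
        return self._entries.get(executor_id)

    def __len__(self) -> int:
        return len(self._entries)


# Small cap so the eviction is visible.
removed = BoundedRemovedExecutors(max_size=2)
removed.add("exec-1", "decommissioned")
removed.add("exec-2", "decommissioned")
removed.add("exec-3", "decommissioned")  # evicts exec-1, the oldest entry
```

A scheduler handling a late FetchFailed could call `reason_for` with the failed executor's id to check whether that executor was recently removed because of decommission. Capping the size keeps driver memory bounded even if many executors are lost at once.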



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
