[ 
https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-40979:
---------------------------------
    Description: 
Executors removed due to decommission should be kept in a separate set. To avoid 
OOM, the set size will be limited to 1K or 10K entries.

FetchFailed caused by a decommissioned executor can be divided into 2 categories:
 # When FetchFailed reaches DAGScheduler, the executor is still alive, or it is 
lost but the loss info hasn't reached TaskSchedulerImpl yet. This is already 
handled in SPARK-40979
 # FetchFailed is caused by the loss of a decommissioned executor, so the decom 
info has already been removed from TaskSchedulerImpl. Keeping such info around 
for a short period is good enough. Even if we limit the set of removed 
executors to 10K entries, memory usage would be at most about 10MB. In real 
cases, clusters of over 10K executors are rare, and the chance that all of 
these executors are decommissioned and lost at the same time is small.
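A minimal sketch of the idea above: a size-bounded, insertion-ordered set of removed decommissioned executor IDs, where the oldest entry is evicted once the cap is reached so memory stays bounded. The class name and methods here are illustrative assumptions, not the actual Spark patch; it uses Java's standard LinkedHashMap eviction hook.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch (names are illustrative, not the real Spark code):
// a size-bounded set of executor IDs that were removed after decommission.
// With a 10K cap and short executor ID strings, memory stays on the order
// of a few MB at most.
public class BoundedDecomExecutors {
    private final int maxSize;
    private final LinkedHashMap<String, Boolean> entries;

    public BoundedDecomExecutors(int maxSize) {
        this.maxSize = maxSize;
        // Insertion-ordered map; removeEldestEntry enforces the size cap
        // by evicting the oldest executor ID on overflow.
        this.entries = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > BoundedDecomExecutors.this.maxSize;
            }
        };
    }

    // Record an executor that was removed due to decommission.
    public void add(String executorId) {
        entries.put(executorId, Boolean.TRUE);
    }

    // On FetchFailed: check whether the failing executor had been
    // decommissioned, even though TaskSchedulerImpl already dropped it.
    public boolean wasDecommissioned(String executorId) {
        return entries.containsKey(executorId);
    }

    public int size() {
        return entries.size();
    }
}
```

With a cap of 2, adding a third executor evicts the first, so a lookup for it returns false while recent entries remain queryable.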

> Keep removed executor info in decommission state
> ------------------------------------------------
>
>                 Key: SPARK-40979
>                 URL: https://issues.apache.org/jira/browse/SPARK-40979
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: Zhongwei Zhu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
