[ https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648757#comment-17648757 ]
Dongjoon Hyun commented on SPARK-40979:
---------------------------------------

I collected this as a subtask of SPARK-41550.

> Keep removed executor info in decommission state
> ------------------------------------------------
>
>                 Key: SPARK-40979
>                 URL: https://issues.apache.org/jira/browse/SPARK-40979
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: Zhongwei Zhu
>            Assignee: Zhongwei Zhu
>            Priority: Major
>             Fix For: 3.4.0
>
>
> Executors removed due to decommission should be kept in a separate set. To
> avoid OOM, the size of this set will be limited to 1K or 10K entries.
> A FetchFailed caused by a decommissioned executor falls into two categories:
> # When the FetchFailed reaches the DAGScheduler, the executor is still
> alive, or it is lost but the loss info has not yet reached
> TaskSchedulerImpl. This case is already handled in SPARK-40979.
> # The FetchFailed is caused by the loss of the decommissioned executor, so
> the decommission info has already been removed from TaskSchedulerImpl.
> Keeping such info for a short period is good enough. Even if we limit the
> set of removed executors to 10K entries, that is at most about 10MB of
> memory. In practice, it is rare to run a cluster of over 10K executors, and
> the chance that all of those executors are decommissioned and lost at the
> same time is small.
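
As a rough illustration of the bounded set the description proposes, here is a minimal Scala sketch. The class name RemovedExecutorTracker, its methods, and the default cap are hypothetical, not taken from Spark's actual implementation; it only shows the idea of an insertion-ordered set that evicts its oldest entry once the size limit is reached, so memory stays bounded regardless of cluster size.

{code:scala}
import scala.collection.mutable

// Hypothetical sketch: tracks executors that were decommissioned and then
// removed, capped at maxSize entries to avoid OOM (per the description,
// 10K string ids is on the order of 10MB at most).
class RemovedExecutorTracker(maxSize: Int = 10000) {
  // LinkedHashSet preserves insertion order, so `head` is the oldest entry.
  private val decommissionedAndRemoved = mutable.LinkedHashSet.empty[String]

  def add(executorId: String): Unit = synchronized {
    decommissionedAndRemoved += executorId
    // Evict the oldest executor id once the cap is exceeded.
    if (decommissionedAndRemoved.size > maxSize) {
      decommissionedAndRemoved -= decommissionedAndRemoved.head
    }
  }

  // Used when a FetchFailed arrives after the executor is already gone from
  // TaskSchedulerImpl: was this executor decommissioned before removal?
  def wasDecommissioned(executorId: String): Boolean = synchronized {
    decommissionedAndRemoved.contains(executorId)
  }
}
{code}

Evicting oldest-first matches the "keep such info in a short period" reasoning above: a stale entry only matters while a FetchFailed from that executor can still arrive, so the newest removals are the ones worth retaining.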