[ https://issues.apache.org/jira/browse/SPARK-49868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-49868:
----------------------------------
    Parent: SPARK-49524
    Issue Type: Sub-task  (was: Bug)

> ExecutorFailureTracker sometimes misses failed executors on k8s
> ---------------------------------------------------------------
>
>                 Key: SPARK-49868
>                 URL: https://issues.apache.org/jira/browse/SPARK-49868
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes
>    Affects Versions: 4.0.0, 3.5.3
>            Reporter: Attila Zsolt Piros
>            Assignee: Attila Zsolt Piros
>            Priority: Major
>             Fix For: 4.0.0
>
> On k8s, executor failure tracking is currently done in the
> ExecutorPodsAllocator class, which counts failed pods (pods whose state is
> PodFailed). However, ExecutorPodsLifecycleManager can delete those pods
> before they are counted when
> "spark.kubernetes.executor.deleteOnTermination" is switched on.
>
> This can be reproduced easily by making executors fail and checking the
> executor IDs known to ExecutorFailureTracker after adding some new logs:
>
> {noformat}
> 24-09-25 20:29:51 WARN ExecutorPodsAllocator: all failed: Set(69, 5, 10, 56,
> 37, 52, 20, 46, 78, 74, 70, 21, 33, 53, 77, 73, 32, 34, 45, 22, 71, 54, 49,
> 76, 91, 66, 35, 48, 18, 50, 67, 11, 72, 55, 75, 36, 51, 19, 47, 68, 90)
> 24-09-25 20:29:51 ERROR ExecutorPodsAllocator: Max number of executor
> failures (40) reached
> {noformat}
>
> You can see that 40 was the limit, yet the "all failed" set does not
> contain the first 40 consecutive executor IDs; there are huge gaps (it
> even contains executor ID 91).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
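The race described in the issue can be sketched in a minimal, language-agnostic way. The following Python snippet is a hypothetical simplification, not Spark's actual internals: the class names, method names, and pod states are illustrative stand-ins for the snapshot-scanning allocator and the pod-deleting lifecycle manager.

```python
# Illustrative sketch of the race (hypothetical names, not Spark internals).

class Allocator:
    """Counts failures by scanning the current pod snapshot (like
    ExecutorPodsAllocator counting PodFailed pods)."""
    def __init__(self):
        self.failed_ids = set()

    def on_snapshot(self, pods):
        for pod_id, state in pods.items():
            if state == "Failed":
                self.failed_ids.add(pod_id)


class LifecycleManager:
    """Deletes failed pods when deleteOnTermination is enabled (like
    ExecutorPodsLifecycleManager)."""
    def __init__(self, delete_on_termination=True):
        self.delete_on_termination = delete_on_termination

    def sweep(self, pods):
        if self.delete_on_termination:
            for pod_id in [p for p, s in pods.items() if s == "Failed"]:
                del pods[pod_id]


# Pods 1 and 3 have already failed when the allocator takes a snapshot;
# pods 2 and 4 fail afterwards, and the lifecycle sweep deletes all
# failed pods before the allocator's next snapshot.
pods = {1: "Failed", 2: "Running", 3: "Failed", 4: "Running"}
alloc = Allocator()
lm = LifecycleManager()

alloc.on_snapshot(pods)   # records failures of pods 1 and 3
pods[2] = "Failed"
pods[4] = "Failed"
lm.sweep(pods)            # deletes pods 1..4 before the next snapshot
alloc.on_snapshot(pods)   # pods 2 and 4 are already gone: never counted

print(sorted(alloc.failed_ids))  # [1, 3] -- failures of 2 and 4 missed
```

Under these assumptions, any pod that fails and is swept between two snapshots is never counted, which is consistent with the gappy "all failed" set in the log above.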