[ 
https://issues.apache.org/jira/browse/SPARK-49868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-49868:
----------------------------------
        Parent: SPARK-49524
    Issue Type: Sub-task  (was: Bug)

> ExecutorFailureTracker sometimes misses failed executors on k8s
> ---------------------------------------------------------------
>
>                 Key: SPARK-49868
>                 URL: https://issues.apache.org/jira/browse/SPARK-49868
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes
>    Affects Versions: 4.0.0, 3.5.3
>            Reporter: Attila Zsolt Piros
>            Assignee: Attila Zsolt Piros
>            Priority: Major
>             Fix For: 4.0.0
>
>
> On k8s, executor failure tracking is currently done in the 
> ExecutorPodsAllocator class, which counts the failed pods (those whose pod 
> state is PodFailed), but ExecutorPodsLifecycleManager can delete those pods 
> when "spark.kubernetes.executor.deleteOnTermination" is switched on, before 
> their failures are counted.
> This can be reproduced easily by making executors fail and checking the 
> executor IDs known by ExecutorFailureTracker with some extra logging:
> {noformat}
> 24-09-25 20:29:51 WARN ExecutorPodsAllocator: all failed: Set(69, 5, 10, 56, 
> 37, 52, 20, 46, 78, 74, 70, 21, 33, 53, 77, 73, 32, 34, 45, 22, 71, 54, 49, 
> 76, 91, 66, 35, 48, 18, 50, 67, 11, 72, 55, 75, 36, 51, 19, 47, 68, 90)
> 24-09-25 20:29:51 ERROR ExecutorPodsAllocator: Max number of executor 
> failures (40) reached       
> {noformat}
> You can see the limit was 40, yet the "all failed" set does not contain the 
> first 40 consecutive executor IDs; there are large gaps (it even contains 
> executor ID 91).
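> For reference, a hedged sketch of the configuration this behaviour depends 
> on; spark.executor.maxNumFailures is an assumption for the key behind the 
> "Max number of executor failures (40)" limit and is not taken from the logs 
> above:
> {code:scala}
> import org.apache.spark.SparkConf
>
> // Sketch only: enables prompt deletion of failed executor pods (as in the
> // description) together with an executor-failure limit of 40.
> val conf = new SparkConf()
>   .set("spark.kubernetes.executor.deleteOnTermination", "true")
>   .set("spark.executor.maxNumFailures", "40") // assumed key name
> {code}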
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
