pmcquighan-camus opened a new issue, #56693:
URL: https://github.com/apache/airflow/issues/56693

   ### Apache Airflow Provider(s)
   
   cncf-kubernetes
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-cncf-kubernetes==10.7.0
   
   ### Apache Airflow version
   
   3.0.6
   
   ### Operating System
   
   debian 12
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Running on GKE, Kubernetes version 1.33
   
   ### What happened
   
   A job with parallelism 1 and 1 completion (i.e. just running a single pod to completion) completed successfully. The triggerer detected the job completion, but before the task was resumed, GKE deleted the pod during a node scaling event. Since the pod is `Complete`, the Job is also considered `Complete`, so Kubernetes will not recreate or retry the pod. Then, when the task wakes up, it fails in `resume_execution`, specifically while trying to fetch logs. The worst part is that on *task retries* the operator sees that the Job is completed, resumes from `execute_complete`, and hits the same pod-not-found error again (instead of, say, retrying the Job from the start).
   
   ```
   [2025-10-15, 09:48:22] ERROR - Task failed with exception: source="task"
   ApiException: (404)
   Reason: Not Found
   
   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 920 in run
   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 1215 in _execute_task
   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/bases/operator.py", line 1606 in resume_execution
   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/operators/job.py", line 276 in execute_complete
   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 470 in get_pod
   File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api/core_v1_api.py", line 23999 in read_namespaced_pod
   File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api/core_v1_api.py", line 24086 in read_namespaced_pod_with_http_info
   File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348 in call_api
   File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180 in __call_api
   File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 373 in request
   File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 244 in GET
   File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 238 in request
   ```
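   
   For concreteness, a minimal DAG that exercises this path might look like the sketch below. The DAG id, task/Job names, image, and command are illustrative, not from my actual deployment; the parameter set is my best reading of the 10.7.0 operator:
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.providers.cncf.kubernetes.operators.job import KubernetesJobOperator
   
   with DAG("k8s_job_pod_delete_repro", start_date=datetime(2025, 1, 1), schedule=None):
       KubernetesJobOperator(
           task_id="single_pod_job",       # illustrative names
           name="single-pod-job",
           namespace="default",
           image="busybox",
           cmds=["sh", "-c", "echo done"],
           parallelism=1,                  # a single pod...
           completions=1,                  # ...run once to completion
           wait_until_job_complete=True,
           deferrable=True,                # job completion is detected by the triggerer
           get_logs=True,                  # fetching logs is where the 404 surfaces;
                                           # see the workaround discussed below
           retries=2,                      # retries resume execute_complete and fail the same way
       )
   ```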
   
   I think the primary workaround is to set `get_logs=False`, but I'm not totally certain that it covers every case where a pod-not-found error might occur.
   
   Also note that the None-check after fetching the pod [here](https://github.com/apache/airflow/blob/providers-cncf-kubernetes/10.7.0/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py#L276-L278) is never reached, since `get_pod` raises a `kubernetes.client.ApiException` instead of returning `None`. I tried patching the code to catch that exception and re-raise it as `PodNotFoundException`, but that had no effect.
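   
   The patch I attempted was along these lines (a sketch, not the actual provider code; the helper name and the import path for `PodNotFoundException` are my assumptions):
   
   ```python
   from kubernetes.client.rest import ApiException
   
   from airflow.providers.cncf.kubernetes.hooks.kubernetes import KubernetesHook
   from airflow.providers.cncf.kubernetes.operators.pod import PodNotFoundException
   
   
   def get_pod_or_raise(hook: KubernetesHook, name: str, namespace: str):
       """Illustrative helper: translate a 404 from get_pod into PodNotFoundException."""
       try:
           return hook.get_pod(name, namespace)
       except ApiException as e:
           if e.status == 404:
               raise PodNotFoundException(f"pod {namespace}/{name} not found") from e
           raise
   ```
   
   Even with the 404 translated this way, a retry still resumes from `execute_complete` and fails again, which matches what I observed.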
   
   This feels similar to, but is not fixed by, https://github.com/apache/airflow/issues/39239; notably, here a task retry does not result in a successful execution.
   
   ### What you think should happen instead
   
   I think failing the task with `PodNotFoundException` when `get_logs=True` is reasonable; however, a task retry should then re-run the full task instead of just re-running `execute_complete` and failing on the same exception every time. This behavior occurred regardless of whether the Kubernetes Job object still existed.
   
   ### How to reproduce
   
   Run a KubernetesJobOperator that does anything, and once the pod completes (but before Airflow fetches the logs and marks the task complete), manually delete the pod. In an actual cloud-hosted Kubernetes environment a cluster-autoscaling component may delete the pod on its own, but that is hard to trigger reliably, so a manual delete mimics the same behavior.
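   
   For example, the manual delete can be scripted with the Kubernetes Python client (pod name and namespace below are placeholders):
   
   ```python
   # Once the Job's pod shows Completed, but before the deferred task resumes,
   # delete the pod to mimic a node scale-down.
   from kubernetes import client, config
   
   config.load_kube_config()  # or config.load_incluster_config() in-cluster
   client.CoreV1Api().delete_namespaced_pod(
       name="single-pod-job-xxxxx",  # placeholder: the Job's pod name
       namespace="default",
   )
   ```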
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

