darshil929 commented on issue #55368:
URL: https://github.com/apache/airflow/issues/55368#issuecomment-4762323844
Hi @o-nikolas
Thanks for the pointer. I took your suggestion and dug into this against
current main rather than the version it was originally reported on, and I think
it's largely been resolved since.
The relevant change looks like #66705 ("Re-defer task when Kubernetes pod is
not completed"). It reworks `KubernetesPodOperator.trigger_reentry` so that
when the trigger returns an error event but the pod is still running / pending
/ waiting, the operator re-defers and keeps monitoring instead of falling
through and waiting for completion. The old fall-through path matches the
pause → restart → wait-until-completion behavior described here.
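To make the decision concrete, here is a minimal standalone sketch of the re-defer check as I read it from the grep output further down. It is not the operator's actual code: `next_action` is a hypothetical helper, and `pod_is_not_done` is passed in as a plain boolean rather than derived from the pod phase.

```python
# Illustrative only: mirrors the condition visible in the grep output on main.
# `next_action` is a hypothetical name, not part of the provider.
MAX_REDEFER_ATTEMPTS = 3  # value shown on main


def next_action(event: dict, pod_is_not_done: bool) -> tuple[str, int]:
    """Return ("redefer", new_count) or ("finish", current_count)."""
    redefer_count = event.get("_redefer_count", 0)
    if (
        event["status"] == "error"
        and pod_is_not_done
        and redefer_count < MAX_REDEFER_ATTEMPTS
    ):
        # Pod is still running / pending / waiting: defer again and carry
        # the incremented counter so the retries stay bounded.
        return "redefer", redefer_count + 1
    # Either the pod is done, the event is not an error, or the retry
    # budget is exhausted: continue with the normal completion path.
    return "finish", redefer_count
```

With this shape, an error event for a still-running pod is re-deferred at most three times before the normal completion path takes over.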
A few things that point to it being fixed:
- Grepping the operator on `10.7.0` (the version reported here) for the
re-defer logic turns up nothing:
```
$ git show providers-cncf-kubernetes/10.7.0:providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py \
    | grep -n "_redefer_count\|MAX_REDEFER_ATTEMPTS"
```
No matches. On `main` the same grep shows it living in `trigger_reentry`:
```
276: MAX_REDEFER_ATTEMPTS = 3
1026: redefer_count = event.get("_redefer_count", 0)
1028: if event["status"] == "error" and pod_is_not_done and redefer_count < self.MAX_REDEFER_ATTEMPTS:
1039: self.trigger_kwargs["_redefer_count"] = redefer_count + 1
```
- It first shipped in `apache-airflow-providers-cncf-kubernetes==10.17.1`,
so any release from there onward should include it.
- The regression tests added alongside #66705 pass on current `main`:
<img width="1320" height="379" alt="Image"
src="https://github.com/user-attachments/assets/2ec604fc-ed80-404a-96ef-7fdafc5be3b9"
/>
One thing worth flagging: in the original logs the re-entry is actually
kicked off by a `ConfigException: Invalid kube-config file. Expected key
current-context in kube-config`. #66705 stops that from immediately failing the
task, but I don't think it addresses why the kube-config ends up invalid on the
triggerer; that part may be closer to #61736.
@mhaure-touze would you be able to retest on `cncf-kubernetes>=10.17.1` and
check whether the deferrable behavior looks correct now? If the kube-config
error is gone as well, this can probably be closed. If it's still there, it
might be worth tracking separately.
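
If it helps with the retest, a quick way to confirm the environment has a provider release at or above `10.17.1` is sketched below. The helper names (`parse_release`, `has_redefer_fix`) are made up for illustration, and the parsing is deliberately simple: it compares only the first three numeric components.

```python
from importlib.metadata import PackageNotFoundError, version

# First release reported above to contain the re-defer change.
MIN_FIXED = (10, 17, 1)


def parse_release(v: str) -> tuple[int, ...]:
    """Turn '10.17.1' into (10, 17, 1), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def has_redefer_fix(installed: str) -> bool:
    """True if the installed version is at or above MIN_FIXED."""
    return parse_release(installed) >= MIN_FIXED


if __name__ == "__main__":
    try:
        installed = version("apache-airflow-providers-cncf-kubernetes")
    except PackageNotFoundError:
        print("provider not installed in this environment")
    else:
        print(installed, "includes re-defer change:", has_redefer_fix(installed))
```

It should be run in the triggerer's environment as well as the worker's, since `trigger_reentry` runs on the worker but the kube-config error came from the triggerer.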