AutomationDev85 commented on code in PR #61778:
URL: https://github.com/apache/airflow/pull/61778#discussion_r2821476467
##########
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/triggers/pod.py:
##########
@@ -183,7 +184,7 @@ async def run(self) -> AsyncIterator[TriggerEvent]:
event = await self._wait_for_container_completion()
yield event
return
- except PodLaunchTimeoutException as e:
+ except (PodLaunchTimeoutException, PodLaunchFailedException) as e:
Review Comment:
I’d love to see this fixed soon; we’ve hit the same problem with rate limits.
I’d still like to keep fail-fast behavior, though: if an Airflow user starts a
run with the wrong image, that should be visible immediately, not discovered
hours later when the task finally times out. The best solution seems to be
logic that decides whether a given ErrImagePull is retryable. Kubernetes
doesn’t expose this directly, so we may need to inspect
container_waiting.message. We’ve seen a 503 from ACR (egress over the account
limit), so perhaps we should parse the HTTP status code from the message and
retry only on 429 or 503.
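
A rough sketch of what that check could look like, purely as an illustration.
The helper names (extract_status_codes, is_retryable_image_pull_error) are
hypothetical and not part of the provider, and the regex assumes the registry's
HTTP status shows up in the waiting message as a standalone three-digit number
(for example "... 429 Too Many Requests"). Treating ImagePullBackOff the same
way as ErrImagePull is also an assumption. Real message formats vary by
container runtime and registry, so this would need testing against actual
messages.

```python
from __future__ import annotations

import re

# Assumption: only these HTTP statuses point to a transient registry problem.
RETRYABLE_STATUS_CODES = frozenset({429, 503})

# Assumption: these waiting reasons are the ones that carry an image pull error.
IMAGE_PULL_REASONS = frozenset({"ErrImagePull", "ImagePullBackOff"})

# Matches a standalone 4xx/5xx number. The word boundaries keep digits inside
# longer tokens (such as image digests) from being picked up.
_STATUS_CODE_RE = re.compile(r"\b([45]\d{2})\b")


def extract_status_codes(message: str | None) -> set[int]:
    """Return every standalone 4xx/5xx number found in the message."""
    if not message:
        return set()
    return {int(code) for code in _STATUS_CODE_RE.findall(message)}


def is_retryable_image_pull_error(reason: str | None, message: str | None) -> bool:
    """Return True only for image pull errors whose message contains 429 or 503."""
    if reason not in IMAGE_PULL_REASONS:
        return False
    return bool(extract_status_codes(message) & RETRYABLE_STATUS_CODES)
```

With this shape, a wrong image name (typically a 404 or a message with no
status code at all) still fails fast, while throttling responses are the only
ones that get retried.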