johnhoran commented on code in PR #61778:
URL: https://github.com/apache/airflow/pull/61778#discussion_r2803250607
##########
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/triggers/pod.py:
##########
@@ -183,7 +184,7 @@ async def run(self) -> AsyncIterator[TriggerEvent]:
event = await self._wait_for_container_completion()
yield event
return
- except PodLaunchTimeoutException as e:
+ except (PodLaunchTimeoutException, PodLaunchFailedException) as e:
Review Comment:
No, that shouldn't happen. In that scenario the triggerer would exit with a
`timeout` state. Because of `detect_pod_terminate_early_issues`, it would do so
the first time it saw the image pull failure, before the `startup_timeout`
expires. The `timeout` state then accounts for the gap in time between the
triggerer exiting and the operator starting back up: the operator does a final
check to see whether the pod is in a running or terminal state. In this
scenario it wouldn't be, so the task fails.
I think there is a case for renaming the `timeout` state. The triggerer can
return one of `error`, `fatal`, `timeout` and `success`. `timeout` is
essentially for situations where the pod didn't start up in time, but if it
has started by the time control gets back to the operator, I think it's better
to let it run rather than fail the task and retry. So if I could think of a
pithy name for "fatal unless recovered", I'd rename it to that.
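To make the "fatal unless recovered" idea concrete, here is a minimal sketch of the final check as I described it. This is not the actual operator code: `PodPhase`, `RECOVERED_PHASES` and `resolve_timeout_event` are hypothetical names used only for illustration, and the returned strings are placeholders. The phase values themselves are the standard Kubernetes pod phases.

```python
from enum import Enum


class PodPhase(str, Enum):
    """Standard Kubernetes pod phases."""

    PENDING = "Pending"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    UNKNOWN = "Unknown"


# Hypothetical: phases that count as "recovered", i.e. the pod is running
# or has already reached a terminal state when the operator resumes.
RECOVERED_PHASES = {PodPhase.RUNNING, PodPhase.SUCCEEDED, PodPhase.FAILED}


def resolve_timeout_event(pod_phase: PodPhase) -> str:
    """Decide what to do when the triggerer returned a `timeout` state.

    Hypothetical helper: returns "continue" if the pod started in the gap
    between the triggerer exiting and the operator resuming, else "fail".
    """
    if pod_phase in RECOVERED_PHASES:
        # The pod got going after all, so let it run (or collect its
        # result) rather than failing the task and retrying.
        return "continue"
    # Still not started, e.g. stuck on an image pull failure.
    return "fail"
```

In the image pull scenario the pod would still be `Pending` at that point, so `resolve_timeout_event(PodPhase.PENDING)` gives `"fail"` and the task fails.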
I'm also a little unhappy with what the operator does in the `error` state,
but that's beyond the scope of this PR.