johnhoran opened a new issue, #61775: URL: https://github.com/apache/airflow/issues/61775
### Apache Airflow Provider(s)

cncf-kubernetes

### Versions of Apache Airflow Providers

_No response_

### Apache Airflow version

3

### Operating System

astronomer

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

_No response_

### What happened

When running KPO in deferred mode, I ran into an issue caused by rate limits imposed by our Docker registry. When the pod tried to pull the image, Kubernetes hit the limit and the triggerer marked the task as failed.

```
airflow.providers.cncf.kubernetes.kubernetes_helper_functions.PodLaunchFailedException: Pod docker image cannot be pulled, unable to start: ErrImagePull pull QPS exceeded
```

The exception comes from https://github.com/apache/airflow/blob/1c41180381a459b77b6d964229bdc19a4a7ec0b3/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L187-L205

The triggerer then hands control back to the operator. However, in the time it takes the operator to pick up the task, Kubernetes has managed to pull the image and start the pod. The task outputs some logs from the pod and then just waits for the pod to complete.

I note that https://github.com/apache/airflow/blob/1c41180381a459b77b6d964229bdc19a4a7ec0b3/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py#L1003-L1005 suggests we should skip the waiting on ErrImagePull, but https://github.com/apache/airflow/blob/1c41180381a459b77b6d964229bdc19a4a7ec0b3/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L179-L182 returns a launch failure instead of a timeout, hence the waiting.

### What you think should happen instead

1. ErrImagePull should still result in a timeout instead of a failed status.
2. When handing back from the triggerer to the operator, if the status is timeout, we should still do one more check to see whether the pod has started, and if it has, we should defer again.

### How to reproduce

-

### Anything else

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
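
To make proposal 2 under "What you think should happen instead" concrete, here is a minimal sketch of the decision the operator could make when the triggerer hands back a timeout. It is not the provider's code: `Action` and `action_after_startup_timeout` are hypothetical names, and the only real values are the standard Kubernetes pod phases (`Pending`, `Running`, `Succeeded`, `Failed`, `Unknown`). Treating an already-finished pod as "collect the result" is an assumption that goes slightly beyond what the issue asks for.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes for the operator after a startup timeout."""

    DEFER_AGAIN = "defer_again"        # pod started after all: go back to the triggerer
    COLLECT_RESULT = "collect_result"  # pod already finished: read its final state
    FAIL_TASK = "fail_task"            # pod never started: fail as a real timeout


def action_after_startup_timeout(pod_phase: str) -> Action:
    """Pick the next step from the pod phase re-read at hand-back time.

    `pod_phase` is assumed to be the value of `pod.status.phase` fetched
    once more from the Kubernetes API when the operator resumes.
    """
    if pod_phase == "Running":
        # The image pull succeeded while the task was waiting to resume.
        return Action.DEFER_AGAIN
    if pod_phase in ("Succeeded", "Failed"):
        # The pod ran to completion in the meantime (assumption, see above).
        return Action.COLLECT_RESULT
    # "Pending" or "Unknown": the pod still has not started.
    return Action.FAIL_TASK
```

With this shape, the scenario described above (image pull throttled, then succeeding before the operator resumes) would map to `Action.DEFER_AGAIN` rather than a blocking wait in the worker.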
