johnhoran opened a new issue, #61775:
URL: https://github.com/apache/airflow/issues/61775

   ### Apache Airflow Provider(s)
   
   cncf-kubernetes
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Apache Airflow version
   
   3
   
   ### Operating System
   
   astronomer
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   When running KPO in deferrable mode, I ran into an issue caused by rate 
limits imposed by our Docker registry. When the pod tried to pull the image, 
Kubernetes hit the limit and the triggerer marked the task as failed.
   ```
   
   airflow.providers.cncf.kubernetes.kubernetes_helper_functions.PodLaunchFailedException: Pod docker image cannot be pulled, unable to start: ErrImagePull
   pull QPS exceeded
   ```
   The exception is raised here:
   
https://github.com/apache/airflow/blob/1c41180381a459b77b6d964229bdc19a4a7ec0b3/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L187-L205
   
   The triggerer then hands control back to the operator. However, in the time 
it takes the operator to pick up the task, Kubernetes has successfully pulled 
the image and started the pod. The task outputs some logs from the pod and 
then just waits for the pod to complete.
   
   I note that 
https://github.com/apache/airflow/blob/1c41180381a459b77b6d964229bdc19a4a7ec0b3/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py#L1003-L1005
 suggests we should skip the waiting on ErrImagePull, but 
https://github.com/apache/airflow/blob/1c41180381a459b77b6d964229bdc19a4a7ec0b3/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L179-L182
 returns a launch failure instead of a timeout, hence the waiting.
   
   ### What you think should happen instead
   
   1. ErrImagePull should still result in a timeout instead of a failed status.
   2. When handing back from the triggerer to the operator, if the status is 
timeout, we should do one more check to see whether the pod has started and, 
if it has, defer again.
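   
   To make the two points above concrete, here is a rough sketch of the 
decision logic I have in mind. This is not the provider's code: 
`StartupState`, `classify_startup`, `action_after_trigger`, the 
`FATAL_WAITING_REASONS` set and the `"timeout"` status string are all 
hypothetical names used for illustration, and only the `Pending` and `Running` 
pod phases are modelled.

```python
from enum import Enum
from typing import Optional


class StartupState(Enum):
    STARTED = "started"              # container is up; wait for completion next
    STILL_WAITING = "still_waiting"  # keep polling until the startup timeout
    FAILED = "failed"                # unrecoverable; fail the task immediately


# Hypothetical choice: waiting reasons that cannot resolve without changing
# the pod spec. Everything else is left for Kubernetes to retry.
FATAL_WAITING_REASONS = {"InvalidImageName", "ErrImageNeverPull"}


def classify_startup(phase: str, waiting_reason: Optional[str]) -> StartupState:
    """Trigger side: decide what a single poll of the pod means."""
    if phase == "Running":
        return StartupState.STARTED
    if waiting_reason in FATAL_WAITING_REASONS:
        return StartupState.FAILED
    # ErrImagePull / ImagePullBackOff land here: Kubernetes keeps retrying the
    # pull, so only the startup timeout should end the wait.
    return StartupState.STILL_WAITING


def action_after_trigger(event_status: str, phase_now: str) -> str:
    """Operator side: re-check the pod once before acting on a timeout event."""
    if event_status == "timeout" and phase_now == "Running":
        # The pod started while the event was in flight, so defer again.
        return "defer_again"
    if event_status == "timeout":
        return "fail"
    return "continue"
```

   With this split, a rate-limited pull is treated as "still waiting" rather 
than as a launch failure, and a timeout event that arrives after the pod has 
already started leads to another deferral instead of the worker waiting on the 
pod.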
   
   ### How to reproduce
   
   -
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
