passionworkeer commented on issue #59626:
URL: https://github.com/apache/airflow/issues/59626#issuecomment-4038905695

   This is a well-documented issue. The proposed solution of adding internal 
retry logic in PodManager.read_pod() is the right approach.
   
   For the implementation, consider:
   
   1. **In pod_manager.py**, modify `read_pod()` to catch `ApiException` with 
status 404 and retry:
   ```python
   import time
   from kubernetes.client.rest import ApiException
   
   def read_pod(self, name, namespace, retries=3):
       """Read a pod, retrying on 404 with exponential backoff."""
       for attempt in range(retries):
           try:
               return self.core_v1.read_namespaced_pod(name, namespace)
           except ApiException as e:
               # Retry only on 404, and only while attempts remain.
               if e.status == 404 and attempt < retries - 1:
                   time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
                   continue
               raise
   ```
   
   2. **Detect preemption** - check for `reason=Preempted` in pod events before 
failing
   3. **Use pod.metadata.resource_version** to ensure stale reads don't cause 
issues
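
As a rough illustration of point 2, the check could be a small predicate over the pod's events. This is only a sketch: `was_preempted` is a hypothetical helper name, and the events here are stand-in objects. In practice they would presumably come from something like `CoreV1Api.list_namespaced_event(namespace, field_selector=...)` filtered to the pod in question.

```python
from types import SimpleNamespace


def was_preempted(events):
    """Return True if any event in `events` has reason "Preempted".

    `events` is any iterable of objects exposing a `.reason` attribute,
    e.g. the `.items` of an event list returned by the Kubernetes client.
    (Hypothetical helper, not part of Airflow or the Kubernetes client.)
    """
    return any(getattr(event, "reason", None) == "Preempted" for event in events)


# Stand-in events for illustration; real ones would come from the API server.
sample_events = [
    SimpleNamespace(reason="Scheduled"),
    SimpleNamespace(reason="Preempted"),
]
print(was_preempted(sample_events))  # True
print(was_preempted([]))  # False
```

Keeping the check as a pure function over already-fetched events makes it easy to unit test without a cluster.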
   
   This way the operator handles transient preemption automatically without 
requiring users to set high task-level retries.
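
To sanity-check the retry behaviour without a cluster, the same loop can be exercised against a fake read function. Everything below is an assumption made for illustration: `read_with_retry`, `NotFound`, and the injectable `sleep` parameter are hypothetical names, not Airflow or Kubernetes client APIs.

```python
class NotFound(Exception):
    """Stand-in for kubernetes.client.rest.ApiException with status 404."""
    status = 404


def read_with_retry(read, retries=3, sleep=lambda seconds: None):
    """Call `read()` and retry on a 404-style error with exponential backoff."""
    for attempt in range(retries):
        try:
            return read()
        except Exception as exc:
            is_last_attempt = attempt == retries - 1
            # Re-raise anything that is not a 404, or a 404 on the final attempt.
            if getattr(exc, "status", None) != 404 or is_last_attempt:
                raise
            sleep(2 ** attempt)  # delays of 1, 2, 4, ... seconds


# A fake read that fails twice with 404, then succeeds.
calls = {"count": 0}
delays = []


def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise NotFound()
    return "pod-object"


result = read_with_retry(flaky_read, retries=3, sleep=delays.append)
print(result, delays)  # pod-object [1, 2]
```

Passing `sleep` in as a parameter lets a test record the backoff delays instead of actually waiting.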


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
