sakethsomaraju commented on issue #59626:
URL: https://github.com/apache/airflow/issues/59626#issuecomment-3722779861

   One important safeguard to add here: we should only apply internal retries 
when the `base` container has not yet reached the Running state.
   
   If the pod (or `base` container) was already `Running`, retrying on a 404 
could be dangerous for non-idempotent workloads. In that case, the container 
may already have started executing user code, and a retry could result in 
duplicate side effects (e.g. writes, external API calls, mutations).
   
   Concretely, the retry logic should be gated on something like:
   
   - Pod never reached `Running`, or Pod was preempted while still in `Pending` 
/ `ContainerCreating`
   
   - If the pod was observed in `Running` at least once, a 404 should be 
treated as a terminal failure and surfaced immediately.
   
   This keeps the retry behaviour safe for non-idempotent tasks while still 
addressing the transient preemption window that occurs during node bootstrap 
and daemon set scheduling.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to