sakethsomaraju commented on issue #59626: URL: https://github.com/apache/airflow/issues/59626#issuecomment-3722779861
One important safeguard to add here: we should only apply internal retries when the `base` container has not yet reached the Running state. If the pod (or `base` container) was already `Running`, retrying on a 404 could be dangerous for non-idempotent workloads. In that case, the container may already have started executing user code, and a retry could result in duplicate side effects (e.g. writes, external API calls, mutations). Concretely, the retry logic should be gated on something like: - Pod never reached `Running`, or Pod was preempted while still in `Pending` / `ContainerCreating` - If the pod was observed in `Running` at least once, a 404 should be treated as a terminal failure and surfaced immediately. This keeps the retry behaviour safe for non-idempotent tasks while still addressing the transient preemption window that occurs during node bootstrap and daemon set scheduling. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
