dirrao commented on code in PR #36882:
URL: https://github.com/apache/airflow/pull/36882#discussion_r1460550276


##########
airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py:
##########
@@ -434,9 +434,9 @@ def sync(self) -> None:
                     )
                     self.fail(task[0], e)
                 except ApiException as e:
-                    # These codes indicate something is wrong with pod 
definition; otherwise we assume pod
-                    # definition is ok, and that retrying may work
-                    if e.status in (400, 422):
+                    # In case of the below error codes, fail the task and 
honor the task retires.
+                    # Otherwise, go for continuous/infinite retries.
+                    if e.status in (400, 403, 404, 422):

Review Comment:
   Yes. Backoff retry strategy works for transient errors. However, it doesn't 
make sense to apply for non-transient errors. K8 API error code 403 includes 
both transient and non-transient errors. We need to deep-dive into all the 
error messages and apply a back-off retry strategy for transient errors only. 
However, it is a bit more complicated but doable. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to