dirrao commented on code in PR #36882:
URL: https://github.com/apache/airflow/pull/36882#discussion_r1460214073


##########
airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py:
##########
@@ -434,9 +434,9 @@ def sync(self) -> None:
                     )
                     self.fail(task[0], e)
                 except ApiException as e:
-                    # These codes indicate something is wrong with pod definition; otherwise we assume pod
-                    # definition is ok, and that retrying may work
-                    if e.status in (400, 422):
+                    # In case of the below error codes, fail the task and honor the task retires.
+                    # Otherwise, go for continuous/infinite retries.
+                    if e.status in (400, 403, 404, 422):

Review Comment:
   I am not in favor of overloading the Kubernetes API server by default. Even
in the case of transient errors, such as an exceeded quota, it takes at least a
few minutes for quota to become available. In certain scenarios, such as
permission issues, the task stays in the queued state forever, which is not
good. Instead, let the task fail eventually after its retries so that the
consumer can take appropriate action (increase the quota or adjust the job
timing).
   ```
            - 403 Forbidden is returned in scenarios such as:
                - your request exceeds the namespace quota
                - the scheduler role doesn't have permission to launch the pod
   ```
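
To make the intended behaviour concrete, here is a minimal sketch of the status check from the diff above. It is only an illustration: `FAIL_FAST_STATUSES`, `should_fail_task`, and `handle_api_error` are hypothetical names, not part of Airflow or the Kubernetes client, and the real executor works with an `ApiException` and a task queue rather than bare integers and strings.

```python
# Hypothetical sketch, not Airflow code: mirrors `e.status in (400, 403, 404, 422)`.
FAIL_FAST_STATUSES = frozenset({400, 403, 404, 422})


def should_fail_task(status: int) -> bool:
    """Return True when the task should be failed instead of re-queued."""
    return status in FAIL_FAST_STATUSES


def handle_api_error(status: int) -> str:
    """Decide what to do with a task whose pod creation raised an API error."""
    if should_fail_task(status):
        # Failing hands control back to the task's own retry settings,
        # so the number of attempts and the delay between them are bounded.
        return "fail"
    # Any other status is assumed transient and is retried by the executor.
    return "requeue"


if __name__ == "__main__":
    for code in (403, 422, 500):
        print(code, handle_api_error(code))
```

With this split, a 403 caused by quota or permissions surfaces as a failed task after the configured retries rather than leaving the task queued indefinitely.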


