John Hofman created AIRFLOW-2966:
------------------------------------

             Summary: KubernetesExecutor + namespace quotas kills scheduler if 
the pod can't be launched
                 Key: AIRFLOW-2966
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2966
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 1.10
         Environment: Kubernetes 1.9.8
            Reporter: John Hofman


When Airflow runs in Kubernetes with the KubernetesExecutor and resource 
quotas are set on the namespace Airflow is deployed in, any pod the scheduler 
tries to launch that would exceed the namespace limits is rejected with an 
ApiException, and that exception crashes the scheduler.

This stack trace is an example of the ApiException from the kubernetes client:
{code:java}
[2018-08-27 09:51:08,516] {pod_launcher.py:58} ERROR - Exception when 
attempting to create Namespaced Pod.
Traceback (most recent call last):
File "/src/apache-airflow/airflow/contrib/kubernetes/pod_launcher.py", line 55, 
in run_pod_async
resp = self._client.create_namespaced_pod(body=req, namespace=pod.namespace)
File 
"/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py", 
line 6057, in create_namespaced_pod
(data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
File 
"/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py", 
line 6142, in create_namespaced_pod_with_http_info
collection_formats=collection_formats)
File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", 
line 321, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", 
line 155, in __call_api
_request_timeout=_request_timeout)
File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", 
line 364, in request
body=body)
File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 
266, in POST
body=body)
File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 
222, in request
raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': 
'b00e2cbb-bdb2-41f3-8090-824aee79448c', 'Content-Type': 'application/json', 
'Date': 'Mon, 27 Aug 2018 09:51:08 GMT', 'Content-Length': '410'})
HTTP response body: 
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods
 \"podname-ec366e89ef934d91b2d3ffe96234a725\" is forbidden: exceeded quota: 
compute-resources, requested: limits.memory=4Gi, used: limits.memory=6508Mi, 
limited: 
limits.memory=10Gi","reason":"Forbidden","details":{"name":"podname-ec366e89ef934d91b2d3ffe96234a725","kind":"pods"},"code":403}{code}
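
The response body in the trace above is a Kubernetes Status object, so a 
caller could tell a quota rejection apart from other 403 errors by inspecting 
it. A minimal sketch, assuming the HTTP status code and the raw body are both 
available to the caller; the helper name is_quota_exceeded is hypothetical 
and is not part of Airflow or the kubernetes client:

```python
import json


def is_quota_exceeded(status, body):
    """Return True if a failed pod-create response looks like a quota rejection.

    `status` is the HTTP status code and `body` the raw response body.
    The "exceeded quota" substring is taken from the message in the trace
    above; matching on it is an assumption, not a documented contract.
    """
    if status != 403:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Missing body or a body that is not JSON: cannot classify it.
        return False
    if not isinstance(payload, dict):
        return False
    message = payload.get("message") or ""
    return "exceeded quota" in message
```

A check like this would let quota rejections be treated as temporary (retry 
later) while other Forbidden responses, such as missing RBAC permissions, are 
still surfaced as real failures.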
 

I would expect the scheduler to catch the exception and at least mark the task 
as failed or, better yet, retry the task later.
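
As a rough sketch of the behaviour I have in mind (this is not the actual 
Airflow code): the launch loop catches the exception, puts the task back on 
the queue, and only gives up after a few attempts. The names launch_pending 
and max_attempts are hypothetical, and the ApiException class below is a 
stand-in for kubernetes.client.rest.ApiException so the example runs without 
the client installed:

```python
from collections import deque


class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException (status/reason only)."""

    def __init__(self, status, reason=""):
        super().__init__("({}) {}".format(status, reason))
        self.status = status
        self.reason = reason


def launch_pending(pending, launch, max_attempts=3):
    """Try each queued task once; requeue it if the pod cannot be created.

    `pending` is a deque of (task_key, attempts_so_far) tuples and `launch`
    is a callable that creates the pod and may raise ApiException.
    Returns the task keys that ran out of attempts and should be marked
    as failed by the caller.
    """
    failed = []
    retry_later = deque()
    while pending:
        key, attempts = pending.popleft()
        try:
            launch(key)
        except ApiException:
            # Do not let the exception escape and kill the scheduler loop.
            if attempts + 1 >= max_attempts:
                failed.append(key)
            else:
                retry_later.append((key, attempts + 1))
    # Requeued tasks are retried on the next call, not in a tight loop.
    pending.extend(retry_later)
    return failed
```

In real code the exception would be imported from kubernetes.client.rest 
rather than redefined, and the caller would decide how long to wait between 
passes.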

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
