[ 
https://issues.apache.org/jira/browse/AIRFLOW-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980186#comment-16980186
 ] 

Max commented on AIRFLOW-6040:
------------------------------

We ran into the same issue. I believe it is actually an issue in the 
upstream [kubernetes|https://github.com/kubernetes-client/python] package, 
not in Airflow.

The exception is thrown from [this 
loop|https://github.com/apache/airflow/blob/1.10.6/airflow/contrib/executors/kubernetes_executor.py#L356].
 It passes {{label_selector="airflow-worker=<uuid>"}} to the 
{{list_namespaced_pod()}} method. When that call is wrapped in a {{Watch()}}, 
the stream yields nothing for as long as no Pods match the given UUID. Once 
the read timeout from the {{_request_timeout}} [config 
setting|https://github.com/apache/airflow/blob/1.10.6/airflow/config_templates/default_airflow.cfg#L828]
 elapses, the underlying {{urllib3}} library throws a timeout exception, 
which {{Watch()}} does not handle.
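
To illustrate the failure mode, here is a minimal sketch of a caller that treats an idle read timeout as a reason to reopen the stream rather than as a fatal error. It is not Airflow's code or its eventual fix. The names {{consume_watch}}, {{open_stream}} and {{fake_stream_factory}} are hypothetical, and {{socket.timeout}} stands in for the exception so the example runs without a cluster. With the real client you would presumably pass {{urllib3.exceptions.ReadTimeoutError}} (the exception in the traceback) and a callable wrapping {{watcher.stream(...)}}.

```python
import socket


def consume_watch(open_stream, handle_event, idle_errors, max_restarts=3):
    """Iterate a watch stream, reopening it when an idle timeout is raised.

    open_stream:  hypothetical zero-argument callable returning an iterable
                  of events, e.g. a lambda around watcher.stream(...).
    idle_errors:  tuple of exception classes meaning "no data arrived in time".
    Returns the number of times the stream had to be reopened.
    """
    restarts = 0
    while True:
        try:
            for event in open_stream():
                handle_event(event)
            return restarts  # the stream ended without an error
        except idle_errors:
            if restarts >= max_restarts:
                raise  # give up and surface the timeout to the caller
            restarts += 1


def fake_stream_factory():
    """Stand-in for the Kubernetes watch: times out twice, then yields a pod."""
    calls = {"n": 0}

    def open_stream():
        calls["n"] += 1
        if calls["n"] < 3:
            raise socket.timeout("Read timed out.")
        return iter([{"type": "ADDED", "object": "pod-a"}])

    return open_stream


seen = []
restarts = consume_watch(fake_stream_factory(), seen.append, (socket.timeout,))
print(restarts, seen)
```

Capping the number of restarts keeps a genuinely unreachable API server from turning into a silent infinite loop.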

You can easily reproduce this by running a simple Python pod (in your Airflow 
namespace so it has the same ServiceAccount permissions) and executing the 
following snippet:
{code:bash}
$ kubectl -n <your-namespace> run -i -t python \
    --image=python:3.7.4-slim-stretch --restart=Never --command -- /bin/sh
# pip install kubernetes
# python
>>> from kubernetes import config, client, watch
>>> from kubernetes.client.rest import ApiException
>>> config.load_incluster_config()
>>> k8s = client.CoreV1Api()
>>> watcher = watch.Watch()
>>> namespace = "<your-namespace>"
>>> # No pod carries this label, so the stream stays idle until the read timeout fires.
>>> for event in watcher.stream(k8s.list_namespaced_pod, namespace,
...         resource_version="0", label_selector="airflow-worker=dont-find-this",
...         _request_timeout=(60, 60)):
...     print(event['object'])
...
{code}
I've observed this behavior in both Airflow 1.10.5 & 1.10.6, Python 2.7 & 
Python 3.7, K8s 1.15 & K8s 1.16, urllib3 1.24 & urllib3 1.25.

As a workaround, setting 
[kube_client_request_args|https://github.com/apache/airflow/blob/1.10.6/airflow/config_templates/default_airflow.cfg#L828]
 to:
{noformat}
"{ \"_request_timeout\" : [60,60], \"timeout_seconds\" : 50 }"
{noformat}
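will cause a warning instead of an exception. {{timeout_seconds}} is passed to the 
[list_namespaced_pod|https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CoreV1Api.md#list_namespaced_pod]
 method as a server-side limit on the watch call, as opposed to a timeout in 
the underlying urllib3 library. Because 50 seconds is shorter than the 
60-second read timeout, the watch should end before the urllib3 timeout can 
fire.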
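
As a rough sanity check of that setting, the sketch below decodes the value and confirms that {{timeout_seconds}} is smaller than the read half of {{_request_timeout}}. It assumes the setting is decoded as JSON and that the {{_request_timeout}} list is turned into a {{(connect, read)}} tuple before reaching the client. The helper names {{parse_request_args}} and {{server_timeout_fires_first}} are hypothetical and not part of Airflow.

```python
import json

# The workaround value from above, with the shell/YAML escaping removed.
RAW_ARGS = '{ "_request_timeout" : [60,60], "timeout_seconds" : 50 }'


def parse_request_args(raw):
    """Decode the JSON setting; a list timeout becomes a (connect, read) tuple."""
    args = json.loads(raw)
    timeout = args.get("_request_timeout")
    if isinstance(timeout, list):
        args["_request_timeout"] = tuple(timeout)
    return args


def server_timeout_fires_first(args):
    """True when the server-side limit is shorter than the client read timeout."""
    _connect, read = args["_request_timeout"]
    return args["timeout_seconds"] < read


args = parse_request_args(RAW_ARGS)
print(args, server_timeout_fires_first(args))
```

If the check returns False, the client read timeout can still fire first and the original exception is likely to reappear.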

Hope this helps others who are facing this issue.

> Airflow scheduler with kubernetes executor fails :- Unknown error in 
> KubernetesJobWatcher
> -----------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-6040
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6040
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: contrib, executor-kubernetes, scheduler
>    Affects Versions: 1.10.6
>            Reporter: Ashutosh Srivastava
>            Assignee: Daniel Imberman
>            Priority: Major
>
> I am trying to set up airflow with the kubernetes executor. I have cloned 
> airflow 1.10.6 and am building the docker image and then deploying it with 
> kube. The pods are running, the service airflow also starts. The webserver is 
> working fine. But when I check the logs for the scheduler I get the following 
> error.
>  
> {{ERROR - Error while health checking kube watcher process. Process died for unknown reasons
> INFO - Event: and now my watch begins starting at resource_version: 0
> ERROR - Unknown error in KubernetesJobWatcher. Failing
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/executors/kubernetes_executor.py", line 333, in run
>     self.worker_uuid, self.kube_config)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/executors/kubernetes_executor.py", line 358, in _run
>     **kwargs):
>   File "/usr/local/lib/python2.7/dist-packages/kubernetes/watch/watch.py", line 144, in stream
>     for line in iter_resp_lines(resp):
>   File "/usr/local/lib/python2.7/dist-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
>     for seg in resp.read_chunked(decode_content=False):
>   File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 781, in read_chunked
>     self._original_response.close()
>   File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
>     self.gen.throw(type, value, traceback)
>   File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 439, in _error_catcher
>     raise ReadTimeoutError(self._pool, None, "Read timed out.")
> ReadTimeoutError: HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out.}}


