[ https://issues.apache.org/jira/browse/AIRFLOW-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980186#comment-16980186 ]
Max commented on AIRFLOW-6040:
------------------------------

We ran into the same issue. I believe this is actually an issue in the upstream [kubernetes|https://github.com/kubernetes-client/python] package and not in Airflow.

The exception is thrown from [this loop|https://github.com/apache/airflow/blob/1.10.6/airflow/contrib/executors/kubernetes_executor.py#L356]. It passes {{label_selector="airflow-worker=<uuid>"}} to the {{list_namespaced_pod()}} method. When that call is used in a {{Watch()}}, the stream yields nothing as long as no Pods match the given UUID. The {{_request_timeout}} [config setting|https://github.com/apache/airflow/blob/1.10.6/airflow/config_templates/default_airflow.cfg#L828] then causes the underlying {{urllib3}} library to throw a read timeout exception, which {{Watch()}} does not handle.

You can reproduce this by running a simple Python pod (in your Airflow namespace, so that it has the same ServiceAccount permissions) and executing the following snippet:

{code:bash}
$ kubectl -n <your-namespace> run -i -t python --image=python:3.7.4-slim-stretch --restart=Never --command -- /bin/sh
# pip install kubernetes
# python
>>> from kubernetes import config, client, watch
>>> from kubernetes.client.rest import ApiException
>>> config.load_incluster_config()
>>> k8s = client.CoreV1Api()
>>> watcher = watch.Watch()
>>> namespace = "<your-namespace>"
>>> # The label selector below is chosen so that it matches no Pods.
>>> for event in watcher.stream(k8s.list_namespaced_pod, namespace,
...                             resource_version="0", label_selector="airflow-worker=dont-find-this",
...                             _request_timeout=(60, 60)):
...     print(event['object'])
...
{code}

I've observed this behavior in both Airflow 1.10.5 & 1.10.6, Python 2.7 & Python 3.7, K8s 1.15 & K8s 1.16, urllib3 1.24 & urllib3 1.25.

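To make the {{_request_timeout=(60, 60)}} argument used above more concrete, here is a minimal sketch of how a JSON-encoded request-args value could be turned into keyword arguments. The helper names {{parse_request_args}} and {{describe_timeout}} are hypothetical and are not Airflow's own parsing code. The only fact relied on is that the kubernetes client documents {{_request_timeout}} as either a single number (a total timeout) or a (connection, read) pair.

```python
import json


def parse_request_args(raw):
    """Turn a JSON string of client request arguments into keyword arguments.

    Hypothetical helper for illustration only.
    """
    kwargs = json.loads(raw) if raw else {}
    timeout = kwargs.get("_request_timeout")
    # JSON has no tuple type, so a [connect, read] pair arrives as a list;
    # convert it to the tuple form that the client documents.
    if isinstance(timeout, list):
        kwargs["_request_timeout"] = tuple(timeout)
    return kwargs


def describe_timeout(kwargs):
    """Return (connect_timeout, read_timeout) in seconds, or None if unset."""
    timeout = kwargs.get("_request_timeout")
    if timeout is None:
        return None
    if isinstance(timeout, tuple):
        connect, read = timeout
        return connect, read
    # A single number is a total timeout; it is reported here as both
    # values purely to keep the return shape uniform.
    return timeout, timeout


kwargs = parse_request_args('{"_request_timeout": [60, 60]}')
print(kwargs)
print(describe_timeout(kwargs))
```

With a pair such as (60, 60), the second value is the read timeout, which is the one that fires when a watch receives no data for that long.
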
As a workaround, setting [kube_client_request_args|https://github.com/apache/airflow/blob/1.10.6/airflow/config_templates/default_airflow.cfg#L828] to:

{noformat}
"{ \"_request_timeout\" : [60,60], \"timeout_seconds\" : 50 }"
{noformat}

will cause a warning instead of an exception. {{timeout_seconds}} is a parameter of the [list_namespaced_pod|https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CoreV1Api.md#list_namespaced_pod] method itself, whereas {{_request_timeout}} targets the underlying urllib3 library.

Hope this helps others who are facing this issue.

> Airflow scheduler with kubernetes executor fails :- Unknown error in KubernetesJobWatcher
> -----------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-6040
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6040
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: contrib, executor-kubernetes, scheduler
>    Affects Versions: 1.10.6
>            Reporter: Ashutosh Srivastava
>            Assignee: Daniel Imberman
>            Priority: Major
>
> I am trying to set up airflow with the kubernetes executor. I have cloned
> airflow 1.10.6 and am building the docker image and then deploying it with
> kube. The pods are running, the service airflow also starts. The webserver is
> working fine. But when I check the logs for the scheduler I get the following
> error.
>
> {{ERROR - Error while health checking kube watcher process. Process died for unknown reasons
> INFO - Event: and now my watch begins starting at resource_version: 0
> ERROR - Unknown error in KubernetesJobWatcher. Failing
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/executors/kubernetes_executor.py", line 333, in run
>     self.worker_uuid, self.kube_config)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/executors/kubernetes_executor.py", line 358, in _run
>     **kwargs):
>   File "/usr/local/lib/python2.7/dist-packages/kubernetes/watch/watch.py", line 144, in stream
>     for line in iter_resp_lines(resp):
>   File "/usr/local/lib/python2.7/dist-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
>     for seg in resp.read_chunked(decode_content=False):
>   File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 781, in read_chunked
>     self._original_response.close()
>   File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
>     self.gen.throw(type, value, traceback)
>   File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 439, in _error_catcher
>     raise ReadTimeoutError(self._pool, None, "Read timed out.")
> ReadTimeoutError: HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out.}}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)