gillbuchanan opened a new issue #13916:
URL: https://github.com/apache/airflow/issues/13916


   **Apache Airflow version**: 2.0.0
   
   
   **Kubernetes version**:
   ```
   Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
   Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.13", GitCommit:"37c06f456fdb4d25e402b5fbcb72cd6a77a021a9", GitTreeState:"clean", BuildDate:"2020-09-18T21:59:14Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
   ```
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: Azure Kubernetes Service
   - **Image**: apache/airflow:2.0.0-python3.6
   - **Config Variables**: 
   ```bash
   AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
   AIRFLOW__CORE__DONOT_PICKLE=false
   AIRFLOW__CORE__ENABLE_XCOM_PICKLING=false
   AIRFLOW__CORE__EXECUTOR=KubernetesExecutor
   AIRFLOW__CORE__FERNET_KEY=*****
   AIRFLOW__CORE__LOAD_EXAMPLES=false
   AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD=bash -c 'eval "$DATABASE_SQLALCHEMY_CMD"'
   AIRFLOW__ELASTICSEARCH__WRITE_STDOUT=True
   AIRFLOW__KUBERNETES__ENV_FROM_CONFIGMAP_REF=my-name-env
   AIRFLOW__KUBERNETES__NAMESPACE=airflow
   AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE=/home/airflow/scripts/pod-template.yaml
   AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME=my-name
   AIRFLOW__LOGGING__BASE_LOG_FOLDER=/opt/airflow/logs
   AIRFLOW__LOGGING__DAG_PROCESSOR_MANAGER_LOG_LOCATION=/opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
   AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=wasb://airflow-logs@******.blob.core.windows.net
   AIRFLOW__SCHEDULER__CHILD_PROCESS_LOG_DIRECTORY=/opt/airflow/logs/scheduler
   AIRFLOW__WEBSERVER__BASE_URL=http://****/my-name
   AIRFLOW__WEBSERVER__WEB_SERVER_PORT=8080
   ```
   
   **What happened**:
   
   After installing Airflow on AKS via Helm charts, the webserver and scheduler
start up as expected. After some time (whether the deployment is active or
sitting idle), the scheduler emits the following:
   <details><summary>scheduler error messages</summary>
   
   ```
   [2021-01-26 16:22:08,620] {kubernetes_executor.py:111} ERROR - Unknown error in KubernetesJobWatcher. Failing
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
       return self.connection.recv_into(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
       self._raise_ssl_error(self._ssl, result)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
       raise SysCallError(errno, errorcode.get(errno))
   OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 436, in _error_catcher
       yield
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 763, in read_chunked
       self._update_chunk_length()
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 693, in _update_chunk_length
       line = self._fp.fp.readline()
     File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
       return self._sock.recv_into(b)
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 318, in recv_into
       raise SocketError(str(e))
   OSError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 103, in run
       kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
     File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
       for event in list_worker_pods():
     File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream
       for line in iter_resp_lines(resp):
     File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
       for seg in resp.read_chunked(decode_content=False):
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 792, in read_chunked
       self._original_response.close()
     File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
       self.gen.throw(type, value, traceback)
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 454, in _error_catcher
       raise ProtocolError("Connection broken: %r" % e, e)
   urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, \'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
   Process KubernetesJobWatcher-3:
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
       return self.connection.recv_into(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
       self._raise_ssl_error(self._ssl, result)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
       raise SysCallError(errno, errorcode.get(errno))
   OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 436, in _error_catcher
       yield
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 763, in read_chunked
       self._update_chunk_length()
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 693, in _update_chunk_length
       line = self._fp.fp.readline()
     File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
       return self._sock.recv_into(b)
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 318, in recv_into
       raise SocketError(str(e))
   OSError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
       self.run()
     File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 103, in run
       kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
     File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
       for event in list_worker_pods():
     File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream
       for line in iter_resp_lines(resp):
     File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
       for seg in resp.read_chunked(decode_content=False):
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 792, in read_chunked
       self._original_response.close()
     File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
       self.gen.throw(type, value, traceback)
     File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 454, in _error_catcher
       raise ProtocolError("Connection broken: %r" % e, e)
   urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, \'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
   [2021-01-26 16:22:10,177] {kubernetes_executor.py:266} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
   [2021-01-26 16:22:10,189] {kubernetes_executor.py:126} INFO - Event: and now my watch begins starting at resource_version: 0
   [2021-01-26 16:23:00,720] {scheduler_job.py:1751} INFO - Resetting orphaned tasks for active dag runs
   ```
   </details>
   
   **Steps I've taken to debug**:
   
   Based on where the errors occur in the stack trace, I assumed the problem
was related to the `KubernetesExecutor` making an API request for a list of
pods. To debug this, I `exec`ed into the scheduler pod and ran
   ```bash
   KUBE_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
   curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" \
     "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/api/v1/pods/"
   ```
   which initially returned a `403 Forbidden` error. I then created the following
`ClusterRoleBinding`:
   
   <details><summary>rbac-read.yaml</summary>
   
   ```yaml
   apiVersion: rbac.authorization.k8s.io/v1beta1
   kind: ClusterRoleBinding
   metadata:
     name: system:serviceaccount:airflow:my-name:read-pods
     namespace: kube-system
   subjects:
     - kind: ServiceAccount
       name: my-name
       namespace: airflow
   roleRef:
     kind: ClusterRole
     name: cluster-admin
     apiGroup: rbac.authorization.k8s.io
   ```
   </details>
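
A narrower way to confirm what the service account may do is to ask the API server directly. The following is a minimal sketch, assuming the `kubernetes` Python client is importable inside the scheduler pod. The helper names `access_review_body` and `check_pod_permissions` are hypothetical and not part of Airflow or the client library. It submits a `SelfSubjectAccessReview` for the `list` and `watch` verbs on pods in the `airflow` namespace.

```python
def access_review_body(verb, namespace="airflow", resource="pods"):
    """Build a SelfSubjectAccessReview manifest asking:
    may the current identity <verb> <resource> in <namespace>?"""
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SelfSubjectAccessReview",
        "spec": {
            "resourceAttributes": {
                "namespace": namespace,
                "verb": verb,
                "resource": resource,
            }
        },
    }


def check_pod_permissions(verbs=("list", "watch"), namespace="airflow"):
    """Return {verb: allowed} for the pod's own service account.

    Hypothetical helper; must run inside the cluster.
    """
    # Imported lazily: needs the `kubernetes` package and in-cluster credentials.
    from kubernetes import client, config

    config.load_incluster_config()
    auth = client.AuthorizationV1Api()
    results = {}
    for verb in verbs:
        review = auth.create_self_subject_access_review(
            access_review_body(verb, namespace)
        )
        results[verb] = bool(review.status.allowed)
    return results
```

Calling `check_pod_permissions()` from a Python shell in the pod should return a dictionary keyed by verb. With a `cluster-admin` binding in place, both entries would be expected to be allowed.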
   
   Afterward, the bash commands above successfully returned a list of pods in
the cluster. I then opened a Python shell (still within the `scheduler` pod)
and successfully ran
   
   ```python
   >>> from kubernetes import client, config
   >>> config.load_incluster_config()
   >>> v1 = client.CoreV1Api()
   >>> pods = v1.list_pod_for_all_namespaces(watch=False)
   >>> airflow_pods = v1.list_namespaced_pod("airflow")
   ```
   
   Given that this ran successfully, I'm at a loss as to why I'm still getting 
the `ECONNRESET` error.
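
One difference worth checking: the Python session above makes one-shot list calls, whereas the traceback fails inside `kubernetes/watch/watch.py` while reading a long-lived streaming response. The following is a minimal sketch for exercising that watch path from the same pod, assuming the `kubernetes` and `urllib3` packages already present in the image. The names `consume_with_restarts` and `watch_airflow_pods` are hypothetical, and the restart-on-error loop is an illustration only, not how Airflow's `KubernetesJobWatcher` handles the failure.

```python
import time


def consume_with_restarts(open_stream, retry_on, max_restarts=3, sleep=time.sleep):
    """Yield items from open_stream(), reopening the stream when a
    `retry_on` exception interrupts it, up to `max_restarts` times."""
    restarts = 0
    while True:
        try:
            for item in open_stream():
                yield item
            return  # the stream ended normally
        except retry_on:
            restarts += 1
            if restarts > max_restarts:
                raise
            sleep(restarts)  # simple linear backoff, in seconds


def watch_airflow_pods(namespace="airflow", timeout_seconds=300):
    """Stream pod events and print them (hypothetical helper; in-cluster only)."""
    # Imported lazily so consume_with_restarts stays usable without these packages.
    from kubernetes import client, config, watch
    from urllib3.exceptions import ProtocolError

    config.load_incluster_config()
    v1 = client.CoreV1Api()

    def open_stream():
        # Each restart begins a fresh watch without a resource version,
        # so events already seen may be delivered again.
        return watch.Watch().stream(
            v1.list_namespaced_pod, namespace, timeout_seconds=timeout_seconds
        )

    for event in consume_with_restarts(open_stream, ProtocolError):
        print(event["type"], event["object"].metadata.name)
```

If `watch_airflow_pods()` hits the same `ProtocolError` after a period of idleness while the list calls keep succeeding, that would point at the long-lived connection being reset rather than at RBAC.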



