aagateuip opened a new issue, #32111:
URL: https://github.com/apache/airflow/issues/32111

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   We have seen that KubernetesPodOperator sometimes fails to retrieve the JSON 
from the XCom sidecar container because of network connectivity issues, or in 
some cases retrieves incomplete JSON that cannot be parsed. The 
KubernetesPodOperator task then fails with one of these stack traces:
   
   e.g.
   
   `File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py",
 line 398, in execute
   result = self.extract_xcom(pod=self.pod)
   File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py",
 line 372, in extract_xcom
   result = self.pod_manager.extract_xcom(pod)
   File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py",
 line 369, in extract_xcom
   _preload_content=False,
   File 
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/stream/stream.py", 
line 35, in _websocket_request
   return api_method(*args, **kwargs)
   File 
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py",
 line 994, in connect_get_namespaced_pod_exec
   return self.connect_get_namespaced_pod_exec_with_http_info(name, namespace, 
**kwargs) # noqa: E501
   File 
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py",
 line 1115, in connect_get_namespaced_pod_exec_with_http_info
   collection_formats=collection_formats)
   File 
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py",
 line 353, in call_api
   _preload_content, _request_timeout, _host)
   File 
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py",
 line 184, in __call_api
   _request_timeout=_request_timeout)
   File 
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/stream/ws_client.py",
 line 518, in websocket_call
   raise ApiException(status=0, reason=str(e))
   kubernetes.client.exceptions.ApiException: (0)
   Reason: Connection to remote host was lost.`
   
   OR
   
   `
   File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py",
 line 398, in execute
       result = self.extract_xcom(pod=self.pod)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py",
 line 374, in extract_xcom
       return json.loads(result)
     File "/usr/local/lib/python3.7/json/__init__.py", line 348, in loads
       return _default_decoder.decode(s)
     File "/usr/local/lib/python3.7/json/decoder.py", line 337, in decode
       obj, end = self.raw_decode(s, idx=_w(s, 0).end())
     File "/usr/local/lib/python3.7/json/decoder.py", line 353, in raw_decode
       obj, end = self.scan_once(s, idx)
   json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 
4076 (char 4075)
   `
   
   We are using Airflow 2.6.1 and 
apache-airflow-providers-cncf-kubernetes==4.0.2
   
   ### What you think should happen instead
   
   KubernetesPodOperator should not fail on these intermittent network 
connectivity issues when pulling JSON from the XCom sidecar container. It should 
retry, and verify that it retrieved valid JSON, before it kills the XCom 
sidecar container.
   
   extract_xcom should:
   * Read the return JSON from /airflow/xcom/return.json and try to parse it, 
so that incomplete (truncated) JSON caused by network connectivity issues is 
detected.
   * Retry the read when it fails. This should also guard against other 
network errors that occur when the Kubernetes stream talks to the container 
to fetch the return JSON.
   * Kill the sidecar container only after the return JSON has been read and 
parsed successfully, i.e. it is valid.
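
A minimal sketch of the behaviour proposed above, not the provider's actual code. The function name `extract_xcom_with_retries` and the two callables are hypothetical: `read_raw` stands in for whatever reads /airflow/xcom/return.json from the sidecar, and `kill_sidecar` for whatever terminates it. The retry count and delay are arbitrary defaults.

```python
import json
import time
from typing import Any, Callable, Optional


def extract_xcom_with_retries(
    read_raw: Callable[[], str],
    kill_sidecar: Callable[[], None],
    attempts: int = 3,
    delay_seconds: float = 1.0,
) -> Any:
    """Read and parse the XCom JSON with retries; kill the sidecar only on success.

    Hypothetical helper for illustration. Both callables are assumptions
    supplied by the caller, not existing provider APIs.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            raw = read_raw()           # may raise on a dropped connection
            result = json.loads(raw)   # raises JSONDecodeError on truncated JSON
        except Exception as exc:       # broad on purpose in this sketch
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)
            continue
        # Only reached when the JSON was read and parsed successfully.
        kill_sidecar()
        return result
    raise RuntimeError(
        f"could not read valid XCom JSON after {attempts} attempts"
    ) from last_error
```

If every attempt fails, the sidecar is left running and the last error is chained to the raised exception, so the task fails without discarding the XCom file. A real implementation would probably catch only the specific exceptions seen in the traces above (`ApiException` and `JSONDecodeError`) rather than `Exception`.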
   
   ### How to reproduce
   
   This occurs intermittently, so it is hard to reproduce. It happens when the 
Kubernetes cluster is under load. We see it once or twice in 7 days.
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   Airflow 2.6.1 and apache-airflow-providers-cncf-kubernetes==4.0.2
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   This occurs intermittently, so it is hard to reproduce. It happens when the 
Kubernetes cluster is under load. We see it once or twice in 7 days.
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


