aagateuip opened a new issue, #32111: URL: https://github.com/apache/airflow/issues/32111
### Apache Airflow version

Other Airflow 2 version (please specify below)

### What happened

We have seen that KubernetesPodOperator sometimes fails to retrieve JSON from the xcom sidecar container because of network connectivity issues, or in some cases retrieves incomplete JSON that cannot be parsed. The KubernetesPodOperator task then fails with one of these stack traces, e.g.

`  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 398, in execute
    result = self.extract_xcom(pod=self.pod)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 372, in extract_xcom
    result = self.pod_manager.extract_xcom(pod)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 369, in extract_xcom
    _preload_content=False,
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/stream/stream.py", line 35, in _websocket_request
    return api_method(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 994, in connect_get_namespaced_pod_exec
    return self.connect_get_namespaced_pod_exec_with_http_info(name, namespace, **kwargs) # noqa: E501
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 1115, in connect_get_namespaced_pod_exec_with_http_info
    collection_formats=collection_formats)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/stream/ws_client.py", line 518, in websocket_call
    raise ApiException(status=0, reason=str(e))
kubernetes.client.exceptions.ApiException: (0)
Reason: Connection to remote host was lost.`

OR

`  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 398, in execute
    result = self.extract_xcom(pod=self.pod)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 374, in extract_xcom
    return json.loads(result)
  File "/usr/local/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 4076 (char 4075)`

We are using airflow 2.6.1 and apache-airflow-providers-cncf-kubernetes==4.0.2

### What you think should happen instead

KubernetesPodOperator should not fail on these intermittent network connectivity issues when pulling JSON from the xcom sidecar container. It should retry and verify that it retrieved valid JSON before it kills the xcom sidecar container. extract_xcom should:

* Read and try to parse the return JSON when it is read from /airflow/xcom/return.json, to catch errors so that, if there are network connectivity issues, we do not accept incomplete (truncated) JSON.
* Add retries when reading the JSON; hopefully this will also guard against other network errors (with the kubernetes stream trying to talk to the airflow container to get the return JSON).
* Only if the return JSON can be read and parsed (i.e. it is valid) should the code go ahead and kill the sidecar container.

### How to reproduce

This occurs intermittently, so it is hard to reproduce. It happens when the Kubernetes cluster is under load. In 7 days we see this happen once or twice.
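The behavior proposed under "What you think should happen instead" could be sketched roughly as follows. This is only an illustration of the read, validate, retry, then kill ordering, not the provider's actual implementation: `extract_json_with_retries`, `read_raw`, `kill_sidecar`, and `flaky_reader` are hypothetical names, and `read_raw` stands in for whatever call reads /airflow/xcom/return.json from the sidecar over the kubernetes stream.

```python
import json
import time
from typing import Any, Callable, Tuple, Type


def extract_json_with_retries(
    read_raw: Callable[[], str],          # hypothetical: reads return.json from the sidecar
    kill_sidecar: Callable[[], None],     # hypothetical: terminates the sidecar container
    attempts: int = 3,
    delay_seconds: float = 0.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> Any:
    """Read and parse the JSON, retrying on failure; kill the sidecar only on success."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            raw = read_raw()
            # json.JSONDecodeError is raised here for truncated payloads.
            result = json.loads(raw)
        except retry_on as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_seconds)
            continue
        # Reached only after a successful read and parse.
        kill_sidecar()
        return result
    # The sidecar is deliberately left running so the caller can decide what to do.
    raise RuntimeError(f"no valid JSON after {attempts} attempts") from last_error


def flaky_reader(outcomes):
    """Demo helper: build a reader that returns or raises each outcome in turn."""
    remaining = iter(outcomes)

    def read() -> str:
        item = next(remaining)
        if isinstance(item, BaseException):
            raise item
        return item

    return read


if __name__ == "__main__":
    reader = flaky_reader([
        ConnectionError("Connection to remote host was lost."),  # first failure mode
        '{"key": "val',                                           # truncated JSON
        '{"key": "value"}',                                       # complete JSON
    ])
    print(extract_json_with_retries(reader, lambda: print("sidecar killed")))
```

In this sketch both failure modes from the stack traces above (a lost connection and a truncated payload) lead to another read attempt, and the sidecar is left alive if every attempt fails. Whether the real operator should fail the task or fall back in that case is a design decision left open here.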
### Operating System

Debian GNU/Linux 11 (bullseye)

### Versions of Apache Airflow Providers

airflow 2.6.1 and apache-airflow-providers-cncf-kubernetes==4.0.2

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

_No response_

### Anything else

This occurs intermittently, so it is hard to reproduce. It happens when the Kubernetes cluster is under load. In 7 days we see this happen once or twice.

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org