jingsong commented on issue #5915: [AIRFLOW-5312] Fix timeout issue in pod launcher / KubernetesPodOperator
URL: https://github.com/apache/airflow/pull/5915#issuecomment-527636364
We investigated this a bit more and realized that even adding a timeout would
likely not solve the issue. Since `read_namespaced_pod_log` is called with
`follow=True`, the client holds a `keepalive` connection open to the Kubernetes
API in order to stream logs. Digging into the urllib3 code, we found it uses a
generator to "stream" the logs from the Kubernetes API back to the client.
Adding a timeout here could cause any of the following (see the sketch after
this list):
(1) tasks that use the KubernetesPodOperator and run longer than the
specified `kube_api_timeout_seconds` would always hit the timeout
(2) when the timeout fires, either:
a. an exception is raised, the pod logs are re-read on retry, and log
lines are duplicated, or
b. urllib3 does not respect the timeout on a `keepalive` connection
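For context, here is a minimal sketch of streaming pod logs with the Python
Kubernetes client and where such a timeout would be plumbed in. The pod name,
namespace, and timeout value are hypothetical; `_request_timeout` is the
client's generic per-call timeout knob, not necessarily what the PR uses:

```python
from kubernetes import client, config

# Minimal sketch, assuming an in-cluster or local kubeconfig is available;
# pod name, namespace, and the timeout value below are hypothetical.
config.load_kube_config()
v1 = client.CoreV1Api()

# follow=True keeps the HTTP connection open so the API server streams log
# lines back; _preload_content=False hands us the raw urllib3 response so we
# can iterate it line by line instead of buffering everything.
# _request_timeout is where a setting like kube_api_timeout_seconds would go:
# urllib3 applies it per socket read on the streamed connection, so a pod that
# stays silent longer than the timeout raises ReadTimeoutError even though the
# task itself is healthy (point 1 above).
resp = v1.read_namespaced_pod_log(
    name="example-pod",
    namespace="default",
    follow=True,
    _preload_content=False,
    _request_timeout=300,
)

for line in resp:  # urllib3.HTTPResponse is line-iterable; lines are bytes
    print(line.decode("utf-8").rstrip())
```

If that exception triggers a retry that re-opens the stream from the
beginning, every line already printed is emitted again, which is the
duplication described in (2a).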
@rolanddb Replying to your comment above: yes, we observe the same behavior.
However, having the worker pods retry ad infinitum may lead to repeated and
confusing logs, which defeats the purpose of collecting them. I'm also not
quite sure what you mean by `poll indefinitely for the status of launched
tasks`, since `read_namespaced_pod_log` reads logs, not the state of the task
pod itself. If the hang occurs inside `read_namespaced_pod_log`, this PR will
not address that specific issue.
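To make the status-vs-logs distinction concrete: polling pod status is a
separate API call from streaming logs, so a hang inside the log stream cannot
be cured by changing how status is polled. A minimal sketch of that separation
(the helper name and polling interval are illustrative, not Airflow's code):

```python
import time

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def wait_for_pod(name: str, namespace: str = "default",
                 interval: float = 5.0) -> str:
    """Hypothetical helper: poll the pod *status* until it terminates.

    Each read_namespaced_pod_status call is a short request/response, unlike
    read_namespaced_pod_log(follow=True), which blocks on one open connection.
    """
    while True:
        pod = v1.read_namespaced_pod_status(name=name, namespace=namespace)
        phase = pod.status.phase  # Pending / Running / Succeeded / Failed
        if phase in ("Succeeded", "Failed"):
            return phase
        time.sleep(interval)
```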