jayachandrakasarla opened a new pull request, #68450:
URL: https://github.com/apache/airflow/pull/68450
Closes #68445
### Problem
When KubernetesPodOperator is configured with `init_container_logs=True`,
the task hangs indefinitely if the pod never leaves the Pending phase (e.g. due
to an invalid node_selector, missing node pool, or resource exhaustion).
With `init_container_logs=False`, `PodLaunchTimeoutException` is raised
correctly after `startup_timeout_seconds / schedule_timeout_seconds`. With
`init_container_logs=True`, the task never times out and the pod is never
cleaned up.
You can reproduce the issue using the following DAG code:
```python
from __future__ import annotations

from pendulum import datetime
from airflow.sdk import dag
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s


@dag(
    dag_id="kpo_pending_init_container_logs",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
)
def kpo_pending_init_container_logs():
    KubernetesPodOperator(
        task_id="kpo_pending_with_init_logs",
        name="kpo-pending-with-init-logs",
        namespace="default",
        image="busybox:1.36",
        cmds=["sh", "-c"],
        arguments=["echo main container should never start; sleep 30"],
        deferrable=False,
        # Setting this to True makes the task hang indefinitely.
        init_container_logs=True,
        init_containers=[
            k8s.V1Container(
                name="init-hello",
                image="busybox:1.36",
                command=["sh", "-c"],
                args=["echo init container should never start; sleep 30"],
            )
        ],
        # Schedule the pod on a non-existent node so it stays in the Pending phase.
        node_selector={
            "airflow-repro-node": "does-not-exist",
        },
        # Expected behavior: should fail after timeout.
        # Bug: hangs forever when init_container_logs=True.
        startup_timeout_seconds=30,
        schedule_timeout_seconds=30,
        get_logs=True,
        is_delete_operator_pod=False,
        in_cluster=False,
        config_file="/files/kube/config",
    )


kpo_pending_init_container_logs()
```
### Fix
`self.await_pod_start()` now runs before
`self.await_init_containers_completion()`, so the pod must have fully started
before KPO attempts to stream init container logs. This prevents the hang that
occurred when init container log streaming was triggered against a pod still in
the Pending phase.
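
To illustrate why the ordering matters, here is a minimal, self-contained sketch. It does not use the provider's code: `FakePod`, `PodLaunchTimeout`, `stream_init_container_logs`, and `launch` are hypothetical stand-ins, and the log-following step is modelled as a call that would never be reached for a stuck pod. The assumption is that the startup wait is the step that enforces the timeout.

```python
import time


class PodLaunchTimeout(Exception):
    """Hypothetical stand-in for the provider's PodLaunchTimeoutException."""


class FakePod:
    """Toy pod whose phase is read from a fixed sequence, one entry per poll."""

    def __init__(self, phases):
        self._phases = list(phases)
        self.log_stream_started = False

    def read_phase(self):
        # Return the next phase; keep repeating the last one once exhausted.
        if len(self._phases) > 1:
            return self._phases.pop(0)
        return self._phases[0]


def await_pod_start(pod, timeout_seconds, poll_interval=0.01):
    """Poll until the pod leaves Pending, or raise once the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while pod.read_phase() == "Pending":
        if time.monotonic() >= deadline:
            raise PodLaunchTimeout(f"pod still Pending after {timeout_seconds}s")
        time.sleep(poll_interval)


def stream_init_container_logs(pod):
    """Placeholder for the log-following step (assumed to enforce no timeout)."""
    pod.log_stream_started = True


def launch(pod, timeout_seconds):
    # Order after the fix: wait for startup first, then follow init container logs.
    await_pod_start(pod, timeout_seconds)
    stream_init_container_logs(pod)
```

With this ordering, a pod that never leaves Pending fails in `await_pod_start` and the log-following step is never entered, while a pod that does start proceeds to log streaming as before.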
Was generative AI tooling used to co-author this PR?
[X] Yes
Used Claude Sonnet to understand the codebase and assist with implementing
the changes.