jayachandrakasarla opened a new pull request, #68450:
URL: https://github.com/apache/airflow/pull/68450

   Closes #68445 
   
   ### Problem
   When KubernetesPodOperator is configured with `init_container_logs=True`, 
the task hangs indefinitely if the pod never leaves the Pending phase (e.g. due 
to an invalid node_selector, missing node pool, or resource exhaustion).
   
   With `init_container_logs=False`, `PodLaunchTimeoutException` is raised 
correctly once `startup_timeout_seconds` or `schedule_timeout_seconds` elapses. With 
`init_container_logs=True`, the task never times out and the pod is never 
cleaned up.
   
   You can reproduce the issue using the following DAG code:
   ```python
   from __future__ import annotations
   
   from pendulum import datetime
   
   from airflow.sdk import dag
   from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
   from kubernetes.client import models as k8s
   
   
   @dag(
       dag_id="kpo_pending_init_container_logs",
       start_date=datetime(2025, 1, 1),
       schedule=None,
       catchup=False,
   )
   def kpo_pending_init_container_logs():
       KubernetesPodOperator(
           task_id="kpo_pending_with_init_logs",
           name="kpo-pending-with-init-logs",
           namespace="default",
           image="busybox:1.36",
           cmds=["sh", "-c"],
           arguments=["echo main container should never start; sleep 30"],
   
           deferrable=False,
   
           # Setting this to True makes the task hang indefinitely.
           init_container_logs=True,
           init_containers=[
               k8s.V1Container(
                   name="init-hello",
                   image="busybox:1.36",
                   command=["sh", "-c"],
                   args=["echo init container should never start; sleep 30"],
               )
           ],
   
           # Schedule the pod on a non-existent node so that it stays
           # in the Pending phase.
           node_selector={
               "airflow-repro-node": "does-not-exist",
           },
   
           # Expected behavior: should fail after timeout.
           # Bug: hangs forever when init_container_logs=True.
           startup_timeout_seconds=30,
           schedule_timeout_seconds=30,
   
           get_logs=True,
           is_delete_operator_pod=False,
           in_cluster=False,
           config_file="/files/kube/config"
       )
   
   
   kpo_pending_init_container_logs()
   ```
   
   ### Fix
   `self.await_pod_start()` now runs before 
`self.await_init_containers_completion()`, so the pod has fully started before 
KPO attempts to stream init container logs. Previously, init container log 
streaming could be triggered against a pod still in the Pending phase, which 
made KPO hang instead of raising the startup timeout.
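
The effect of the reordering can be illustrated with a minimal, self-contained sketch. This is not the operator's actual code: `FakePod`, `PodLaunchTimeout`, `run`, and the function bodies below are hypothetical stand-ins that only mimic the call order described above, assuming the startup check polls the pod phase and raises once the timeout elapses.

```python
from __future__ import annotations

import time


class PodLaunchTimeout(Exception):
    """Hypothetical stand-in for PodLaunchTimeoutException."""


class FakePod:
    """Hypothetical pod that never leaves the Pending phase."""

    phase = "Pending"


def await_pod_start(pod: FakePod, timeout: float, poll_interval: float = 0.01) -> None:
    """Poll the pod phase and raise if it is still Pending at the deadline."""
    deadline = time.monotonic() + timeout
    while pod.phase == "Pending":
        if time.monotonic() >= deadline:
            raise PodLaunchTimeout(f"pod still Pending after {timeout}s")
        time.sleep(poll_interval)


def await_init_containers_completion(pod: FakePod, calls: list[str]) -> None:
    """Stand-in for init container log streaming; it only records that it ran."""
    calls.append("init_logs")


def run(pod: FakePod, timeout: float, calls: list[str]) -> None:
    # Ordering after the fix: startup check first, init container logs second.
    await_pod_start(pod, timeout)
    await_init_containers_completion(pod, calls)


calls: list[str] = []
try:
    run(FakePod(), timeout=0.05, calls=calls)
    outcome = "started"
except PodLaunchTimeout:
    outcome = "timed out"
```

In this sketch, the startup check fails before the log-streaming step is ever reached, so a pod stuck in Pending ends in a timeout rather than a blocked log stream.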
   
   Was generative AI tooling used to co-author this PR?
   
   [X] Yes 
   
   Used Claude Sonnet to understand the codebase and assist with implementing 
the changes.

