amoghrajesh opened a new issue, #67934: URL: https://github.com/apache/airflow/issues/67934
## Problem

`SparkSubmitOperator` with `track_driver_via_k8s_api=True` detects job completion by watching `pod.status.phase`. This breaks in two ways when the driver pod has sidecar containers:

1. The pod phase stays `Running` after the driver container exits, because the sidecar is still alive, so the poll loop never sees `Succeeded` and waits indefinitely.
2. On the `Failed` branch, `container_statuses[0]` is used to extract the exit code and reason, but index 0 is not guaranteed to be the driver container in a multi-container pod.

## When it occurs

Only when Istio (or another sidecar container such as Fluent Bit) is injected into the **driver pod**.

## Proposed fix

Filter `container_statuses` by the driver container name (`spark-kubernetes-driver` is the Spark default) instead of relying on `pod.status.phase` or index 0:

- Treat the driver container's `state.terminated.exit_code == 0` as success.
- Treat `exit_code != 0` as failure, with the actual exit code and reason in the error message.
- Fall back to `pod.status.phase` if the container name is not found (defensive).

The driver container name could be made configurable via a new `k8s_driver_container_name` parameter defaulting to `spark-kubernetes-driver`.

## Workaround

Set `execution_timeout` on the operator so that a task stuck in the poll loop is eventually failed instead of waiting indefinitely. This is documented in the requirements section of the operator docs.

## Related

Introduced in: #67715
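The proposed fix could look roughly like the sketch below. It is not the operator's actual code: `driver_outcome` and its return values are hypothetical names chosen for illustration, and the pod is assumed to expose the same attributes as the Kubernetes Python client's pod object (`status.phase`, `status.container_statuses`, `state.terminated.exit_code`, `state.terminated.reason`). The example uses `SimpleNamespace` stand-ins so it runs without a cluster, and `istio-proxy` is only an illustrative sidecar name.

```python
from types import SimpleNamespace

# Spark's default driver container name, as stated in the issue.
DEFAULT_DRIVER_CONTAINER = "spark-kubernetes-driver"


def driver_outcome(pod, driver_container_name=DEFAULT_DRIVER_CONTAINER):
    """Hypothetical helper: return (state, detail) for the driver container.

    state is one of "running", "succeeded", "failed".
    """
    statuses = pod.status.container_statuses or []
    # Select the driver by name instead of trusting index 0.
    driver = next((cs for cs in statuses if cs.name == driver_container_name), None)

    if driver is None:
        # Defensive fallback: container name not found, use the pod phase.
        phase = pod.status.phase
        if phase == "Succeeded":
            return "succeeded", None
        if phase == "Failed":
            return "failed", "pod phase is Failed; driver container not found"
        return "running", None

    terminated = driver.state.terminated if driver.state else None
    if terminated is None:
        # Driver container has not exited yet.
        return "running", None
    if terminated.exit_code == 0:
        return "succeeded", None
    return "failed", f"exit code {terminated.exit_code}, reason: {terminated.reason}"


# Stand-in pod: sidecar still running, driver exited cleanly, phase stuck at Running.
sidecar = SimpleNamespace(name="istio-proxy", state=SimpleNamespace(terminated=None))
driver = SimpleNamespace(
    name=DEFAULT_DRIVER_CONTAINER,
    state=SimpleNamespace(terminated=SimpleNamespace(exit_code=0, reason="Completed")),
)
pod = SimpleNamespace(
    status=SimpleNamespace(phase="Running", container_statuses=[sidecar, driver])
)
print(driver_outcome(pod))  # ('succeeded', None)
```

Returning a small state tuple rather than raising keeps the sketch independent of the operator's poll loop; a real implementation would presumably raise the operator's usual exception on the `"failed"` branch and pass the new `k8s_driver_container_name` parameter as `driver_container_name`.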
