SameerMesiah97 opened a new pull request, #60532: URL: https://github.com/apache/airflow/pull/60532
**Description**

This change refactors `watch_pod_events` so that it continues watching events for the full lifecycle of the target pod, rather than stopping after a single watch stream terminates. The new implementation now:

- Reconnects automatically when a watch stream terminates (e.g. server-side timeout).
- Resumes watching from the last observed `resourceVersion`.
- Handles Kubernetes 410 Gone errors by restarting the watch from the current state.
- Terminates cleanly when the pod completes or is deleted.

This ensures that `watch_pod_events` continues yielding events for the full lifecycle of a pod instead of silently stopping after `timeout_seconds`. (A minimal sketch of this reconnect loop is included at the end of this description.)

**Rationale**

The Kubernetes Watch API enforces server-side timeouts, meaning a single watch stream is not guaranteed to remain open indefinitely. The previous implementation treated `timeout_seconds` as an implicit upper bound on the total duration of event streaming, causing the generator to stop yielding events after the first watch termination, even while the pod was still running. This behavior is surprising and contradicts what users reasonably expect from the method name (`watch_pod_events`), the docstring, and standard Kubernetes watch semantics.

The updated implementation aligns with Kubernetes best practices by treating watch termination as a recoverable condition and transparently reconnecting until the pod reaches a terminal lifecycle state.

**Backwards Compatibility**

This change does **not** alter the public API or method signature. However, it does change runtime behavior:

- `timeout_seconds` now applies only to individual watch connections, not to the overall duration of event streaming.
- Event streaming continues until pod completion or deletion instead of stopping silently after a timeout.

While it is possible that some users rely on the previous behavior, it is more likely that existing deployments have implemented workarounds (e.g. external loops or polling) to compensate for the premature termination. The new behavior is consistent with the documented intent and with Kubernetes conventions, and therefore adheres to the principle of least surprise.

**Tests**

Added unit tests to validate the following expected behaviors:

- Reconnects and continues streaming events after a watch stream ends (e.g. timeout).
- Restarts the watch when Kubernetes returns 410 Gone due to a stale `resourceVersion`.
- Stops cleanly when the pod is deleted (404).
- Stops immediately when the pod reaches a terminal phase (Succeeded or Failed).

Existing tests have been updated to account for the addition of pod state inspection in `watch_pod_events`.

**Notes**

- `_load_config` is now cached and is responsible only for loading configuration; it no longer returns an API client. API client instantiation is now solely the responsibility of `get_conn`, enabling reconnection in `watch_pod_events` without redundant configuration reloads. The internal helper used to construct and return an API client from `_load_config` has been removed. (This split is sketched at the end of this description.)
- The exception message raised when multiple configuration sources are supplied has been clarified to more accurately describe the error.
- Polling fallback behavior is preserved and now continues until the pod reaches a terminal lifecycle state, matching the updated watch semantics.

Closes: #60495
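
For reviewers, here is a minimal, standalone sketch of the reconnect-and-resume loop described above, written against the `kubernetes` Python client. It is illustrative only, not the hook's actual code: the function name `watch_pod_events_sketch` and its parameters are made up for this example, and the exact way 410/404 errors surface can vary between client versions.

```python
# Minimal sketch (not the PR's implementation): a reconnecting pod-event watch
# loop. `watch_pod_events_sketch` and its parameters are hypothetical.
from kubernetes import client, watch
from kubernetes.client.exceptions import ApiException


def watch_pod_events_sketch(
    v1: client.CoreV1Api,
    name: str,
    namespace: str,
    timeout_seconds: int = 60,
):
    """Yield watch events for one pod until it is deleted or reaches a terminal phase."""
    resource_version = None
    while True:
        w = watch.Watch()
        try:
            for event in w.stream(
                v1.list_namespaced_pod,
                namespace=namespace,
                field_selector=f"metadata.name={name}",
                resource_version=resource_version,
                timeout_seconds=timeout_seconds,  # bounds a single connection only
            ):
                pod = event["object"]
                # Track progress so a reconnect can resume instead of replaying history.
                resource_version = pod.metadata.resource_version
                yield event
                if event["type"] == "DELETED" or pod.status.phase in ("Succeeded", "Failed"):
                    return
        except ApiException as exc:
            if exc.status == 410:
                # resourceVersion too old: restart the watch from the current state.
                resource_version = None
                continue
            if exc.status == 404:
                # Pod is gone: stop cleanly.
                return
            raise
        finally:
            w.stop()
        # Stream ended without a terminal event (e.g. server-side timeout): reconnect.
```

In this shape, `timeout_seconds` bounds only a single connection; the outer `while True` loop is what keeps events flowing for the pod's full lifecycle.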
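
And a small sketch of the configuration/client split mentioned in the notes, assuming a hypothetical hook class (`PodWatcherHookSketch` and its attributes are invented for illustration): configuration is loaded once and cached, while `get_conn` builds a fresh API client on each call.

```python
# Minimal sketch (not the actual hook): separating cached configuration loading
# from API-client construction. Class and attribute names are invented here.
from __future__ import annotations

from kubernetes import client, config


class PodWatcherHookSketch:
    def __init__(self, in_cluster: bool = False, config_file: str | None = None):
        self.in_cluster = in_cluster
        self.config_file = config_file
        self._config_loaded = False

    def _load_config(self) -> None:
        """Load Kubernetes configuration once; never builds or returns a client."""
        if self._config_loaded:
            return
        if self.in_cluster:
            config.load_incluster_config()
        else:
            config.load_kube_config(config_file=self.config_file)
        self._config_loaded = True

    def get_conn(self) -> client.CoreV1Api:
        """Return a fresh API client; safe to call repeatedly when reconnecting."""
        self._load_config()
        return client.CoreV1Api()
```

Because client creation no longer piggybacks on configuration loading, a reconnecting watch can request a new client after each dropped stream without re-reading kubeconfig.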
