wolfdn opened a new pull request, #66705:
URL: https://github.com/apache/airflow/pull/66705

   ## Problem
   
   When running Kubernetes Pods in deferred mode, the triggerer can return an 
`error` event due to transient communication issues with the Kubernetes API 
(e.g. timeouts, connection resets) while the pod is still running normally. In 
the current implementation, this fails the task immediately, regardless of the 
pod's actual state.
   
   Additionally, if the trigger emits an `error` event but the base container 
has actually completed successfully (exit code 0), the task is unnecessarily 
marked as failed.
   
   ## Solution
   
   ### 1. Re-defer on transient errors when pod is still alive
   
   When `trigger_reentry` receives an `error` event, it now checks the actual 
pod state before failing the task. If the pod's base container is still 
running or waiting, or the pod is pending (and there are no fatal issues such 
as `InvalidImageName`), the task is re-deferred to the triggerer to continue 
monitoring.
   
   A `MAX_REDEFER_ATTEMPTS = 3` limit prevents infinite re-defer loops if the 
Kubernetes API is persistently unreachable. The re-defer count is tracked via 
`trigger_kwargs`, which round-trips through the trigger's emitted events.
   
   Re-deferring is scoped to `"error"` events only. `"timeout"` and `"failed"` 
events represent deliberate decisions by the trigger (pod launch timeout, 
container failure) and are not retried.
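
The decision described above can be sketched roughly as follows. This is an illustration of the rules in this description, not the code in the PR: `PodSnapshot`, `should_redefer`, and `FATAL_WAITING_REASONS` are hypothetical names, and the `>=` comparison against the limit is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

MAX_REDEFER_ATTEMPTS = 3  # limit named in this description

# Hypothetical set: the description names only InvalidImageName as an example.
FATAL_WAITING_REASONS = {"InvalidImageName"}


@dataclass
class PodSnapshot:
    """Hypothetical, simplified view of pod state (not a Kubernetes client type)."""

    phase: str                       # e.g. "Pending", "Running", "Succeeded"
    container_state: Optional[str]   # "running", "waiting", "terminated" or None
    waiting_reason: Optional[str] = None


def should_redefer(event_status: str, pod: PodSnapshot, redefer_count: int) -> bool:
    """Return True if the task should go back to the triggerer."""
    if event_status != "error":
        # "timeout" and "failed" are deliberate trigger decisions: never retried.
        return False
    if redefer_count >= MAX_REDEFER_ATTEMPTS:
        # Assumed comparison: stop once the limit has been reached.
        return False
    if pod.waiting_reason in FATAL_WAITING_REASONS:
        # The pod cannot recover from this on its own.
        return False
    return pod.phase == "Pending" or pod.container_state in ("running", "waiting")
```

In this sketch the caller would increment the counter and pass it back through `trigger_kwargs` on each re-deferral, so the limit survives the round trip through the triggerer.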
   
   ### 2. Treat error as success when container actually succeeded
   
   If the trigger emits an `error` event but the base container has already 
terminated with exit code 0, the task is now treated as successful instead of 
failing.
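
As a minimal sketch of this rule, assuming the base container's exit code has already been read from the pod status (`outcome_for_error_event` is a hypothetical helper, not a function from the provider):

```python
from typing import Optional


def outcome_for_error_event(base_exit_code: Optional[int]) -> str:
    """Map an `error` event to a task outcome using the base container's exit code.

    `base_exit_code` is None when the container has not terminated.
    """
    if base_exit_code == 0:
        # The container finished cleanly, so the error event is treated as noise.
        return "success"
    return "failed"
```

A container that has not terminated (`None`) falls through to `"failed"` here only because the re-defer check from section 1 is assumed to have run first.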
   
   ### 3. Remove misleading `except TaskDeferred: raise`
   
   The old code had `except TaskDeferred: raise` before the `finally` block, 
which appeared to prevent `_clean()` from running during re-deferral. In 
reality, Python's `finally` always executes, even after a re-raised exception, 
so this guard was ineffective. The re-defer logic is now placed **before** 
the `try/finally` block, ensuring `_clean()` (which may delete the 
still-running pod) does not execute on re-deferral.
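
The `finally` behaviour can be reproduced in isolation. The snippet below is a standalone illustration, not the operator code: `TaskDeferred` is a local stand-in for Airflow's exception, and `old_shape`, `new_shape`, and `run` are hypothetical names, with the `"clean"` marker standing in for `_clean()`.

```python
class TaskDeferred(Exception):
    """Local stand-in for Airflow's TaskDeferred so the snippet runs on its own."""


def old_shape(calls: list) -> None:
    try:
        raise TaskDeferred()    # re-deferral requested inside the try block
    except TaskDeferred:
        raise                   # re-raising does not skip the finally clause
    finally:
        calls.append("clean")   # stands in for _clean()


def new_shape(calls: list, redefer: bool) -> None:
    if redefer:
        raise TaskDeferred()    # raised before try/finally, so no cleanup runs
    try:
        pass                    # normal completion path
    finally:
        calls.append("clean")


def run(fn, *args) -> list:
    """Call fn and record, in order, whether cleanup ran and whether it deferred."""
    calls: list = []
    try:
        fn(calls, *args)
    except TaskDeferred:
        calls.append("deferred")
    return calls
```

With these definitions, `run(old_shape)` records the cleanup before the deferral, while `run(new_shape, True)` records only the deferral.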
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [x] Yes (please specify the tool below)
     GitHub Copilot - Claude Opus 4.6
   
   

