Hi all,

I’d like to start a discussion around Task Group retries.

Issue: https://github.com/apache/airflow/issues/21867
PR: https://github.com/apache/airflow/pull/61809

This PR introduces a proof of concept for TaskGroup retries, allowing a
whole TaskGroup to be retried as a unit rather than relying only on
individual task retries.

In addition to standard retry parameters (retries, retry_delay, exponential
backoff, etc.), this proposal introduces TaskGroup-specific retry
semantics, including:


   - *retry_condition*: allows defining when a group should be retried
   (e.g., based on aggregated task states), enabling more flexible policies
   than simple failure-based retries.
   - *retry_fast_fail*: enables fail-fast behavior within the group, so
   that once a retry-triggering condition is met, the group can short-circuit
   remaining tasks and move directly to retry handling.


The implementation adds retry configuration to TaskGroup, introduces a
task_group_instance model to persist retry state per DagRun, and includes
scheduler logic to evaluate retry conditions, enforce delay/backoff, and
clear group tasks for subsequent attempts. The feature is opt-in and does
not affect existing DAGs unless configured.

I’d appreciate feedback on:


   - The proposed API.
   - The scheduler and state-management approach.
   - The new model/migration.
   - Whether the retry semantics feel intuitive and consistent with
   existing task-level retries.
   - ..


If there is general agreement on the direction, I’m happy to continue
refining the implementation.

Best,
Jorge

Reply via email to