Hi all,

I’d like to start a discussion around Task Group retries.

Issue: https://github.com/apache/airflow/issues/21867
PR: https://github.com/apache/airflow/pull/61809

This PR introduces a proof of concept for TaskGroup retries, allowing a whole 
TaskGroup to be retried as a unit rather than relying only on individual task 
retries.

In addition to standard retry parameters (retries, retry_delay, exponential 
backoff, etc.), this proposal introduces TaskGroup-specific retry semantics, 
including:


  *
retry_condition: allows defining when a group should be retried (e.g., based on 
aggregated task states), enabling more flexible policies than simple 
failure-based retries.
  *
retry_fast_fail: enables fail-fast behavior within the group, so that once a 
retry-triggering condition is met, the group can short-circuit remaining tasks 
and move directly to retry handling.

The implementation adds retry configuration to TaskGroup, introduces a 
task_group_instance model to persist retry state per DagRun, and includes 
scheduler logic to evaluate retry conditions, enforce delay/backoff, and clear 
group tasks for subsequent attempts. The feature is opt-in and does not affect 
existing DAGs unless configured.

I’d appreciate feedback on:


  *
The proposed API.
  *
The scheduler and state-management approach.
  *
The new model/migration.
  *
Whether the retry semantics feel intuitive and consistent with existing 
task-level retries.
  *
..

If there is general agreement on the direction, I’m happy to continue refining 
the implementation.

Best,
Jorge

Reply via email to