RE: Re: [DISCUSS] Task Group Retries

Jorge Rocamora García Wed, 18 Feb 2026 16:17:35 -0800

Hi all,

I’d like to clarify that several concrete use cases were already described
in the original issue: https://github.com/apache/airflow/issues/21867


One important aspect is that with the deprecation of SubDAGs in favor of
TaskGroups, some retry semantics were lost.

In my specific case, I’m using the KubernetesPodOperator, where different
steps must run in separate pods because they depend on different software.
However, conceptually, the entire block needs to behave as a single
logical unit. For example:

- A: Create a PersistentVolumeClaim (PVC) to share data
- B: Retrieve and prepare inputs
- C: Run the analysis
- D: Remove the PVC

This pattern was previously achievable with SubDAGs, but there is currently
no straightforward mechanism that preserves this grouped execution and
retry behavior.
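
To make the intended behavior concrete, here is a minimal plain-Python sketch of
"retry the whole block as a unit". It does not use Airflow or the API proposed in
the PR; the helper run_group_with_retries and the four step functions are
hypothetical stand-ins for steps A-D above, and the flaky analysis step is
simulated.

```python
from typing import Callable, Sequence


def run_group_with_retries(steps: Sequence[Callable[[], None]], retries: int) -> int:
    """Run all steps in order; if any step raises, restart from the first step.

    Returns the number of attempts used. Re-raises the last error once the
    retry budget is exhausted. (Hypothetical helper, not an Airflow API.)
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            for step in steps:
                step()
            return attempt
        except Exception:
            if attempt > retries:
                raise


# Simulated steps: each one only records that it ran.
log: list[str] = []
calls = {"analysis": 0}


def create_pvc() -> None:
    log.append("A")


def prepare_inputs() -> None:
    log.append("B")


def run_analysis() -> None:
    # Assumption for the demo: the analysis fails once, then succeeds.
    calls["analysis"] += 1
    if calls["analysis"] == 1:
        raise RuntimeError("transient failure")
    log.append("C")


def remove_pvc() -> None:
    log.append("D")


attempts = run_group_with_retries(
    [create_pvc, prepare_inputs, run_analysis, remove_pvc], retries=2
)
```

With task-level retries only step C would be re-run; here the second attempt
starts again from A. A real setup would also have to decide whether the PVC is
torn down between attempts, which this sketch leaves out.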

Best regards,
Jorge

On 2026/02/18 22:20:10 Daniel Standish via dev wrote:
> Yeah I think arguing that there’s a need for it with use cases is a good
> idea.
>
>
> On Wed, Feb 18, 2026 at 12:02 PM Natanel <[email protected]> wrote:
>
> > Hello, I have skimmed over the PR, overall I have to say that it looks
> > good.
> > I have yet to find a use case for this (as I just can't think of one)
> > where I find the feature useful, and I will appreciate it if you could
> > give an example use case for the feature, as it looks like quite a bit of
> > changes have been introduced (including a new table and new dependency
> > types) for a feature which allows for task groups to be retried.
> >
> > I would love to hear about what the use case of the feature is, as I
> > just can't think of one. I think that it might be simpler to implement if
> > we do something like a composite task instance, yet I do not want to
> > propose anything before I hear more about the use case, as I am most
> > likely just missing something.
> >
> > Best regards,
> > Natanel.
> >
> > On Wed, 18 Feb 2026 at 17:49, Jorge Rocamora García <
> > [email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I’d like to start a discussion around Task Group retries.
> > >
> > > Issue: https://github.com/apache/airflow/issues/21867
> > > PR: https://github.com/apache/airflow/pull/61809
> > >
> > > > This PR introduces a proof of concept for TaskGroup retries, allowing
> > > > a whole TaskGroup to be retried as a unit rather than relying only on
> > > > individual task retries.
> > >
> > > > In addition to standard retry parameters (retries, retry_delay,
> > > > exponential backoff, etc.), this proposal introduces
> > > > TaskGroup-specific retry semantics, including:
> > > >
> > > > * retry_condition: allows defining when a group should be retried
> > > >   (e.g., based on aggregated task states), enabling more flexible
> > > >   policies than simple failure-based retries.
> > > > * retry_fast_fail: enables fail-fast behavior within the group, so
> > > >   that once a retry-triggering condition is met, the group can
> > > >   short-circuit remaining tasks and move directly to retry handling.
> > >
> > > > The implementation adds retry configuration to TaskGroup, introduces a
> > > > task_group_instance model to persist retry state per DagRun, and
> > > > includes scheduler logic to evaluate retry conditions, enforce
> > > > delay/backoff, and clear group tasks for subsequent attempts. The
> > > > feature is opt-in and does not affect existing DAGs unless configured.
> > >
> > > > I’d appreciate feedback on:
> > > >
> > > > * The proposed API.
> > > > * The scheduler and state-management approach.
> > > > * The new model/migration.
> > > > * Whether the retry semantics feel intuitive and consistent with
> > > >   existing task-level retries.
> > > > * ..
> > >
> > > If there is general agreement on the direction, I’m happy to continue
> > > refining the implementation.
> > >
> > > Best,
> > > Jorge
> > >
> > >
> >
>
