Re: [DISCUSS] Feedback on DAG-level full-run retries (issue 60866)

Jens Scheffler Sun, 19 Apr 2026 13:25:27 -0700

Hi,

as nobody else has answered on the DISCUSS thread, let me try to break the ice. I have already commented on the PR.

I am not a big fan of adding more parameters for retries, as I assume a lot of options already exist, although mainly on task level.

My proposal in general would be to model a pipeline in a way that all tasks are idempotent, so that the full pipeline does not need to be retried. This is a matter of cost as well as of time. If you need to re-run the full chain, then either this smells like the pipeline is badly modelled (e.g. tasks are not idempotent) or it is actually a re-run with changed parameters (maybe it was started wrongly). A technical need to re-run all ... might also be a backfill case? So I am not seeing a strong case for a feature that would have been missed in the last 10 years.
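To make the idempotency point concrete, here is a minimal sketch in plain Python with no Airflow imports. The function name `write_partition` and the idea of keying the output path on a run key are assumptions for illustration only, not anything from the issue or the PR. The point is that a task-level retry overwrites the same output instead of adding a second copy, so nothing upstream has to run again.

```python
import json
import os
import tempfile
from pathlib import Path


def write_partition(base_dir, run_key, rows):
    """Write the rows of one run to a path derived only from run_key.

    Hypothetical helper: re-running it with the same inputs replaces the
    same file, so a retry leaves exactly one, identical output behind.
    """
    target = Path(base_dir) / f"{run_key}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first and rename it afterwards, so a crash
    # in the middle of the write never leaves a half-written partition.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(rows, tmp, sort_keys=True)
    os.replace(tmp_name, target)
    return target
```

If every task in the chain behaves like this, retrying only the failed task is enough, which is the cost and time argument made above.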

If there actually is one (and please convince me with the right arguments), then I would still ask to consider the following options:


 * Does the workflow mainly require some preparation before and maybe
   some finalization afterwards? Then the "Setup/Teardown" tasks might
   be a good fit. Especially if the pipeline is only 3 tasks, you can
   use this to ensure everything is re-running.
 * You could also attempt to fix this without changes in the scheduler
   via an on_failure_callback (see
   https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/callbacks.html#callback-types)
   and hook in a function that clears all tasks via API, attaching this
   callback as default to all tasks or to the Dag at the end.
 * Instead of extending the Dag and Scheduler logic, I could imagine
   implementing a "QualityCheckOperator" that takes a condition and,
   if the quality criteria are not met, makes a "Clear DagRun" call
   via API. This would not require additional Dag parameters or any
   extensions to the scheduler, as the API can be called from an
   Operator instead.
 * I could also imagine that the request raised was naming a Dag, but
   a moment later somebody will ask for the same with only a set of
   tasks. So another alternative could be a "TransactionTaskGroup"
   which would treat all tasks in that task group as one combined
   transaction. If one is cleared or one needs a retry, all are
   retried together. Then you could apply this to a subset of tasks
   or, if all tasks are in that group, to the full Dag.
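The callback and "QualityCheckOperator" options above both boil down to one "clear this Dag run" call against the REST API. Below is a rough sketch of that call using only the Python standard library. Several things in it are assumptions rather than facts from this thread: the base URL, the `/api/v2` prefix and the bearer-token authentication depend on the deployment and Airflow version; the `/dags/{dag_id}/dagRuns/{dag_run_id}/clear` path and its body fields should be checked against the REST API reference for the version in use; and `make_clear_callback`, `get_reruns` and `MAX_FULL_RERUNS` are hypothetical names. The rerun budget is there because a callback that clears the run on every failure would otherwise loop forever.

```python
import json
import urllib.parse
import urllib.request

# Assumed values for illustration; adjust to the actual deployment.
AIRFLOW_API = "http://localhost:8080/api/v2"
MAX_FULL_RERUNS = 2  # hypothetical budget to avoid endless clear loops


def build_clear_request(dag_id, run_id, token, only_failed=False):
    """Build (but do not send) a POST that clears one whole Dag run."""
    # Run ids may contain ':' or '+', so quote both path segments.
    url = "{}/dags/{}/dagRuns/{}/clear".format(
        AIRFLOW_API,
        urllib.parse.quote(dag_id, safe=""),
        urllib.parse.quote(run_id, safe=""),
    )
    body = json.dumps({"dry_run": False, "only_failed": only_failed})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def make_clear_callback(token, get_reruns, send=urllib.request.urlopen):
    """Return a callback(context) usable as an on_failure_callback.

    get_reruns(dag_id, run_id) is a hypothetical hook that reports how
    often this run was already cleared; where that counter is stored is
    left open here.
    """

    def callback(context):
        dag_run = context["dag_run"]
        if get_reruns(dag_run.dag_id, dag_run.run_id) >= MAX_FULL_RERUNS:
            return None  # budget spent: leave the run failed
        return send(build_clear_request(dag_run.dag_id, dag_run.run_id, token))

    return callback
```

A quality-check operator could call `build_clear_request` from its `execute` method when its condition is not met, so neither variant needs new Dag parameters or scheduler changes.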

So if the reporter is silent now, we might need to hear the original voice and see whether one of these options already solves the problem. Happy to be convinced.


Jens

On 08.04.26 22:12, Przemysław Mirowski wrote:

Hello,

I checked the discussion and I don't really see any real use case where that 
could potentially be needed. Tasks can currently pass data between 
their executions via XCom or other methods implemented in task logic, but 
this data should not change if the input (e.g. from upstream tasks) didn't 
change, so retrying on task level should be sufficient.

One user-side story I can picture is ML-style pipelines where a final 
validation or evaluation step fails and teams want a full rerun of the run 
instead of only retrying failed tasks.

IMHO, a failure within an ML pipeline would only require a retry on task level, 
as e.g. the models should be saved after training and used by other tasks. 
A potential issue I would see (within ML pipelines) is when the task itself 
fails and retrying the whole operation is expensive, but that part 
could be solved after AIP-103.

Maybe the only need for retrying everything (without thinking Airflow-specific) 
would be e.g. some time-series or streaming-related cases where, after a failure 
somewhere, the whole processing becomes invalid (basically operations where 
no process design is possible that would allow retrying only 
a part of it).

Do you feel this need in practice?/do you see it as something that belongs in 
core?

Not really, at least for now.

How do you work around it today?

Designing the processes in a way where only task-level retries are needed if a 
failure occurs.

Regards,
PM

________________________________
From: Yuseok Jo<[email protected]>
Sent: 07 April 2026 15:07
To:[email protected] <[email protected]>
Subject: [DISCUSS] Feedback on DAG-level full-run retries (issue 60866)

Hello community,

I would like to pick up discussion on GitHub issue 60866 about DAG-level
automatic retries or rerunning a whole DAG run from the start when a
terminal task fails or the DAG run ends in a certain state.
https://github.com/apache/airflow/issues/60866

I am not the person who originally opened that issue, and the original
author may not be active now. I am unsure whether this is a real gap for
users or something we should handle with patterns we already have.

One user-side story I can picture is ML-style pipelines where a final
validation or evaluation step fails and teams want a full rerun of the run
instead of only retrying failed tasks. This is just one possible scenario.
Other domains may have similar needs.

I am not proposing a core change yet. I mainly want light feedback on three
points.
Do you feel this need in practice?
How do you work around it today?
And do you see it as something that belongs in core?

Thanks,
Yuseok Jo
