Re: [DISCUSS] Infrastructure-Aware Task Execution and Resumable Operators - Proposals for Reliability

Christos Bisias Sat, 15 Nov 2025 06:28:15 -0800

Hi Stefan,

Thank you for the work! Very well organised and easy-to-follow docs.


I have been thinking about infrastructure retries for a while now. Also, I
had a few discussions at the Airflow Summit last month and I know that
others are interested as well.

It also looks to me like this will be split into multiple PRs, but if there
is a code POC, I would like to take a look.

Regards,
Christos

On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> wrote:

> Also something we discussed off-line: I think the scope of it is quite
> "huge" - but there are small, incremental improvements that might not
> even require an AIP and can be implemented as PRs. I think it's great to
> keep the "big hairy vision" in mind (like I did several years ago when I
> proposed a "small" improvement in our dependency management that took about
> 4 years to reach the stage I thought would take a few weeks).
>
> Getting incremental improvements in and showing dedication, merit and a
> consistent pattern of improvements is key to getting - eventually - big and
> "world-changing" changes.
>
> J.
>
>
> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]>
> wrote:
>
> > Hi Stefan,
> >
> > thanks for dropping the proposals!
> >
> > I'd propose to store the documents in cWiki and open them formally
> > there as AIP proposals, as that follows the AIP process.
> >
> > See
> >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> >
> > Jens
> >
> > On 11/14/25 12:35, Stefan Wang wrote:
> > > Hi Airflow Community,
> > >
> > > I'm excited to share two complementary proposals that address critical
> > > reliability challenges in Airflow, particularly around infrastructure
> > > disruptions and task resilience. These proposals build on insights from
> > > managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily
> > > task executions per cluster).
> > >
> > > Proposals
> > >
> > > 1. Infrastructure-Aware Task Execution and Context Propagation
> > >
> > > https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
> > >
> > > 2. Resumable Operators for Disruption Readiness
> > >
> > > https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
> > >
> > > What We're Solving
> > >
> > > * Infrastructure failures consume user retries - Pod evictions shouldn't
> > >   count against application retry budgets
> > > * Wasted computation - Worker crashes shouldn't restart healthy 3-hour
> > >   Databricks jobs from zero
> > >
> > > How
> > >
> > > * Execution Context: Distinguish infrastructure vs application failures
> > >   for smarter retry handling
> > > * Resumable Operators: Checkpoint and reconnect to external jobs after
> > >   disruptions (follows deferral pattern)
> > >
> > > These approaches have significantly improved reliability and user
> > > experience, and reduced wasted costs in our production environment.
> > >
> > > Looking forward to your feedback on both the problems we're addressing
> > > and the proposed solutions. Both proposals are fully backward compatible
> > > and follow existing Airflow patterns.
> > >
> > > Happy to answer any questions or dive deeper into implementation
> > > details.
> > >
> > > Best,
> > >
> > > Stefan Wang
> > >
> > >
> > >
