Re: [DISCUSS] Infrastructure-Aware Task Execution and Resumable Operators - Proposals for Reliability

Jarek Potiuk Fri, 14 Nov 2025 13:53:34 -0800

Also something we discussed offline: I think the scope of it is quite
"huge", but there are small, incremental improvements that might not
even require an AIP and can be implemented as PRs. I think it's great to
keep the "big hairy vision" in mind (like I did several years ago, when I
proposed a "small" improvement in our dependency management that took about
4 years to get to the stage I thought it would take a few weeks to reach).


Delivering incremental improvements and showing dedication, merit and a
consistent pattern of improvements is the key to getting - eventually - big
and "world-changing" changes done.

J.


On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]> wrote:

> Hi Stefan,
>
> thanks for dropping the proposals!
>
> I'd propose storing the documents in cWiki and opening them formally
> there as AIP proposals, as that follows the AIP process.
>
> See
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>
> Jens
>
> On 11/14/25 12:35, Stefan Wang wrote:
> > Hi Airflow Community,
> >
> > I'm excited to share two complementary proposals that address critical
> > reliability challenges in Airflow, particularly around infrastructure
> > disruptions and task resilience. These proposals build on insights from
> > managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily task
> > executions per cluster).
> >
> > Proposals
> >
> > 1. Infrastructure-Aware Task Execution and Context Propagation
> >
> > https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
> >
> > 2. Resumable Operators for Disruption Readiness
> >
> > https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
> >
> > What We're Solving
> >
> > - Infrastructure failures consume user retries: pod evictions shouldn't
> >   count against application retry budgets
> > - Wasted computation: worker crashes shouldn't restart healthy 3-hour
> >   Databricks jobs from zero
> >
> > How
> >
> > - Execution Context: distinguish infrastructure vs application failures
> >   for smarter retry handling
> > - Resumable Operators: checkpoint and reconnect to external jobs after
> >   disruptions (follows the deferral pattern)
> >
> > These approaches have significantly improved reliability and user
> > experience, and reduced wasted costs in our production environment.
> >
> > Looking forward to your feedback on both the problems we're addressing
> > and the proposed solutions. Both proposals are fully backward compatible
> > and follow existing Airflow patterns.
> >
> > Happy to answer any questions or dive deeper into implementation details.
> >
> > Best,
> >
> > Stefan Wang
> >
> >
> >
>
>
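The "Execution Context" idea in the quoted summary (infrastructure failures
should not count against the application retry budget) can be illustrated
with a minimal sketch. This is not code from the proposals or from Airflow:
FailureKind, RetryBudget, record_failure and both limit parameters are
hypothetical names, and the only assumption is that something upstream has
already classified each failure as infrastructure or application.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    """Hypothetical classification of why a task attempt failed."""
    INFRASTRUCTURE = "infrastructure"  # e.g. pod eviction, worker crash
    APPLICATION = "application"        # e.g. an error raised by the task's own logic


@dataclass
class RetryBudget:
    """Hypothetical per-task retry state with two independent counters."""
    max_app_retries: int
    max_infra_retries: int
    app_failures: int = 0
    infra_failures: int = 0

    def record_failure(self, kind: FailureKind) -> bool:
        """Record one failure and return True if the task may be retried."""
        if kind is FailureKind.INFRASTRUCTURE:
            # Infrastructure failures only consume the infrastructure budget.
            self.infra_failures += 1
            return self.infra_failures <= self.max_infra_retries
        # Application failures only consume the user-configured budget.
        self.app_failures += 1
        return self.app_failures <= self.max_app_retries


budget = RetryBudget(max_app_retries=2, max_infra_retries=5)

# Three evictions in a row: all retryable, application budget untouched.
infra_results = [budget.record_failure(FailureKind.INFRASTRUCTURE) for _ in range(3)]

# Three application errors: the third one exceeds the budget of 2 retries.
app_results = [budget.record_failure(FailureKind.APPLICATION) for _ in range(3)]
```

With a single shared counter, the three evictions above would already have
used up the two user retries before the task's own code failed once; keeping
the counters separate is what the "smarter retry handling" bullet asks for.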
