Hi Stefan,

Thank you for the work! Very well organised and easy-to-follow docs.
I have been thinking about infrastructure retries for a while now. Also, I had a few discussions at the Airflow Summit last month and I know that others are interested as well. It also looks to me like this will be split into multiple PRs, but if there is a code POC, I would like to take a look.

Regards,
Christos

On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> wrote:

> Also something we discussed off-line: I think the scope of it is quite "huge" - but there are small and incremental improvements that might not even require an AIP and that can be implemented as PRs. I think it's great to keep the "big hairy vision" in your head (like I did several years ago when I proposed a "small" improvement in our dependency management that took about 4 years to get to the stage I thought it would take a few weeks to reach).
>
> Getting incremental improvements in and showing dedication, merit and a consistent pattern of improvements is key to getting - eventually - big and "world-changing" changes.
>
> J.
>
> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]> wrote:
>
> > Hi Stefan,
> >
> > thanks for dropping the proposals!
> >
> > I'd propose to store the documents in cWiki and open them formally there as AIP proposals, as that follows the AIP process.
> >
> > See
> >
> > https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> >
> > Jens
> >
> > On 11/14/25 12:35, Stefan Wang wrote:
> > > Hi Airflow Community,
> > >
> > > I'm excited to share two complementary proposals that address critical reliability challenges in Airflow, particularly around infrastructure disruptions and task resilience. These proposals build on insights from managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily task executions per cluster).
> > >
> > > Proposals
> > >
> > > 1. Infrastructure-Aware Task Execution and Context Propagation
> > >
> > > https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
> > >
> > > 2. Resumable Operators for Disruption Readiness
> > >
> > > https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
> > >
> > > What We're Solving
> > >
> > > - Infrastructure failures consume user retries - Pod evictions shouldn't count against application retry budgets
> > > - Wasted computation - Worker crashes shouldn't restart healthy 3-hour Databricks jobs from zero
> > >
> > > How
> > >
> > > - Execution Context: Distinguish infrastructure vs application failures for smarter retry handling
> > > - Resumable Operators: Checkpoint and reconnect to external jobs after disruptions (follows deferral pattern)
> > >
> > > These approaches have significantly improved reliability and user experience, and reduced wasted costs in our production environment.
> > >
> > > Looking forward to your feedback on both the problems we're addressing and the proposed solutions. Both proposals are fully backward compatible and follow existing Airflow patterns.
> > >
> > > Happy to answer any questions or dive deeper into implementation details.
> > >
> > > Best,
> > >
> > > Stefan Wang
