Also, something we discussed offline: I think the scope of this is quite "huge", but there are small, incremental improvements that might not even require an AIP and can be implemented as PRs. I think it's great to keep the "big hairy vision" in mind (like I did several years ago when I proposed a "small" improvement in our dependency management that took about 4 years to get to the stage I thought it would take a few weeks to reach).
Getting incremental improvements in, and showing dedication, merit, and a consistent pattern of improvements, is key to eventually getting big, "world-changing" changes in.

J.

On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]> wrote:

> Hi Stefan,
>
> thanks for dropping the proposals!
>
> I'd propose to store the documents in cWiki and open them formally in
> there as an AIP proposal, as then it is following the AIP process.
>
> See
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>
> Jens
>
> On 11/14/25 12:35, Stefan Wang wrote:
> > Hi Airflow Community,
> >
> > I'm excited to share two complementary proposals that address critical
> > reliability challenges in Airflow, particularly around infrastructure
> > disruptions and task resilience. These proposals build on insights from
> > managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily
> > task executions per cluster).
> >
> > Proposals
> >
> > 1. Infrastructure-Aware Task Execution and Context Propagation
> >
> > https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
> >
> > 2. Resumable Operators for Disruption Readiness
> >
> > https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
> >
> > What We're Solving
> >
> > Infrastructure failures consume user retries - Pod evictions shouldn't
> > count against application retry budgets
> > Wasted computation - Worker crashes shouldn't restart healthy 3-hour
> > Databricks jobs from zero
> >
> > How
> >
> > Execution Context: Distinguish infrastructure vs application failures
> > for smarter retry handling
> > Resumable Operators: Checkpoint and reconnect to external jobs after
> > disruptions (follows deferral pattern)
> >
> > These approaches have significantly improved reliability and user
> > experience, and reduced wasted costs in our production environment.
> >
> > Looking forward to your feedback on both the problems we're addressing
> > and the proposed solutions. Both proposals are fully backward compatible
> > and follow existing Airflow patterns.
> >
> > Happy to answer any questions or dive deeper into implementation details.
> >
> > Best,
> >
> > Stefan Wang
