Hi Jarek, Maciej, and everyone, Thank you all for the incredibly thoughtful analysis! After diving deep into the source code and carefully comparing the different proposals, I want to understand a bit more about what problems we are solving, and share my learnings on this topic. (btw I also wanted to give a quick acknowledge that the ideas in my proposals incorporated valuable insights from offline conversations with Ping Zhang and Howie Wang from OpenAI's Airflow team)
Problem Space: After reviewing the current airflow src code and proposals in detail, I realize these aren't exactly solving the same problem. Let me break down what I see: Category 1: Scoped Intra-Task-Instance State (Retries within the SAME task instance) XD's AIP draft <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333>: Preserve external job (ID) across retry to avoid duplicate submissions (via XCOM) Resumable Operators <https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI>: Preserve external job (ID) during infra disruptions (with context-aware cleanup) via TaskInstance’s next_method and next_method_kwargs (and introduce a RESUMING state to leverage the same “re-launch” logic from Deferrable - this is an existing persistence mechanism) Category 2: Cross-Run State (DIFFERENT DAG runs) Asset Watermarks: Triggers need "last processed" state across multiple DAG runs (e.g., "last S3 file timestamp") This to me looks different—it's not about retries within a task instance, but incremental processing across separate executions. (pls correct me if this is wrong) Category 3: Post-Execution Async State OpenLineage use case: Store lineage data for async emission after task completion (pls correct me if this is wrong) So rather than claiming to drive "all state persistence”, would it be better for me to volunteer to drive specifically - Intra-Task-Instance State Persistence (Retry Scope) story, potentially jointly with XD And looks like Jake and Guangyang are already driving the cross-run state (asset watermarks) story, which I agree is the right separation of concerns. Very open to feedback here and would love to hear thoughts on this. Thanks a lot folks! Best, Stefan > On Dec 2, 2025, at 4:40 AM, Jarek Potiuk <[email protected]> wrote: > > So looks like we have MORE people who would like to join the efforts :D > > On Tue, Dec 2, 2025 at 1:35 PM Maciej Obuchowski <[email protected]> > wrote: > >> Just to add to the pile of use cases: >> that mechanism would also be useful for listeners/OpenLineage integration, >> to store the necessary lineage data post-execution, to be able to send the >> OpenLineage events asynchronously, rather than running on >> worker and blocking execution slot. >> >> Thanks, >> Maciej >> >> wt., 2 gru 2025 o 10:45 Jarek Potiuk <[email protected]> napisał(a): >> >>> One comment here. I looked yesterday again at your proposals, and they >> are >>> really well thought out. >>> One thing however that I see in it is something of a recurring pattern we >>> have in many discussions: >>> >>> *Storing state in Airflow* >>> >>> This has been discussed in a number of discussions in the past (recent >> and >>> not-so-recent). I tried to put them together here (in reverse >> chronological >>> order): >>> >>> * XD's discussion: `Add "persist_xcom_through_retry" Parameter to Airflow >>> Operators` here >>> https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky >>> * Your proposal here - partially - Infrastructure-Aware Task Execution >> and >>> Resumable Operators >>> * Jake and Guangyang Li - [WIP] AIP-93 Asset Watermarks and State >>> Variables >>> https://lists.apache.org/thread/vftpzrwb34xr2xbfsx7qtbxn5w6h3f2b >>> * Daniels old "State Persistence" AIP -> >>> >>> >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence >>> >>> Likely more. >>> >>> I think it's fairly clear that we need State persistence. And there are >>> various way people wanted to address it: >>> >>> * XD's proposal was to piggyback on Xcoms and add options to not delete >>> them on resume >>> * Jake and Guangyang - proposed State Variables that would be bound with >>> Assets >>> * Daniel proposed a broader AIP that solves persistence need potentially >> on >>> various levels (task, dag variable, etc. ) - with proposal to use >> separate >>> ProcessState, TaskState, and TaskInstanceState (solutions 3, 5 and 6). >> Also >>> probably now that would extend to AssetState if it is followed >>> >>> Maybe it's a good time to join the efforts and propose a single solution >>> that can help to address all those "state persistence" needs ? >>> >>> I think we have now enough concrete use cases - from the above proposals >>> and probably more, to make a single proposal that will be usable to >> address >>> all of the needs. We have a number of smart people who - if they discuss >>> and work together on a single solution, might likely come to a good >>> proposal **just** on state persistence that will be usable for all those >>> cases ? >>> >>> If you were to break your proposals Stefan into smaller pieces and >>> incremental deliverables, I would say - getting this one done is not only >>> moving your ideas forward, but also it moves many other ideas forward >> that >>> could be implemented in parallel as next step after this "foundational" >>> state persistence is added with some very simple use case to start with. >>> That would make it perfect approach - band together to make a >> foundational >>> feature, so that then you can split off and work on all those other ideas >>> in parallel. >>> >>> We just need someone to volunteer and lead the efforts - and others here >> to >>> join and do the work together. >>> >>> J. >>> >>> >>> On Tue, Dec 2, 2025 at 9:49 AM Stefan Wang <[email protected]> wrote: >>> >>>> Re: https://lists.apache.org/thread/jk1wkt1wh0lm2ovlldnfcpbzr3brxsy1 >>>> >>>> Thank you Jarek for the thoughtful guidance — I really appreciate you >>>> taking the time to guide me through this. Totally agree with your >> advice >>>> about starting small and building things incrementally, and I'll keep >> it >>> in >>>> mind throughout this effort. >>>> >>>> The proposals aims to address shared reliability challenges that have >>> been >>>> seeing across medium to large scale Airflow deployments in the >> community >>>> (ref: OpenAI 2025 Airflow Summit Talk < >>>> https://airflowsummit.org/sessions/2025/airflow-openai/> (Reliability >>>> Section), LinkedIn (here in this thread), and Apple with Xiaodong's >>>> thread/AIP < >>>> https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky> >>>> (specifically External Job Tracking and Polling) - I’ll follow up in >>> there >>>> as well to collaborate): >>>> >>>> Better Context propagation and Infra Retry budget: Help distinguish >>>> infrastructure failures (pod evictions, worker crashes) from >> application >>>> errors for smarter cleanup decisions and protected user retry budgets - >>> we >>>> already have access to the SOT context - just need to propagate it >> better >>>> in the existing ecosystem (through passing additional optional msg or >>>> exception handling, or something else) >>>> >>>> Resumable Operators (in parallel with Deferrable Operators): Let >>> operators >>>> reconnect to healthy external jobs (Databricks, EMR) after worker >>>> disruptions instead of wastefully restarting >>>> >>>> Both are designed to be completely backward compatible, opt-in only, >> and >>>> designed with specific leverage on existing well-established Airflow >>>> features, hooks, and patterns (deferral mechanism, execution context). >>>> >>>> Rather than pushing for big changes upfront in one go, throughout this >>>> effort, things will be broken into small, incremental pieces that each >>>> provide standalone value. Start with the tiniest possible change (e.g., >>>> optional execution_context parameter — purely additive). Continue >>>> contributing in other areas especially reliability-related, to maintain >>>> consistency and trust. Keep the broader vision in the design proposal, >>> but >>>> let the implementation evolve based on community feedback. >>>> >>>> I want to make sure this is done in a way that's most beneficial to the >>>> community. Guidance and support from you and others in the community >>>> overall will help us a lot in approaching this the right way. Thank >> you! >>>> >>>> Best, >>>> Stefan >>>> >>>> >>>>> On Dec 2, 2025, at 12:24 AM, Stefan Wang <[email protected]> wrote: >>>>> >>>>> Hi Jens, >>>>> >>>>> Thank you so much for the help and for being so supportive — it’s >>>> working for me now! >>>>> >>>>> Really appreciate you stepping in. >>>>> >>>>> Best, >>>>> Stefan >>>>> >>>>> >>>>>> On Nov 30, 2025, at 12:27 AM, Jens Scheffler <[email protected]> >>>> wrote: >>>>>> >>>>>> As PMC we are space owners, added your permissions for the user >>>> stefwang to the Airflow space. Hope now it is working. >>>>>> >>>>>> On 11/30/25 04:54, Stefan Wang wrote: >>>>>>> Apologies for the late response folks while I had oncall shifts. >>>> Catching up here and will respond to each comment in order. >>>>>>> >>>>>>> >>>>>>> >>>>>>> — >>>>>>> >>>>>>> >>>>>>> >>>>>>> Re: >> https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4 >>>> from Jens Scheffler >>>>>>> >>>>>>> Hi Jens, >>>>>>> >>>>>>> Thanks for the suggestion! I completely agree that following the >>>> formal AIP process is the right approach. >>>>>>> >>>>>>> I've been trying to create the AIPs on the Confluence wiki, but I'm >>>> running into permission issues. When I click the "Create new AIP" >> button >>> on >>>> the AIP page, I get a "Sorry, you don't have permission to create >>> content" >>>> error. >>>>>>> >>>>>>> I've tried following the exact step listed to create ASF confluence >>>> account however neither has EDIT access granted under the AIRFLOW >> space, >>>> created two accounts (stewang and stefwang) to rule out any >>>> account-specific issues, but both accounts have the same problem. Would >>>> really appreciate some expertise in this area to help point me to who >> we >>>> should contact to get the appropriate permissions, or is there a >> specific >>>> access request process I should follow? - Or if someone else with edit >>>> access could help copy paste the google doc content into Confluence for >>>> comments, thanks a lot! >>>>>>> >>>>>>> I’ll try to contact ASF infra support in the mean time, and will >> work >>>> on migrate the Google Docs to Confluence once I have access. >>>>>>> >>>>>>> Thanks, Stefan >>>>>>> >>>>>>> >>>>>>>> On Nov 15, 2025, at 6:27 AM, Christos Bisias < >> [email protected] >>>> >>>> wrote: >>>>>>>> >>>>>>>> Hi Stefan, >>>>>>>> >>>>>>>> Thank you for the work! Very well organised and easy to follow >> docs. >>>>>>>> >>>>>>>> I have been thinking about infrastructure retries for a while now. >>>> Also, I >>>>>>>> had a few discussions at the Airflow Summit last month and I know >>> that >>>>>>>> others are interested as well. >>>>>>>> >>>>>>>> It looks to me too, that this will be split into multiple PRs but >> if >>>> there >>>>>>>> is a code POC, I would like to take a look. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Christos >>>>>>>> >>>>>>>> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> >>>> wrote: >>>>>>>> >>>>>>>>> Also something we discussed off-line: I think the scope of it is >>>> quite >>>>>>>>> "huge" - but there are small and incremental improvements, that >>>> might not >>>>>>>>> even require AIP that can be implemented as PRs., I think it's >>> great >>>> to >>>>>>>>> keep "big hairy vision" in head (like I did several years ago >> when >>> I >>>>>>>>> proposed a "small" improvement in our dependency management that >>>> took about >>>>>>>>> 4 years to get to the stage I thought it would take a few weeks. >>>>>>>>> >>>>>>>>> Getting incremental improvements and showing the dedication, >> merit >>>> and >>>>>>>>> consistent pattern of improvements is a key to get - eventually - >>>> big and >>>>>>>>> "world-changing" changes. >>>>>>>>> >>>>>>>>> J. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler < >>> [email protected] >>>>> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi Stefan, >>>>>>>>>> >>>>>>>>>> thanks for dropping the proposals! >>>>>>>>>> >>>>>>>>>> I'd propose to store the documents in cWiki and open them >> formally >>>> in >>>>>>>>>> there as AIP proposal as then it is sollowing the AIP process. >>>>>>>>>> >>>>>>>>>> See >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>> >>> >> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals >>>>>>>>>> Jens >>>>>>>>>> >>>>>>>>>> On 11/14/25 12:35, Stefan Wang wrote: >>>>>>>>>>> Hi Airflow Community, >>>>>>>>>>> >>>>>>>>>>> I'm excited to share two complementary proposals that address >>>> critical >>>>>>>>>> reliability challenges in Airflow, particularly around >>>> infrastructure >>>>>>>>>> disruptions and task resilience. These proposals build on >> insights >>>> from >>>>>>>>>> managing one of the larger Airflow deployments (20k+ DAGs, 100k+ >>>> daily >>>>>>>>> task >>>>>>>>>> executions per cluster). >>>>>>>>>>> Proposals >>>>>>>>>>> >>>>>>>>>>> 1. Infrastructure-Aware Task Execution and Context Propagation >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>> >>> >> https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M >>>>>>>>>>> 2. Resumable Operators for Disruption Readiness >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>> >>> >> https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI >>>>>>>>>>> What We're Solving >>>>>>>>>>> >>>>>>>>>>> Infrastructure failures consume user retries - Pod evictions >>>> shouldn't >>>>>>>>>> count against application retry budgets >>>>>>>>>>> Wasted computation - Worker crashes shouldn't restart healthy >>>> 3-hour >>>>>>>>>> Databricks jobs from zero >>>>>>>>>>> How >>>>>>>>>>> >>>>>>>>>>> Execution Context: Distinguish infrastructure vs application >>>> failures >>>>>>>>>> for smarter retry handling >>>>>>>>>>> Resumable Operators: Checkpoint and reconnect to external jobs >>>> after >>>>>>>>>> disruptions (follows deferral pattern) >>>>>>>>>>> These approaches have significantly improved reliability and >> user >>>>>>>>>> experience, and reduced wasted costs in our production >>> environment. >>>>>>>>>>> Looking forward to your feedback on both the problems we're >>>> addressing >>>>>>>>>> and the proposed solutions. Both proposals are fully backward >>>> compatible >>>>>>>>>> and follow existing Airflow patterns. >>>>>>>>>>> Happy to answer any questions or dive deeper into >> implementation >>>>>>>>> details. >>>>>>>>>>> Best, >>>>>>>>>>> >>>>>>>>>>> Stefan Wang >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>> --------------------------------------------------------------------- >>>>>>>>>> To unsubscribe, e-mail: [email protected] >>>>>>>>>> For additional commands, e-mail: [email protected] >>>>>>>>>> >>>>>>>>>> >>>>>>> >>>>>> >>>>>> >> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: [email protected] >>>>>> For additional commands, e-mail: [email protected] >>>>>> >>>>> >>>> >>>> >>> >>
