Just to add to the pile of use cases: that mechanism would also be useful for the listeners/OpenLineage integration. It could store the necessary lineage data post-execution, so that OpenLineage events can be sent asynchronously rather than running on the worker and blocking an execution slot.

Thanks,
Maciej

On Tue, 2 Dec 2025 at 10:45, Jarek Potiuk <[email protected]> wrote:

> One comment here. I looked again yesterday at your proposals, and they are really well thought out.
>
> One thing, however, that I see in them is something of a recurring pattern we have in many discussions:
>
> *Storing state in Airflow*
>
> This has come up in a number of discussions in the past (recent and not-so-recent). I tried to put them together here (in reverse chronological order):
>
> * XD's discussion: `Add "persist_xcom_through_retry" Parameter to Airflow Operators` here https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky
> * Your proposal here - partially - Infrastructure-Aware Task Execution and Resumable Operators
> * Jake and Guangyang Li - [WIP] AIP-93 Asset Watermarks and State Variables https://lists.apache.org/thread/vftpzrwb34xr2xbfsx7qtbxn5w6h3f2b
> * Daniel's old "State Persistence" AIP -> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence
>
> Likely more.
>
> I think it's fairly clear that we need state persistence. And there are various ways people wanted to address it:
>
> * XD's proposal was to piggyback on XComs and add options to not delete them on resume
> * Jake and Guangyang proposed State Variables that would be bound to Assets
> * Daniel proposed a broader AIP that solves the persistence need potentially on various levels (task, dag variable, etc.), with a proposal to use separate ProcessState, TaskState, and TaskInstanceState (solutions 3, 5 and 6). That would probably now also extend to AssetState if it is followed
>
> Maybe it's a good time to join efforts and propose a single solution that can help to address all those "state persistence" needs?
>
> I think we now have enough concrete use cases, from the above proposals and probably more, to make a single proposal that will be usable to address all of the needs. We have a number of smart people who, if they discuss and work together on a single solution, are likely to come to a good proposal **just** on state persistence that will be usable for all those cases.
>
> If you were to break your proposals, Stefan, into smaller pieces and incremental deliverables, I would say getting this one done not only moves your ideas forward, but also moves many other ideas forward that could be implemented in parallel as a next step after this "foundational" state persistence is added, with some very simple use case to start with. That would make it a perfect approach: band together to make a foundational feature, so that then you can split off and work on all those other ideas in parallel.
>
> We just need someone to volunteer and lead the efforts, and others here to join and do the work together.
>
> J.
>
> On Tue, Dec 2, 2025 at 9:49 AM Stefan Wang <[email protected]> wrote:
>
> > Re: https://lists.apache.org/thread/jk1wkt1wh0lm2ovlldnfcpbzr3brxsy1
> >
> > Thank you Jarek for the thoughtful guidance. I really appreciate you taking the time to guide me through this. I totally agree with your advice about starting small and building things incrementally, and I'll keep it in mind throughout this effort.
> >
> > The proposals aim to address shared reliability challenges that we have been seeing across medium to large scale Airflow deployments in the community (ref: OpenAI 2025 Airflow Summit Talk <https://airflowsummit.org/sessions/2025/airflow-openai/> (Reliability Section), LinkedIn (here in this thread), and Apple with Xiaodong's thread/AIP <https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky> (specifically External Job Tracking and Polling); I'll follow up there as well to collaborate):
> >
> > Better Context Propagation and Infra Retry Budget: Help distinguish infrastructure failures (pod evictions, worker crashes) from application errors for smarter cleanup decisions and protected user retry budgets. We already have access to the SOT context; we just need to propagate it better in the existing ecosystem (by passing an additional optional message, through exception handling, or something else).
> >
> > Resumable Operators (in parallel with Deferrable Operators): Let operators reconnect to healthy external jobs (Databricks, EMR) after worker disruptions instead of wastefully restarting them.
> >
> > Both are designed to be completely backward compatible and opt-in only, and to build on existing well-established Airflow features, hooks, and patterns (deferral mechanism, execution context).
> >
> > Rather than pushing for big changes upfront in one go, throughout this effort things will be broken into small, incremental pieces that each provide standalone value. Start with the tiniest possible change (e.g., an optional execution_context parameter, purely additive). Continue contributing in other areas, especially reliability-related ones, to maintain consistency and trust. Keep the broader vision in the design proposal, but let the implementation evolve based on community feedback.
> >
> > I want to make sure this is done in a way that's most beneficial to the community. Guidance and support from you and others in the community will help us a lot in approaching this the right way. Thank you!
> >
> > Best,
> > Stefan
> >
> > > On Dec 2, 2025, at 12:24 AM, Stefan Wang <[email protected]> wrote:
> > >
> > > Hi Jens,
> > >
> > > Thank you so much for the help and for being so supportive. It's working for me now!
> > >
> > > Really appreciate you stepping in.
> > >
> > > Best,
> > > Stefan
> > >
> > >> On Nov 30, 2025, at 12:27 AM, Jens Scheffler <[email protected]> wrote:
> > >>
> > >> As PMC members we are space owners; I added permissions for the user stefwang to the Airflow space. I hope it is working now.
> > >>
> > >> On 11/30/25 04:54, Stefan Wang wrote:
> > >>> Apologies for the late response, folks; I had on-call shifts. Catching up here and will respond to each comment in order.
> > >>>
> > >>> ---
> > >>>
> > >>> Re: https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4 from Jens Scheffler
> > >>>
> > >>> Hi Jens,
> > >>>
> > >>> Thanks for the suggestion! I completely agree that following the formal AIP process is the right approach.
> > >>>
> > >>> I've been trying to create the AIPs on the Confluence wiki, but I'm running into permission issues. When I click the "Create new AIP" button on the AIP page, I get a "Sorry, you don't have permission to create content" error.
> > >>>
> > >>> I've tried following the exact steps listed to create an ASF Confluence account, and created two accounts (stewang and stefwang) to rule out any account-specific issues, but neither has EDIT access granted under the AIRFLOW space. I would really appreciate it if someone with expertise in this area could point me to who we should contact to get the appropriate permissions, or is there a specific access request process I should follow? Alternatively, if someone else with edit access could help copy and paste the Google Doc content into Confluence for comments, thanks a lot!
> > >>>
> > >>> I'll try to contact ASF infra support in the meantime, and will work on migrating the Google Docs to Confluence once I have access.
> > >>>
> > >>> Thanks, Stefan
> > >>>
> > >>>> On Nov 15, 2025, at 6:27 AM, Christos Bisias <[email protected]> wrote:
> > >>>>
> > >>>> Hi Stefan,
> > >>>>
> > >>>> Thank you for the work! Very well organised and easy to follow docs.
> > >>>>
> > >>>> I have been thinking about infrastructure retries for a while now. Also, I had a few discussions at the Airflow Summit last month and I know that others are interested as well.
> > >>>>
> > >>>> It looks to me, too, that this will be split into multiple PRs, but if there is a code POC, I would like to take a look.
> > >>>>
> > >>>> Regards,
> > >>>> Christos
> > >>>>
> > >>>> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> wrote:
> > >>>>
> > >>>>> Also something we discussed off-line: I think the scope of it is quite "huge", but there are small and incremental improvements that might not even require an AIP and can be implemented as PRs. I think it's great to keep the "big hairy vision" in your head (like I did several years ago when I proposed a "small" improvement in our dependency management that took about 4 years to get to the stage I thought it would take a few weeks to reach).
> > >>>>>
> > >>>>> Getting incremental improvements in and showing the dedication, merit and consistent pattern of improvements is key to eventually getting big and "world-changing" changes.
> > >>>>>
> > >>>>> J.
> > >>>>>
> > >>>>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]> wrote:
> > >>>>>
> > >>>>>> Hi Stefan,
> > >>>>>>
> > >>>>>> thanks for dropping the proposals!
> > >>>>>>
> > >>>>>> I'd propose to store the documents in cWiki and open them formally there as AIP proposals, as that follows the AIP process.
> > >>>>>>
> > >>>>>> See https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> > >>>>>>
> > >>>>>> Jens
> > >>>>>>
> > >>>>>> On 11/14/25 12:35, Stefan Wang wrote:
> > >>>>>>> Hi Airflow Community,
> > >>>>>>>
> > >>>>>>> I'm excited to share two complementary proposals that address critical reliability challenges in Airflow, particularly around infrastructure disruptions and task resilience. These proposals build on insights from managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily task executions per cluster).
> > >>>>>>>
> > >>>>>>> Proposals
> > >>>>>>>
> > >>>>>>> 1. Infrastructure-Aware Task Execution and Context Propagation
> > >>>>>>> https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
> > >>>>>>>
> > >>>>>>> 2. Resumable Operators for Disruption Readiness
> > >>>>>>> https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
> > >>>>>>>
> > >>>>>>> What We're Solving
> > >>>>>>>
> > >>>>>>> Infrastructure failures consume user retries - Pod evictions shouldn't count against application retry budgets
> > >>>>>>> Wasted computation - Worker crashes shouldn't restart healthy 3-hour Databricks jobs from zero
> > >>>>>>>
> > >>>>>>> How
> > >>>>>>>
> > >>>>>>> Execution Context: Distinguish infrastructure vs application failures for smarter retry handling
> > >>>>>>> Resumable Operators: Checkpoint and reconnect to external jobs after disruptions (follows deferral pattern)
> > >>>>>>>
> > >>>>>>> These approaches have significantly improved reliability and user experience, and reduced wasted costs in our production environment.
> > >>>>>>>
> > >>>>>>> Looking forward to your feedback on both the problems we're addressing and the proposed solutions. Both proposals are fully backward compatible and follow existing Airflow patterns.
> > >>>>>>>
> > >>>>>>> Happy to answer any questions or dive deeper into implementation details.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>>
> > >>>>>>> Stefan Wang
> > >>>>>>
> > >>>>>> ---------------------------------------------------------------------
> > >>>>>> To unsubscribe, e-mail: [email protected]
> > >>>>>> For additional commands, e-mail: [email protected]
