Just to add to the pile of use cases:
that mechanism would also be useful for the listeners/OpenLineage integration.
It could store the necessary lineage data post-execution, so that the
OpenLineage events can be sent asynchronously rather than running on the
worker and blocking an execution slot.

Thanks,
Maciej

On Tue, 2 Dec 2025 at 10:45, Jarek Potiuk <[email protected]> wrote:

> One comment here. I looked yesterday again at your proposals, and they are
> really well thought out.
> One thing however that I see in it is something of a recurring pattern we
> have in many discussions:
>
> *Storing state in Airflow*
>
> This has come up in a number of threads in the past (recent and
> not-so-recent). I tried to put them together here (in reverse chronological
> order):
>
> * XD's discussion: `Add "persist_xcom_through_retry" Parameter to Airflow
> Operators` here
> https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky
> * Your proposal here (partially): Infrastructure-Aware Task Execution and
> Resumable Operators
> * Jake and Guangyang Li: [WIP] AIP-93 Asset Watermarks and State
> Variables
> https://lists.apache.org/thread/vftpzrwb34xr2xbfsx7qtbxn5w6h3f2b
> * Daniel's old "State Persistence" AIP:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence
>
> Likely more.
>
> I think it's fairly clear that we need state persistence. And there are
> various ways people have wanted to address it:
>
> * XD's proposal was to piggyback on XComs and add options to not delete
> them on resume
> * Jake and Guangyang proposed State Variables that would be bound to
> Assets
> * Daniel proposed a broader AIP that addresses persistence needs potentially
> at various levels (task, DAG, variable, etc.), with a proposal to use separate
> ProcessState, TaskState, and TaskInstanceState (solutions 3, 5 and 6). If
> that approach is followed, it would probably now extend to AssetState as well
>
> Maybe it's a good time to join the efforts and propose a single solution
> that can help to address all those "state persistence" needs?
>
> I think we now have enough concrete use cases, from the above proposals
> and probably more, to make a single proposal that will be usable to address
> all of the needs. We have a number of smart people who, if they discuss
> and work together on a single solution, are likely to come to a good
> proposal **just** on state persistence that will be usable for all those
> cases.
>
> If you were to break your proposals, Stefan, into smaller pieces and
> incremental deliverables, I would say getting this one done not only
> moves your ideas forward, but also moves forward many other ideas that
> could be implemented in parallel as a next step, once this "foundational"
> state persistence is added with some very simple use case to start with.
> That would make it a perfect approach: band together to build a foundational
> feature, so that you can then split off and work on all those other ideas
> in parallel.
>
> We just need someone to volunteer and lead the efforts - and others here to
> join and do the work together.
>
> J.
>
>
> On Tue, Dec 2, 2025 at 9:49 AM Stefan Wang <[email protected]> wrote:
>
> > Re: https://lists.apache.org/thread/jk1wkt1wh0lm2ovlldnfcpbzr3brxsy1
> >
> > Thank you Jarek for the thoughtful guidance — I really appreciate you
> > taking the time to guide me through this. Totally agree with your advice
> > about starting small and building things incrementally, and I'll keep it
> > in mind throughout this effort.
> >
> > The proposals aim to address shared reliability challenges that have been
> > seen across medium- to large-scale Airflow deployments in the community
> > (ref: OpenAI 2025 Airflow Summit Talk <
> > https://airflowsummit.org/sessions/2025/airflow-openai/> (Reliability
> > Section), LinkedIn (here in this thread), and Apple with Xiaodong's
> > thread/AIP <
> > https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky>
> > (specifically External Job Tracking and Polling); I’ll follow up in
> > there as well to collaborate):
> >
> > Better Context propagation and Infra Retry budget: Help distinguish
> > infrastructure failures (pod evictions, worker crashes) from application
> > errors for smarter cleanup decisions and protected user retry budgets. We
> > already have access to the SOT context and just need to propagate it better
> > in the existing ecosystem (through passing an additional optional msg,
> > exception handling, or something else).
> >
> > Resumable Operators (in parallel with Deferrable Operators): Let operators
> > reconnect to healthy external jobs (Databricks, EMR) after worker
> > disruptions instead of wastefully restarting them.
> >
> > Both are designed to be completely backward compatible and opt-in only, and
> > both build specifically on existing, well-established Airflow features,
> > hooks, and patterns (deferral mechanism, execution context).
> >
> > Rather than pushing for big changes upfront in one go, throughout this
> > effort, things will be broken into small, incremental pieces that each
> > provide standalone value. Start with the tiniest possible change (e.g.,
> > optional execution_context parameter — purely additive). Continue
> > contributing in other areas especially reliability-related, to maintain
> > consistency and trust. Keep the broader vision in the design proposal,
> > but let the implementation evolve based on community feedback.
> >
> > I want to make sure this is done in a way that's most beneficial to the
> > community. Guidance and support from you and others in the community
> > overall will help us a lot in approaching this the right way. Thank you!
> >
> > Best,
> > Stefan
> >
> >
> > > On Dec 2, 2025, at 12:24 AM, Stefan Wang <[email protected]> wrote:
> > >
> > > Hi Jens,
> > >
> > > Thank you so much for the help and for being so supportive; it’s
> > > working for me now!
> > >
> > > Really appreciate you stepping in.
> > >
> > > Best,
> > > Stefan
> > >
> > >
> > >> On Nov 30, 2025, at 12:27 AM, Jens Scheffler <[email protected]> wrote:
> > >>
> > >> As PMC we are space owners; permissions for the user stefwang have
> > >> been added to the Airflow space. Hope it is working now.
> > >>
> > >> On 11/30/25 04:54, Stefan Wang wrote:
> > >>> Apologies for the late response, folks; I had on-call shifts.
> > >>> Catching up here and will respond to each comment in order.
> > >>>
> > >>>
> > >>>
> > >>> —
> > >>>
> > >>>
> > >>>
> > >>> Re: https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4
> > >>> from Jens Scheffler
> > >>>
> > >>> Hi Jens,
> > >>>
> > >>> Thanks for the suggestion! I completely agree that following the formal
> > >>> AIP process is the right approach.
> > >>>
> > >>> I've been trying to create the AIPs on the Confluence wiki, but I'm
> > >>> running into permission issues. When I click the "Create new AIP" button
> > >>> on the AIP page, I get a "Sorry, you don't have permission to create
> > >>> content" error.
> > >>>
> > >>> I followed the exact steps listed for creating an ASF Confluence account
> > >>> and created two accounts (stewang and stefwang) to rule out any
> > >>> account-specific issues, but neither has EDIT access granted under the
> > >>> AIRFLOW space. I would really appreciate a pointer to whom we should
> > >>> contact to get the appropriate permissions, or to any specific access
> > >>> request process I should follow. Alternatively, if someone with edit
> > >>> access could help copy the Google Doc content into Confluence for
> > >>> comments, that would be great. Thanks a lot!
> > >>>
> > >>> I’ll try to contact ASF infra support in the meantime, and will work on
> > >>> migrating the Google Docs to Confluence once I have access.
> > >>>
> > >>> Thanks, Stefan
> > >>>
> > >>>
> > >>>> On Nov 15, 2025, at 6:27 AM, Christos Bisias <[email protected]> wrote:
> > >>>>
> > >>>> Hi Stefan,
> > >>>>
> > >>>> Thank you for the work! Very well organised and easy to follow docs.
> > >>>>
> > >>>> I have been thinking about infrastructure retries for a while now. Also,
> > >>>> I had a few discussions at the Airflow Summit last month and I know that
> > >>>> others are interested as well.
> > >>>>
> > >>>> It looks to me, too, that this will be split into multiple PRs, but if
> > >>>> there is a code POC, I would like to take a look.
> > >>>>
> > >>>> Regards,
> > >>>> Christos
> > >>>>
> > >>>> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> wrote:
> > >>>>
> > >>>>> Also something we discussed off-line: I think the scope of it is quite
> > >>>>> "huge", but there are small and incremental improvements that might not
> > >>>>> even require an AIP and can be implemented as PRs. I think it's great
> > >>>>> to keep the "big hairy vision" in your head (like I did several years ago
> > >>>>> when I proposed a "small" improvement in our dependency management that
> > >>>>> I thought would take a few weeks and that took about 4 years to get to
> > >>>>> that stage).
> > >>>>>
> > >>>>> Getting incremental improvements in and showing dedication, merit, and
> > >>>>> a consistent pattern of improvements is key to eventually landing big
> > >>>>> and "world-changing" changes.
> > >>>>>
> > >>>>> J.
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi Stefan,
> > >>>>>>
> > >>>>>> thanks for dropping the proposals!
> > >>>>>>
> > >>>>>> I'd propose to store the documents in cWiki and open them formally
> > >>>>>> there as AIP proposals, as that follows the AIP process.
> > >>>>>>
> > >>>>>> See
> > >>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> > >>>>>>
> > >>>>>> Jens
> > >>>>>>
> > >>>>>> On 11/14/25 12:35, Stefan Wang wrote:
> > >>>>>>> Hi Airflow Community,
> > >>>>>>>
> > >>>>>>> I'm excited to share two complementary proposals that address critical
> > >>>>>>> reliability challenges in Airflow, particularly around infrastructure
> > >>>>>>> disruptions and task resilience. These proposals build on insights from
> > >>>>>>> managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily
> > >>>>>>> task executions per cluster).
> > >>>>>>>
> > >>>>>>> Proposals
> > >>>>>>>
> > >>>>>>> 1. Infrastructure-Aware Task Execution and Context Propagation
> > >>>>>>>    https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
> > >>>>>>>
> > >>>>>>> 2. Resumable Operators for Disruption Readiness
> > >>>>>>>    https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
> > >>>>>>>
> > >>>>>>> What We're Solving
> > >>>>>>>
> > >>>>>>> Infrastructure failures consume user retries - Pod evictions shouldn't
> > >>>>>>> count against application retry budgets
> > >>>>>>>
> > >>>>>>> Wasted computation - Worker crashes shouldn't restart healthy 3-hour
> > >>>>>>> Databricks jobs from zero
> > >>>>>>>
> > >>>>>>> How
> > >>>>>>>
> > >>>>>>> Execution Context: Distinguish infrastructure vs application failures
> > >>>>>>> for smarter retry handling
> > >>>>>>>
> > >>>>>>> Resumable Operators: Checkpoint and reconnect to external jobs after
> > >>>>>>> disruptions (follows deferral pattern)
> > >>>>>>>
> > >>>>>>> These approaches have significantly improved reliability and user
> > >>>>>>> experience, and reduced wasted costs in our production environment.
> > >>>>>>>
> > >>>>>>> Looking forward to your feedback on both the problems we're addressing
> > >>>>>>> and the proposed solutions. Both proposals are fully backward compatible
> > >>>>>>> and follow existing Airflow patterns.
> > >>>>>>>
> > >>>>>>> Happy to answer any questions or dive deeper into implementation
> > >>>>>>> details.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>>
> > >>>>>>> Stefan Wang
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > ---------------------------------------------------------------------
> > >>>>>> To unsubscribe, e-mail: [email protected]
> > >>>>>> For additional commands, e-mail: [email protected]
> > >>>>>>
> > >>>>>>
> > >>>
> > >>
> > >>
> > >
> >
> >
>
