Re: [DISCUSS] Infrastructure-Aware Task Execution and Resumable Operators - Proposals for Reliability

Jarek Potiuk Tue, 02 Dec 2025 01:46:59 -0800

One comment here. I looked at your proposals again yesterday, and they are
really well thought out.
One thing I see in them, however, is a recurring pattern that we have had
in many discussions:


*Storing state in Airflow*

This has come up in a number of past threads (recent and not-so-recent).
I tried to collect them here (in reverse chronological order):

* XD's discussion: `Add "persist_xcom_through_retry" Parameter to Airflow
Operators` here
https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky
* Your proposal here (partially): Infrastructure-Aware Task Execution and
Resumable Operators
* Jake and Guangyang Li: [WIP] AIP-93 Asset Watermarks and State
Variables https://lists.apache.org/thread/vftpzrwb34xr2xbfsx7qtbxn5w6h3f2b
* Daniel's old "State Persistence" AIP:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence

Likely more.

I think it's fairly clear that we need state persistence. There are
various ways people have wanted to address it:

* XD's proposal was to piggyback on XComs and add options to not delete
them on resume
* Jake and Guangyang proposed State Variables that would be bound to
Assets
* Daniel proposed a broader AIP that addresses persistence needs potentially
on various levels (task, dag variable, etc.), with a proposal to use separate
ProcessState, TaskState, and TaskInstanceState (solutions 3, 5 and 6). If
that approach is followed, it would probably now extend to AssetState as well
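
To make the common thread concrete, here is a minimal sketch of how one
keyed state store could serve the task-level, task-instance-level and
asset-level cases above. Every name in it (StateKey, InMemoryStateStore,
run_external_job) is hypothetical and made up for illustration; none of it
is an existing Airflow API, and a real design would need a persistent
backend, access control and cleanup rules.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical names throughout: this is NOT an existing Airflow API.


@dataclass(frozen=True)
class StateKey:
    scope: str  # assumed scopes: "task", "task_instance", "asset"
    owner: str  # e.g. "my_dag.my_task" or an asset URI
    name: str   # e.g. "external_job_id" or "watermark"


@dataclass
class InMemoryStateStore:
    """Toy backend; a real one would persist to a database."""

    _data: Dict[StateKey, Any] = field(default_factory=dict)

    def set(self, key: StateKey, value: Any) -> None:
        self._data[key] = value

    def get(self, key: StateKey, default: Any = None) -> Any:
        return self._data.get(key, default)

    def delete(self, key: StateKey) -> None:
        self._data.pop(key, None)


def run_external_job(
    store: InMemoryStateStore, task_id: str, submit: Callable[[], str]
) -> str:
    """Submit an external job once; later attempts reuse the stored job id."""
    key = StateKey("task", task_id, "external_job_id")
    job_id = store.get(key)
    if job_id is None:
        job_id = submit()       # first attempt: start the job
        store.set(key, job_id)  # survives a retry of the same task
    return job_id
```

With a store like this, a retried task would call run_external_job again
and get back the stored job id instead of submitting a new job, and an
asset watermark would simply be another key with scope "asset".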

Maybe it's a good time to join efforts and propose a single solution
that can help address all those "state persistence" needs?

I think we now have enough concrete use cases, from the above proposals
and probably more, to make a single proposal that addresses all of these
needs. We have a number of smart people who, if they discuss and work
together on a single solution, are likely to come up with a good proposal
**just** on state persistence that will be usable for all those cases.

Stefan, if you were to break your proposals into smaller pieces and
incremental deliverables, I would say that getting this one done not only
moves your ideas forward, but also moves forward many other ideas that
could be implemented in parallel as a next step, once this "foundational"
state persistence is added with a very simple use case to start with.
That would make it a perfect approach: band together to build a
foundational feature, then split off and work on all those other ideas
in parallel.

We just need someone to volunteer and lead the efforts - and others here to
join and do the work together.

J.


On Tue, Dec 2, 2025 at 9:49 AM Stefan Wang <[email protected]> wrote:

> Re: https://lists.apache.org/thread/jk1wkt1wh0lm2ovlldnfcpbzr3brxsy1
>
> Thank you Jarek for the thoughtful guidance — I really appreciate you
> taking the time to guide me through this. Totally agree with your advice
> about starting small and building things incrementally, and I'll keep it in
> mind throughout this effort.
>
> The proposals aim to address shared reliability challenges that have been
> seen across medium to large scale Airflow deployments in the community
> (ref: OpenAI 2025 Airflow Summit Talk <
> https://airflowsummit.org/sessions/2025/airflow-openai/> (Reliability
> Section), LinkedIn (here in this thread), and Apple with Xiaodong's
> thread/AIP <
> https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky>
> (specifically External Job Tracking and Polling) - I’ll follow up in there
> as well to collaborate):
>
> Better Context propagation and Infra Retry budget: Help distinguish
> infrastructure failures (pod evictions, worker crashes) from application
> errors for smarter cleanup decisions and protected user retry budgets - we
> already have access to the SOT context - just need to propagate it better
> in the existing ecosystem (through passing additional optional msg or
> exception handling, or something else)
>
> Resumable Operators (in parallel with Deferrable Operators): Let operators
> reconnect to healthy external jobs (Databricks, EMR) after worker
> disruptions instead of wastefully restarting
>
> Both are designed to be completely backward compatible, opt-in only, and
> designed with specific leverage on existing well-established Airflow
> features, hooks, and patterns (deferral mechanism, execution context).
>
> Rather than pushing for big changes upfront in one go, throughout this
> effort, things will be broken into small, incremental pieces that each
> provide standalone value. Start with the tiniest possible change (e.g.,
> optional execution_context parameter — purely additive). Continue
> contributing in other areas especially reliability-related, to maintain
> consistency and trust. Keep the broader vision in the design proposal, but
> let the implementation evolve based on community feedback.
>
> I want to make sure this is done in a way that's most beneficial to the
> community. Guidance and support from you and others in the community
> overall will help us a lot in approaching this the right way. Thank you!
>
> Best,
> Stefan
>
>
> > On Dec 2, 2025, at 12:24 AM, Stefan Wang <[email protected]> wrote:
> >
> > Hi Jens,
> >
> > Thank you so much for the help and for being so supportive — it’s
> working for me now!
> >
> > Really appreciate you stepping in.
> >
> > Best,
> > Stefan
> >
> >
> >> On Nov 30, 2025, at 12:27 AM, Jens Scheffler <[email protected]>
> wrote:
> >>
> >> As PMC we are space owners, so I added permissions for the user
> >> stefwang to the Airflow space. Hope it is working now.
> >>
> >> On 11/30/25 04:54, Stefan Wang wrote:
> >>> Apologies for the late response folks while I had oncall shifts.
> Catching up here and will respond to each comment in order.
> >>>
> >>>
> >>>
> >>> —
> >>>
> >>>
> >>>
> >>> Re: https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4
> from Jens Scheffler
> >>>
> >>> Hi Jens,
> >>>
> >>> Thanks for the suggestion! I completely agree that following the
> formal AIP process is the right approach.
> >>>
> >>> I've been trying to create the AIPs on the Confluence wiki, but I'm
> running into permission issues. When I click the "Create new AIP" button on
> the AIP page, I get a "Sorry, you don't have permission to create content"
> error.
> >>>
> >>> I followed the listed steps to create an ASF Confluence account, and
> >>> created two accounts (stewang and stefwang) to rule out any
> >>> account-specific issues, but neither has EDIT access to the AIRFLOW
> >>> space. Could someone point me to who we should contact to get the
> >>> appropriate permissions, or to the access request process I should
> >>> follow? Alternatively, if someone with edit access could copy the
> >>> Google Doc content into Confluence for comments, that would help a lot.
> >>> Thanks!
> >>>
> >>> I’ll try to contact ASF infra support in the meantime, and will work
> >>> on migrating the Google Docs to Confluence once I have access.
> >>>
> >>> Thanks, Stefan
> >>>
> >>>
> >>>> On Nov 15, 2025, at 6:27 AM, Christos Bisias <[email protected]>
> wrote:
> >>>>
> >>>> Hi Stefan,
> >>>>
> >>>> Thank you for the work! Very well organised and easy to follow docs.
> >>>>
> >>>> I have been thinking about infrastructure retries for a while now.
> Also, I
> >>>> had a few discussions at the Airflow Summit last month and I know that
> >>>> others are interested as well.
> >>>>
> >>>> It looks to me too, that this will be split into multiple PRs but if
> there
> >>>> is a code POC, I would like to take a look.
> >>>>
> >>>> Regards,
> >>>> Christos
> >>>>
> >>>> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]>
> wrote:
> >>>>
> >>>>> Also something we discussed off-line: I think the scope of it is
> >>>>> quite "huge", but there are small and incremental improvements,
> >>>>> which might not even require an AIP, that can be implemented as PRs.
> >>>>> I think it's great to keep the "big hairy vision" in your head (like
> >>>>> I did several years ago when I proposed a "small" improvement in our
> >>>>> dependency management that took about 4 years to get to the stage I
> >>>>> thought would take a few weeks).
> >>>>>
> >>>>> Getting incremental improvements in, and showing dedication, merit
> >>>>> and a consistent pattern of improvements, is key to eventually
> >>>>> getting big and "world-changing" changes.
> >>>>>
> >>>>> J.
> >>>>>
> >>>>>
> >>>>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]
> >
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Stefan,
> >>>>>>
> >>>>>> thanks for dropping the proposals!
> >>>>>>
> >>>>>> I'd propose to store the documents in cWiki and open them formally
> >>>>>> there as AIP proposals, as that follows the AIP process.
> >>>>>>
> >>>>>> See
> >>>>>>
> >>>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> >>>>>> Jens
> >>>>>>
> >>>>>> On 11/14/25 12:35, Stefan Wang wrote:
> >>>>>>> Hi Airflow Community,
> >>>>>>>
> >>>>>>> I'm excited to share two complementary proposals that address
> critical
> >>>>>> reliability challenges in Airflow, particularly around
> infrastructure
> >>>>>> disruptions and task resilience. These proposals build on insights
> from
> >>>>>> managing one of the larger Airflow deployments (20k+ DAGs, 100k+
> daily
> >>>>> task
> >>>>>> executions per cluster).
> >>>>>>> Proposals
> >>>>>>>
> >>>>>>> 1. Infrastructure-Aware Task Execution and Context Propagation
> >>>>>>>
> >>>>>>>
> >>>>>
> https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
> >>>>>>> 2. Resumable Operators for Disruption Readiness
> >>>>>>>
> >>>>>>>
> >>>>>
> https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
> >>>>>>> What We're Solving
> >>>>>>>
> >>>>>>> Infrastructure failures consume user retries - Pod evictions
> shouldn't
> >>>>>> count against application retry budgets
> >>>>>>> Wasted computation - Worker crashes shouldn't restart healthy
> 3-hour
> >>>>>> Databricks jobs from zero
> >>>>>>> How
> >>>>>>>
> >>>>>>> Execution Context: Distinguish infrastructure vs application
> failures
> >>>>>> for smarter retry handling
> >>>>>>> Resumable Operators: Checkpoint and reconnect to external jobs
> after
> >>>>>> disruptions (follows deferral pattern)
> >>>>>>> These approaches have significantly improved reliability and user
> >>>>>> experience, and reduced wasted costs in our production environment.
> >>>>>>> Looking forward to your feedback on both the problems we're
> addressing
> >>>>>> and the proposed solutions. Both proposals are fully backward
> compatible
> >>>>>> and follow existing Airflow patterns.
> >>>>>>> Happy to answer any questions or dive deeper into implementation
> >>>>> details.
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Stefan Wang
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>> For additional commands, e-mail: [email protected]
> >>>>>>
> >>>>>>
> >>>
> >>
> >>
> >
>
>
