So looks like we have MORE people who would like to join the efforts :D

On Tue, Dec 2, 2025 at 1:35 PM Maciej Obuchowski <[email protected]>
wrote:

> Just to add to the pile of use cases:
> that mechanism would also be useful for the listeners/OpenLineage
> integration: it could store the necessary lineage data post-execution, so
> that the OpenLineage events can be sent asynchronously rather than running
> on the worker and blocking an execution slot.
>
> Thanks,
> Maciej
>
> On Tue, Dec 2, 2025 at 10:45 AM Jarek Potiuk <[email protected]> wrote:
>
> > One comment here. I looked yesterday again at your proposals, and they are
> > really well thought out.
> > One thing however that I see in it is something of a recurring pattern we
> > have in many discussions:
> >
> > *Storing state in Airflow*
> >
> > This has been discussed in a number of discussions in the past (recent and
> > not-so-recent). I tried to put them together here (in reverse chronological
> > order):
> >
> > * XD's discussion: `Add "persist_xcom_through_retry" Parameter to Airflow
> > Operators` here
> > https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky
> > * Your proposal here - partially - Infrastructure-Aware Task Execution and
> > Resumable Operators
> > * Jake and Guangyang Li  - [WIP] AIP-93 Asset Watermarks and State
> > Variables
> > https://lists.apache.org/thread/vftpzrwb34xr2xbfsx7qtbxn5w6h3f2b
> > * Daniel's old "State Persistence" AIP ->
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence
> >
> > Likely more.
> >
> > I think it's fairly clear that we need state persistence. And there are
> > various ways people wanted to address it:
> >
> > * XD's proposal was to piggyback on XComs and add options to not delete
> > them on resume
> > * Jake and Guangyang proposed State Variables that would be bound to
> > Assets
> > * Daniel proposed a broader AIP that addresses the persistence need
> > potentially on various levels (task, dag variable, etc.), with a proposal
> > to use separate ProcessState, TaskState, and TaskInstanceState (solutions
> > 3, 5 and 6). If that approach is followed, it would probably now extend to
> > AssetState as well
> >
> > Maybe it's a good time to join the efforts and propose a single solution
> > that can help to address all those "state persistence" needs ?
> >
> > I think we now have enough concrete use cases - from the above proposals
> > and probably more - to make a single proposal that will be usable to
> > address all of the needs. We have a number of smart people who, if they
> > discuss and work together on a single solution, are likely to come to a
> > good proposal **just** on state persistence that will be usable for all
> > those cases?
> >
> > If you were to break your proposals, Stefan, into smaller pieces and
> > incremental deliverables, I would say that getting this one done not only
> > moves your ideas forward, but also moves forward many other ideas that
> > could be implemented in parallel as a next step after this "foundational"
> > state persistence is added, with some very simple use case to start with.
> > That would make it a perfect approach - band together to make a
> > foundational feature, so that then you can split off and work on all those
> > other ideas in parallel.
> >
> > We just need someone to volunteer and lead the efforts - and others here to
> > join and do the work together.
> >
> > J.
> >
> >
> > On Tue, Dec 2, 2025 at 9:49 AM Stefan Wang <[email protected]> wrote:
> >
> > > Re: https://lists.apache.org/thread/jk1wkt1wh0lm2ovlldnfcpbzr3brxsy1
> > >
> > > Thank you Jarek for the thoughtful guidance - I really appreciate you
> > > taking the time to guide me through this. Totally agree with your advice
> > > about starting small and building things incrementally, and I'll keep it
> > > in mind throughout this effort.
> > >
> > > The proposals aim to address shared reliability challenges that have been
> > > seen across medium to large scale Airflow deployments in the community
> > > (ref: OpenAI 2025 Airflow Summit Talk
> > > <https://airflowsummit.org/sessions/2025/airflow-openai/> (Reliability
> > > Section), LinkedIn (here in this thread), and Apple with Xiaodong's
> > > thread/AIP
> > > <https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky>
> > > (specifically External Job Tracking and Polling) - I'll follow up in
> > > there as well to collaborate):
> > >
> > > Better Context propagation and Infra Retry budget: Help distinguish
> > > infrastructure failures (pod evictions, worker crashes) from application
> > > errors for smarter cleanup decisions and protected user retry budgets. We
> > > already have access to the SOT context; we just need to propagate it
> > > better in the existing ecosystem (by passing an additional optional
> > > message, through exception handling, or something else).
> > >
> > > Resumable Operators (in parallel with Deferrable Operators): Let operators
> > > reconnect to healthy external jobs (Databricks, EMR) after worker
> > > disruptions instead of wastefully restarting
> > >
> > > Both are designed to be completely backward compatible and opt-in only,
> > > and to build on existing well-established Airflow features, hooks, and
> > > patterns (deferral mechanism, execution context).
> > >
> > > Rather than pushing for big changes upfront in one go, throughout this
> > > effort, things will be broken into small, incremental pieces that each
> > > provide standalone value. Start with the tiniest possible change (e.g.,
> > > an optional execution_context parameter, purely additive). Continue
> > > contributing in other areas, especially reliability-related ones, to
> > > maintain consistency and trust. Keep the broader vision in the design
> > > proposal, but let the implementation evolve based on community feedback.
> > >
> > > I want to make sure this is done in a way that's most beneficial to the
> > > community. Guidance and support from you and others in the community
> > > overall will help us a lot in approaching this the right way. Thank you!
> > >
> > > Best,
> > > Stefan
> > >
> > >
> > > > On Dec 2, 2025, at 12:24 AM, Stefan Wang <[email protected]> wrote:
> > > >
> > > > Hi Jens,
> > > >
> > > > Thank you so much for the help and for being so supportive - it's
> > > > working for me now!
> > > >
> > > > Really appreciate you stepping in.
> > > >
> > > > Best,
> > > > Stefan
> > > >
> > > >
> > > >> On Nov 30, 2025, at 12:27 AM, Jens Scheffler <[email protected]> wrote:
> > > >>
> > > >> As PMC we are space owners, added your permissions for the user
> > > >> stefwang to the Airflow space. Hope now it is working.
> > > >>
> > > >> On 11/30/25 04:54, Stefan Wang wrote:
> > > >>> Apologies for the late response, folks; I had on-call shifts.
> > > >>> Catching up here and will respond to each comment in order.
> > > >>>
> > > >>>
> > > >>>
> > > >>> —
> > > >>>
> > > >>>
> > > >>>
> > > >>> Re: https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4
> > > >>> from Jens Scheffler
> > > >>>
> > > >>> Hi Jens,
> > > >>>
> > > >>> Thanks for the suggestion! I completely agree that following the
> > > >>> formal AIP process is the right approach.
> > > >>>
> > > >>> I've been trying to create the AIPs on the Confluence wiki, but I'm
> > > >>> running into permission issues. When I click the "Create new AIP"
> > > >>> button on the AIP page, I get a "Sorry, you don't have permission to
> > > >>> create content" error.
> > > >>>
> > > >>> I followed the exact steps listed to create an ASF Confluence account,
> > > >>> but the account was not granted EDIT access under the AIRFLOW space. I
> > > >>> created two accounts (stewang and stefwang) to rule out any
> > > >>> account-specific issues, but both have the same problem. I would
> > > >>> really appreciate help from someone with expertise in this area: who
> > > >>> should we contact to get the appropriate permissions, or is there a
> > > >>> specific access request process I should follow? Alternatively, could
> > > >>> someone with edit access copy the Google Doc content into Confluence
> > > >>> for comments? Thanks a lot!
> > > >>>
> > > >>> I'll try to contact ASF infra support in the meantime, and will work
> > > >>> on migrating the Google Docs to Confluence once I have access.
> > > >>>
> > > >>> Thanks, Stefan
> > > >>>
> > > >>>
> > > >>>> On Nov 15, 2025, at 6:27 AM, Christos Bisias <[email protected]> wrote:
> > > >>>>
> > > >>>> Hi Stefan,
> > > >>>>
> > > >>>> Thank you for the work! Very well organised and easy to follow docs.
> > > >>>>
> > > >>>> I have been thinking about infrastructure retries for a while now.
> > > >>>> Also, I had a few discussions at the Airflow Summit last month and I
> > > >>>> know that others are interested as well.
> > > >>>>
> > > >>>> It looks to me too that this will be split into multiple PRs, but if
> > > >>>> there is a code POC, I would like to take a look.
> > > >>>>
> > > >>>> Regards,
> > > >>>> Christos
> > > >>>>
> > > >>>> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> wrote:
> > > >>>>
> > > >>>>> Also something we discussed off-line: I think the scope of it is
> > > >>>>> quite "huge" - but there are small and incremental improvements,
> > > >>>>> which might not even require an AIP, that can be implemented as PRs.
> > > >>>>> I think it's great to keep the "big hairy vision" in your head (like
> > > >>>>> I did several years ago when I proposed a "small" improvement in our
> > > >>>>> dependency management: it took about 4 years to get to the stage I
> > > >>>>> thought it would take a few weeks to reach).
> > > >>>>>
> > > >>>>> Getting incremental improvements and showing dedication, merit and
> > > >>>>> a consistent pattern of improvements is key to getting - eventually -
> > > >>>>> big and "world-changing" changes.
> > > >>>>>
> > > >>>>> J.
> > > >>>>>
> > > >>>>>
> > > >>>>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]> wrote:
> > > >>>>>
> > > >>>>>> Hi Stefan,
> > > >>>>>>
> > > >>>>>> thanks for dropping the proposals!
> > > >>>>>>
> > > >>>>>> I'd propose to store the documents in cWiki and open them formally
> > > >>>>>> there as AIP proposals, as that follows the AIP process.
> > > >>>>>>
> > > >>>>>> See
> > > >>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
> > > >>>>>>
> > > >>>>>> Jens
> > > >>>>>>
> > > >>>>>> On 11/14/25 12:35, Stefan Wang wrote:
> > > >>>>>>> Hi Airflow Community,
> > > >>>>>>>
> > > >>>>>>> I'm excited to share two complementary proposals that address
> > > >>>>>>> critical reliability challenges in Airflow, particularly around
> > > >>>>>>> infrastructure disruptions and task resilience. These proposals
> > > >>>>>>> build on insights from managing one of the larger Airflow
> > > >>>>>>> deployments (20k+ DAGs, 100k+ daily task executions per cluster).
> > > >>>>>>>
> > > >>>>>>> Proposals
> > > >>>>>>>
> > > >>>>>>> 1. Infrastructure-Aware Task Execution and Context Propagation
> > > >>>>>>> https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
> > > >>>>>>>
> > > >>>>>>> 2. Resumable Operators for Disruption Readiness
> > > >>>>>>> https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
> > > >>>>>>>
> > > >>>>>>> What We're Solving
> > > >>>>>>>
> > > >>>>>>> Infrastructure failures consume user retries - Pod evictions
> > > >>>>>>> shouldn't count against application retry budgets
> > > >>>>>>> Wasted computation - Worker crashes shouldn't restart healthy
> > > >>>>>>> 3-hour Databricks jobs from zero
> > > >>>>>>>
> > > >>>>>>> How
> > > >>>>>>>
> > > >>>>>>> Execution Context: Distinguish infrastructure vs application
> > > >>>>>>> failures for smarter retry handling
> > > >>>>>>> Resumable Operators: Checkpoint and reconnect to external jobs
> > > >>>>>>> after disruptions (follows deferral pattern)
> > > >>>>>>>
> > > >>>>>>> These approaches have significantly improved reliability and user
> > > >>>>>>> experience, and reduced wasted costs in our production environment.
> > > >>>>>>>
> > > >>>>>>> Looking forward to your feedback on both the problems we're
> > > >>>>>>> addressing and the proposed solutions. Both proposals are fully
> > > >>>>>>> backward compatible and follow existing Airflow patterns.
> > > >>>>>>>
> > > >>>>>>> Happy to answer any questions or dive deeper into implementation
> > > >>>>>>> details.
> > > >>>>>>>
> > > >>>>>>> Best,
> > > >>>>>>>
> > > >>>>>>> Stefan Wang
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>> ---------------------------------------------------------------------
> > > >>>>>> To unsubscribe, e-mail: [email protected]
> > > >>>>>> For additional commands, e-mail: [email protected]
> > > >>>>>>
> > > >>>>>>
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >
> > >
> > >
> >
>
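
To make the common thread here concrete, below is a minimal sketch (plain Python, not Airflow code) of how persisted state could let an operator reconnect to an external job instead of resubmitting it after a worker disruption. Every name in it (InMemoryStateStore, FakeJobClient, run_resumable, the task key format) is hypothetical and comes neither from the proposals nor from any Airflow API; a real implementation would use whatever storage and interface the eventual state persistence proposal defines.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InMemoryStateStore:
    """Hypothetical stand-in for a persisted task-state backend."""

    _data: Dict[str, str] = field(default_factory=dict)

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


@dataclass
class FakeJobClient:
    """Hypothetical external system (think Databricks or EMR) that runs jobs."""

    submitted: int = 0

    def submit(self) -> str:
        self.submitted += 1
        return f"job-{self.submitted}"

    def is_healthy(self, job_id: str) -> bool:
        # Assumption for the sketch: every previously submitted job is still healthy.
        return True


def run_resumable(task_key: str, store: InMemoryStateStore, client: FakeJobClient) -> str:
    """Return the external job id, reusing a checkpointed job when possible."""
    job_id = store.get(task_key)
    if job_id is not None and client.is_healthy(job_id):
        # A healthy job from an earlier attempt exists: reconnect, do not resubmit.
        return job_id
    # No usable checkpoint: submit a new job and checkpoint its id right away.
    job_id = client.submit()
    store.set(task_key, job_id)
    return job_id


if __name__ == "__main__":
    store = InMemoryStateStore()
    client = FakeJobClient()
    first_attempt = run_resumable("example_dag.example_task.run_1", store, client)
    # Simulate the worker being lost and the task starting over with the same key.
    second_attempt = run_resumable("example_dag.example_task.run_1", store, client)
    print(first_attempt, second_attempt, client.submitted)
```

The point of the sketch is only that the checkpoint has to live outside the worker process: the same store could equally hold lineage data for later asynchronous sending, or values that should survive a retry, which is why a single state persistence mechanism could serve all of the use cases listed in this thread.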
