Re: https://lists.apache.org/thread/jk1wkt1wh0lm2ovlldnfcpbzr3brxsy1

Thank you, Jarek, for the thoughtful guidance. I really appreciate you taking
the time to walk me through this. I fully agree with your advice about
starting small and building things incrementally, and I'll keep it in mind
throughout this effort.

The proposals aim to address shared reliability challenges that have been
seen across medium- to large-scale Airflow deployments in the community (ref:
the OpenAI talk at Airflow Summit 2025
<https://airflowsummit.org/sessions/2025/airflow-openai/> (Reliability
section), LinkedIn (here in this thread), and Apple, with Xiaodong's thread/AIP
<https://lists.apache.org/thread/yqbtw5l8cpjln4sw7m4x73qb9tffysky>
(specifically External Job Tracking and Polling); I'll follow up there as
well to collaborate):

Better context propagation and infra retry budget: Help distinguish
infrastructure failures (pod evictions, worker crashes) from application
errors, enabling smarter cleanup decisions and protecting user retry budgets.
We already have access to the source-of-truth (SOT) context; we just need to
propagate it better through the existing ecosystem (by passing an additional
optional message, through exception handling, or by some other means).

Resumable Operators (in parallel with Deferrable Operators): Let operators
reconnect to healthy external jobs (Databricks, EMR) after worker disruptions
instead of wastefully restarting them.
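
To make the resumable idea concrete, here is a minimal Python sketch of
"reconnect instead of restart". It is not Airflow code and not the design from
the proposal document: FakeJobService, start_or_resume, and the checkpoint dict
are hypothetical names used only for illustration, and the job states are
assumed. The only point is that a persisted job id lets a retried attempt
attach to a job that is still healthy.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeJobService:
    """Hypothetical stand-in for an external system such as Databricks or EMR."""

    jobs: Dict[str, str] = field(default_factory=dict)
    submitted: int = 0

    def submit(self) -> str:
        # Start a brand-new external job and report its id.
        self.submitted += 1
        job_id = f"job-{self.submitted}"
        self.jobs[job_id] = "RUNNING"
        return job_id

    def status(self, job_id: str) -> Optional[str]:
        # None means the service does not know this job.
        return self.jobs.get(job_id)


def start_or_resume(service: FakeJobService, checkpoint: dict) -> str:
    """Return the job id to poll, reusing a checkpointed job if it is healthy."""
    job_id = checkpoint.get("job_id")
    if job_id is not None and service.status(job_id) in {"RUNNING", "SUCCEEDED"}:
        # The earlier attempt's job survived the worker disruption: reconnect.
        return job_id
    # No checkpoint, or the old job is gone or failed: submit a new one
    # and persist its id so a later attempt can find it.
    job_id = service.submit()
    checkpoint["job_id"] = job_id
    return job_id


service = FakeJobService()
checkpoint: dict = {}  # assumed to be stored somewhere that outlives the worker

first = start_or_resume(service, checkpoint)   # first attempt submits the job
second = start_or_resume(service, checkpoint)  # attempt after a simulated crash
print(first, second, service.submitted)
```

In this sketch the second attempt returns the same job id and no second job is
submitted. If the checkpointed job had failed, a fresh job would be submitted
instead.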

Both are designed to be completely backward compatible and opt-in only, and
both build on existing, well-established Airflow features, hooks, and
patterns (deferral mechanism, execution context).

Rather than pushing for big changes up front in one go, I will break the work
into small, incremental pieces that each provide standalone value. The plan is
to start with the tiniest possible change (e.g., an optional
execution_context parameter, purely additive), to continue contributing in
other areas, especially reliability-related ones, to maintain consistency and
trust, and to keep the broader vision in the design proposal while letting the
implementation evolve based on community feedback.
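
As a rough illustration of what "purely additive" could mean for an optional
execution_context parameter, here is a small Python sketch. All names in it
(FailureSource, ExecutionContext, RetryState, record_failure) are hypothetical
and are not existing Airflow APIs. The fallback to the user budget once the
infra budget is exhausted is also just an assumption for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureSource(Enum):
    INFRASTRUCTURE = "infrastructure"  # e.g. pod eviction, worker crash
    APPLICATION = "application"        # e.g. an exception raised by user code


@dataclass(frozen=True)
class ExecutionContext:
    """Hypothetical carrier for what is known about why an attempt failed."""

    source: FailureSource
    reason: str = ""


@dataclass
class RetryState:
    user_retries_left: int
    infra_retries_left: int


def record_failure(
    state: RetryState,
    execution_context: Optional[ExecutionContext] = None,
) -> bool:
    """Charge a failure to a retry budget; return True if a retry is allowed."""
    is_infra = (
        execution_context is not None
        and execution_context.source is FailureSource.INFRASTRUCTURE
    )
    if is_infra and state.infra_retries_left > 0:
        # Infrastructure failure: spend the separate infra budget and
        # leave the user's retries untouched.
        state.infra_retries_left -= 1
        return True
    # No context supplied (callers that do not opt in), an application error,
    # or an exhausted infra budget: charge the user budget as before.
    if state.user_retries_left > 0:
        state.user_retries_left -= 1
        return True
    return False


state = RetryState(user_retries_left=2, infra_retries_left=1)
evicted = ExecutionContext(FailureSource.INFRASTRUCTURE, "pod evicted")
print(record_failure(state, evicted), state)  # infra budget is charged
print(record_failure(state), state)           # no context: user budget is charged
```

Because the parameter defaults to None, callers that never pass it get the
same accounting they have today, which is the sense in which the change is
opt-in.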

I want to make sure this is done in a way that's most beneficial to the
community. Guidance and support from you and others in the community will
help a lot in approaching this the right way. Thank you!

Best,
Stefan


> On Dec 2, 2025, at 12:24 AM, Stefan Wang <[email protected]> wrote:
> 
> Hi Jens,
> 
> Thank you so much for the help and for being so supportive — it’s working for 
> me now!
> 
> Really appreciate you stepping in.
> 
> Best,
> Stefan
> 
> 
>> On Nov 30, 2025, at 12:27 AM, Jens Scheffler <[email protected]> wrote:
>> 
>> As PMC we are space owners, added your permissions for the user stefwang to 
>> the Airflow space. Hope now it is working.
>> 
>> On 11/30/25 04:54, Stefan Wang wrote:
>>> Apologies for the late response folks while I had oncall shifts. Catching 
>>> up here and will respond to each comment in order.
>>> 
>>> 
>>> 
>>> —
>>> 
>>> 
>>> 
>>> Re: https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4 from 
>>> Jens Scheffler
>>> 
>>> Hi Jens,
>>> 
>>> Thanks for the suggestion! I completely agree that following the formal AIP 
>>> process is the right approach.
>>> 
>>> I've been trying to create the AIPs on the Confluence wiki, but I'm running 
>>> into permission issues. When I click the "Create new AIP" button on the AIP 
>>> page, I get a "Sorry, you don't have permission to create content" error.
>>> 
>>> I've tried following the exact step listed to create ASF confluence account 
>>> however neither has EDIT access granted under the AIRFLOW space, created 
>>> two accounts (stewang and stefwang) to rule out any account-specific 
>>> issues, but both accounts have the same problem. Would really appreciate 
>>> some expertise in this area to help point me to who we should contact to 
>>> get the appropriate permissions, or is there a specific access request 
>>> process I should follow? - Or if someone else with edit access could help 
>>> copy paste the google doc content into Confluence for comments, thanks a 
>>> lot!
>>> 
>>> I’ll try to contact ASF infra support in the mean time, and will work on 
>>> migrate the Google Docs to Confluence once I have access.
>>> 
>>> Thanks, Stefan
>>> 
>>> 
>>>> On Nov 15, 2025, at 6:27 AM, Christos Bisias <[email protected]> wrote:
>>>> 
>>>> Hi Stefan,
>>>> 
>>>> Thank you for the work! Very well organised and easy to follow docs.
>>>> 
>>>> I have been thinking about infrastructure retries for a while now. Also, I
>>>> had a few discussions at the Airflow Summit last month and I know that
>>>> others are interested as well.
>>>> 
>>>> It looks to me too, that this will be split into multiple PRs but if there
>>>> is a code POC, I would like to take a look.
>>>> 
>>>> Regards,
>>>> Christos
>>>> 
>>>> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> wrote:
>>>> 
>>>>> Also something we discussed off-line: I think the scope of it is quite
>>>>> "huge" - but there are small and incremental improvements, that might not
>>>>> even require AIP that can be implemented as PRs., I think it's great to
>>>>> keep "big hairy vision" in head (like I did several years ago when I
>>>>> proposed a "small" improvement in our dependency management that took 
>>>>> about
>>>>> 4 years to get to the stage I thought it would take a few weeks.
>>>>> 
>>>>> Getting incremental improvements and showing the dedication, merit and
>>>>> consistent pattern of improvements is a key to get - eventually - big and
>>>>> "world-changing" changes.
>>>>> 
>>>>> J.
>>>>> 
>>>>> 
>>>>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hi Stefan,
>>>>>> 
>>>>>> thanks for dropping the proposals!
>>>>>> 
>>>>>> I'd propose to store the documents in cWiki and open them formally in
>>>>>> there as AIP proposal as then it is sollowing the AIP process.
>>>>>> 
>>>>>> See
>>>>>> 
>>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>>>>>> Jens
>>>>>> 
>>>>>> On 11/14/25 12:35, Stefan Wang wrote:
>>>>>>> Hi Airflow Community,
>>>>>>> 
>>>>>>> I'm excited to share two complementary proposals that address critical
>>>>>> reliability challenges in Airflow, particularly around infrastructure
>>>>>> disruptions and task resilience. These proposals build on insights from
>>>>>> managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily
>>>>> task
>>>>>> executions per cluster).
>>>>>>> Proposals
>>>>>>> 
>>>>>>> 1. Infrastructure-Aware Task Execution and Context Propagation
>>>>>>> 
>>>>>>> 
>>>>> https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
>>>>>>> 2. Resumable Operators for Disruption Readiness
>>>>>>> 
>>>>>>> 
>>>>> https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
>>>>>>> What We're Solving
>>>>>>> 
>>>>>>> Infrastructure failures consume user retries - Pod evictions shouldn't
>>>>>> count against application retry budgets
>>>>>>> Wasted computation - Worker crashes shouldn't restart healthy 3-hour
>>>>>> Databricks jobs from zero
>>>>>>> How
>>>>>>> 
>>>>>>> Execution Context: Distinguish infrastructure vs application failures
>>>>>> for smarter retry handling
>>>>>>> Resumable Operators: Checkpoint and reconnect to external jobs after
>>>>>> disruptions (follows deferral pattern)
>>>>>>> These approaches have significantly improved reliability and user
>>>>>> experience, and reduced wasted costs in our production environment.
>>>>>>> Looking forward to your feedback on both the problems we're addressing
>>>>>> and the proposed solutions. Both proposals are fully backward compatible
>>>>>> and follow existing Airflow patterns.
>>>>>>> Happy to answer any questions or dive deeper into implementation
>>>>> details.
>>>>>>> Best,
>>>>>>> 
>>>>>>> Stefan Wang
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>> 
>>>>>> 
>>> 
>> 
>> 
> 
