As PMC members we are space owners, so I added permissions for the user stefwang to the Airflow space. I hope it works now.

On 11/30/25 04:54, Stefan Wang wrote:
Apologies for the late response, folks; I had on-call shifts. Catching up 
here and will respond to each comment in order.



—



Re: https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4 from Jens 
Scheffler

Hi Jens,

Thanks for the suggestion! I completely agree that following the formal AIP 
process is the right approach.

I've been trying to create the AIPs on the Confluence wiki, but I'm running into permission issues. 
When I click the "Create new AIP" button on the AIP page, I get a "Sorry, you don't 
have permission to create content" error.

I followed the listed steps to create an ASF Confluence account, but the 
account has no EDIT access to the AIRFLOW space. I created two accounts 
(stewang and stefwang) to rule out account-specific issues, and both have 
the same problem. I would really appreciate a pointer to whom we should 
contact to get the appropriate permissions, or to a specific access request 
process I should follow. Alternatively, if someone with edit access could 
copy the Google Doc content into Confluence for comments, that would help 
a lot. Thanks!

In the meantime I'll try to contact ASF Infra support, and I will migrate 
the Google Docs to Confluence once I have access.

Thanks, Stefan


On Nov 15, 2025, at 6:27 AM, Christos Bisias <[email protected]> wrote:

Hi Stefan,

Thank you for the work! Very well organised and easy to follow docs.

I have been thinking about infrastructure retries for a while now. Also, I
had a few discussions at the Airflow Summit last month and I know that
others are interested as well.

It looks to me, too, that this will be split into multiple PRs, but if
there is a code POC, I would like to take a look.

Regards,
Christos

On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> wrote:

Also something we discussed off-line: I think the scope of it is quite
"huge", but there are small, incremental improvements that might not even
require an AIP and can be implemented as PRs. I think it's great to keep
the "big hairy vision" in mind (like I did several years ago when I
proposed a "small" improvement in our dependency management; it took about
4 years to reach the stage I thought would take a few weeks).

Getting incremental improvements and showing the dedication, merit and
consistent pattern of improvements is a key to get - eventually - big and
"world-changing" changes.

J.


On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]>
wrote:

Hi Stefan,

thanks for dropping the proposals!

I'd propose storing the documents in cWiki and opening them formally
there as AIP proposals, as that follows the AIP process.

See
https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals

Jens

On 11/14/25 12:35, Stefan Wang wrote:
Hi Airflow Community,

I'm excited to share two complementary proposals that address critical
reliability challenges in Airflow, particularly around infrastructure
disruptions and task resilience. These proposals build on insights from
managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily
task executions per cluster).

Proposals

1. Infrastructure-Aware Task Execution and Context Propagation
   https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
2. Resumable Operators for Disruption Readiness
   https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI

What We're Solving

- Infrastructure failures consume user retries - Pod evictions shouldn't
  count against application retry budgets
- Wasted computation - Worker crashes shouldn't restart healthy 3-hour
  Databricks jobs from zero

How

- Execution Context: Distinguish infrastructure vs application failures
  for smarter retry handling
- Resumable Operators: Checkpoint and reconnect to external jobs after
  disruptions (follows deferral pattern)
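
To make the first idea concrete, here is a rough Python sketch of keeping
separate retry budgets for infrastructure and application failures. It is
not taken from either proposal and is not an Airflow API: the names
FailureKind and RetryBudget, and the idea of a dedicated infrastructure
budget, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    """Hypothetical classification of why a task attempt failed."""
    INFRASTRUCTURE = "infrastructure"  # e.g. pod eviction, worker crash
    APPLICATION = "application"        # e.g. an exception raised by user code


@dataclass
class RetryBudget:
    """Hypothetical bookkeeping object; not an Airflow class."""
    app_retries_left: int
    infra_retries_left: int

    def should_retry(self, kind: FailureKind) -> bool:
        """Consume one retry from the budget that matches the failure kind."""
        if kind is FailureKind.INFRASTRUCTURE:
            if self.infra_retries_left > 0:
                self.infra_retries_left -= 1
                return True
            return False
        if self.app_retries_left > 0:
            self.app_retries_left -= 1
            return True
        return False


# A pod eviction is retried without touching the application budget.
budget = RetryBudget(app_retries_left=1, infra_retries_left=2)
retried_after_eviction = budget.should_retry(FailureKind.INFRASTRUCTURE)
```

In this sketch an eviction only draws down the infrastructure budget, so
the retries the user configured stay available for genuine application
errors.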
These approaches have significantly improved reliability and user
experience, and reduced wasted costs in our production environment.

Looking forward to your feedback on both the problems we're addressing
and the proposed solutions. Both proposals are fully backward compatible
and follow existing Airflow patterns.

Happy to answer any questions or dive deeper into implementation
details.

Best,

Stefan Wang



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]




