Hi Jens,

Thank you so much for the help and for being so supportive. It’s working for me now!
Really appreciate you stepping in.

Best,
Stefan

> On Nov 30, 2025, at 12:27 AM, Jens Scheffler <[email protected]> wrote:
>
> As PMC we are space owners, added your permissions for the user stefwang to
> the Airflow space. Hope now it is working.
>
> On 11/30/25 04:54, Stefan Wang wrote:
>> Apologies for the late response folks while I had oncall shifts. Catching up
>> here and will respond to each comment in order.
>>
>> ---
>>
>> Re: https://lists.apache.org/thread/j02owr28cjw7zyyrp938fqt69nbmyxy4 from
>> Jens Scheffler
>>
>> Hi Jens,
>>
>> Thanks for the suggestion! I completely agree that following the formal AIP
>> process is the right approach.
>>
>> I've been trying to create the AIPs on the Confluence wiki, but I'm running
>> into permission issues. When I click the "Create new AIP" button on the AIP
>> page, I get a "Sorry, you don't have permission to create content" error.
>>
>> I followed the listed steps to create an ASF Confluence account and created
>> two accounts (stewang and stefwang) to rule out any account-specific issues,
>> but neither has EDIT access under the AIRFLOW space. I would really
>> appreciate a pointer to whom we should contact to get the appropriate
>> permissions, or to any specific access request process I should follow.
>> Alternatively, if someone with edit access could copy the Google Doc content
>> into Confluence for comments, thanks a lot!
>>
>> I’ll try to contact ASF infra support in the meantime, and will migrate the
>> Google Docs to Confluence once I have access.
>>
>> Thanks,
>> Stefan
>>
>>> On Nov 15, 2025, at 6:27 AM, Christos Bisias <[email protected]> wrote:
>>>
>>> Hi Stefan,
>>>
>>> Thank you for the work! Very well organised and easy to follow docs.
>>>
>>> I have been thinking about infrastructure retries for a while now.
>>> Also, I had a few discussions at the Airflow Summit last month and I know
>>> that others are interested as well.
>>>
>>> It looks to me too, that this will be split into multiple PRs but if there
>>> is a code POC, I would like to take a look.
>>>
>>> Regards,
>>> Christos
>>>
>>> On Fri, Nov 14, 2025 at 11:53 PM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> Also something we discussed off-line: I think the scope of it is quite
>>>> "huge" - but there are small and incremental improvements, that might not
>>>> even require AIP that can be implemented as PRs. I think it's great to
>>>> keep "big hairy vision" in head (like I did several years ago when I
>>>> proposed a "small" improvement in our dependency management that took about
>>>> 4 years to get to the stage I thought it would take a few weeks).
>>>>
>>>> Getting incremental improvements and showing the dedication, merit and
>>>> consistent pattern of improvements is a key to get - eventually - big and
>>>> "world-changing" changes.
>>>>
>>>> J.
>>>>
>>>> On Fri, Nov 14, 2025 at 10:32 PM Jens Scheffler <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Stefan,
>>>>>
>>>>> thanks for dropping the proposals!
>>>>>
>>>>> I'd propose to store the documents in cWiki and open them formally in
>>>>> there as AIP proposal as then it is following the AIP process.
>>>>>
>>>>> See
>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>>>>>
>>>>> Jens
>>>>>
>>>>> On 11/14/25 12:35, Stefan Wang wrote:
>>>>>> Hi Airflow Community,
>>>>>>
>>>>>> I'm excited to share two complementary proposals that address critical
>>>>>> reliability challenges in Airflow, particularly around infrastructure
>>>>>> disruptions and task resilience. These proposals build on insights from
>>>>>> managing one of the larger Airflow deployments (20k+ DAGs, 100k+ daily
>>>>>> task executions per cluster).
>>>>>>
>>>>>> Proposals
>>>>>>
>>>>>> 1. Infrastructure-Aware Task Execution and Context Propagation
>>>>>>    https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M
>>>>>> 2. Resumable Operators for Disruption Readiness
>>>>>>    https://docs.google.com/document/d/1XPbCfuTVhyiq12tFxbyQrX_kQqrDqMo5t7M789MG4GI
>>>>>>
>>>>>> What We're Solving
>>>>>>
>>>>>> Infrastructure failures consume user retries - Pod evictions shouldn't
>>>>>> count against application retry budgets
>>>>>>
>>>>>> Wasted computation - Worker crashes shouldn't restart healthy 3-hour
>>>>>> Databricks jobs from zero
>>>>>>
>>>>>> How
>>>>>>
>>>>>> Execution Context: Distinguish infrastructure vs application failures
>>>>>> for smarter retry handling
>>>>>>
>>>>>> Resumable Operators: Checkpoint and reconnect to external jobs after
>>>>>> disruptions (follows deferral pattern)
>>>>>>
>>>>>> These approaches have significantly improved reliability and user
>>>>>> experience, and reduced wasted costs in our production environment.
>>>>>>
>>>>>> Looking forward to your feedback on both the problems we're addressing
>>>>>> and the proposed solutions. Both proposals are fully backward compatible
>>>>>> and follow existing Airflow patterns.
>>>>>>
>>>>>> Happy to answer any questions or dive deeper into implementation
>>>>>> details.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Stefan Wang
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
