Thanks Kaxil for the detailed clarification — this breakdown helped a lot. 

And thanks Jens and Natanel as well for raising the convergence questions; the 
discussion has made the boundaries much clearer.

I checked both carefully again, and I agree with Kaxil's framing that the 
worker-alive vs. worker-dead split is the right primary distinction. AIP-105 
handling exceptions that occur within a live worker, and AIP-97 handling 
failures where the worker itself is lost, feels like a clean and technically 
meaningful separation.

I can understand where some of the confusion comes from, since from a user 
perspective both AIPs ultimately influence retry behavior. Also, as noted, some 
infrastructure-related failures (like DB unavailability or DNS resolution 
issues) do surface as exceptions, which can make the boundary feel less obvious 
at first glance. But as long as the execution context is clear — whether the 
worker is still alive and able to evaluate policy — the separation seems well 
justified.
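
For concreteness, here is a rough sketch of how that boundary plays out with
the proposed interface. I am reusing ExceptionRetryPolicy / RetryRule /
RetryAction from the AIP; the import path below is my guess and may differ in
the final PR:

```python
from datetime import timedelta

from airflow.sdk import task

# Guessed import path -- the AIP defines these names, not their final location.
from airflow.sdk.retries import ExceptionRetryPolicy, RetryAction, RetryRule


@task(
    retries=5,
    retry_policy=ExceptionRetryPolicy(
        rules=[
            # An infra-flavoured failure that still surfaces as an in-process
            # exception: the worker is alive, so the AIP-105 policy can see it.
            RetryRule(
                exception="sqlalchemy.exc.OperationalError",  # e.g. DB unavailable
                action=RetryAction.RETRY,
                retry_delay=timedelta(minutes=2),
            ),
        ]
    ),
)
def load_data():
    ...
    # If the pod is evicted or the worker is OOM-killed mid-run, this policy
    # never executes -- the scheduler-side retry (AIP-97's domain) takes over.
```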

So from my side, I agree that AIP-105 should move forward independently, unless 
we identify a concrete technical overlap that would materially benefit from 
convergence rather than just conceptual similarity.

Thanks again for the discussion; I learned a lot!

— Stefan


> On Apr 22, 2026, at 6:45 AM, Natanel <[email protected]> wrote:
> 
> Hello Kaxil, this seems like a very interesting proposal.
> 
> I think I can understand where the confusion about the similarity of AIP-97
> and AIP-105 may come from, as both of them retry on specific conditions, and
> it might even be possible to have the same code do both: in the end, both
> decide when to retry according to some predefined condition.
> The AIP-97 document does not only talk about events like pod eviction and
> such (which should probably be handled by the executor and rerun
> automatically, as some already are); it also describes infra errors which DO
> throw an exception, such as DB unavailable, DNS resolution errors and more
> (not only executor-related errors), and those look exactly like what you
> proposed.
> 
> I think I agree with Jens: I see both AIPs coming together rather than
> being totally separate from each other, as they have some overlap.
> 
> On another note, it would be useful to have an option to slightly change or
> modify the task being retried (a kind of smart retry). For example, if I
> have a Spark app running from Airflow and it gets an OOM, I would like to
> retry the same job with more resources; and if a query times out, I would
> run it again on less data / fewer partitions, or even with a smaller limit /
> higher timeout.
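> 
> A purely hypothetical sketch of what I mean (the adjust hook and the mutable
> task handle below do not exist today; all names are invented):
> 
> ```python
> # Invented "smart retry" hook: inspect the failure and tweak the next
> # attempt. Nothing like this exists in AIP-105 yet.
> def adjust_next_attempt(task, exc):
>     if "OutOfMemoryError" in str(exc):
>         # Retry the same Spark job with more executor memory.
>         task.conf["spark.executor.memory"] = "8g"
>     elif isinstance(exc, TimeoutError):
>         # Retry the query on fewer partitions, with a higher timeout.
>         task.params["partitions"] = task.params["partitions"] // 2
>         task.params["timeout_s"] = task.params["timeout_s"] * 2
> ```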
> 
> Do you think this is possible within the scope of this change, or should it
> be a separate AIP?
> 
> If so, I would happily create one for it.
> 
> I have seen some LLM-based implementations, yet I think a programmatic
> approach is also needed for this case, as our clients often have to modify
> their workflow for a single run because specific dates bring more data than
> others.
> 
> Thanks,
> Natanel.
> 
> On Wed, 22 Apr 2026 at 04:53, Kaxil Naik <[email protected]> wrote:
> 
>> Thanks Stefan, and thanks Jens for the follow-up question.
>> 
>> On the AIP-96/97 convergence question you both raised:
>> 
>> I looked at AIP-97 carefully. AIP-105 and AIP-97 cover different failure
>> domains and don't block each other.
>> 
>> AIP-105's RetryPolicy runs in the worker process, after the exception is
>> caught in the task's try/except. It handles failures that manifest as
>> Python exceptions: rate limits, auth errors, connection timeouts,
>> transient DB errors. The worker is alive, the exception object is
>> available, and the policy can inspect it.
>> 
>> To Jens's question directly: when the worker dies (segfault, pod eviction,
>> OOM kill, heartbeat loss), AIP-105's policy never runs. The worker process
>> is gone. In that case, the existing scheduler-based retry kicks
>> in -- exactly as it does today. That's AIP-97's territory: the executor or
>> scheduler detects the failure externally and manages a separate
>> infrastructure retry budget.
>> 
>> So the split is:
>> - Application failures (rate limits, auth errors, data validation) raise
>> Python exceptions in user code -- AIP-105 handles these
>> - Infrastructure failures (pod eviction, OOM kill, worker heartbeat loss)
>> kill the worker process before any exception is caught -- AIP-97 handles
>> these, and since it touches the Scheduler / Executor, it is more involved.
>> 
>> They're parallel tracks with separate execution paths.
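>> 
>> To make the execution-path point concrete, here is a simplified sketch of
>> the worker-side flow (not the actual task-runner code; evaluate()'s exact
>> signature may differ):
>> 
>> ```python
>> # Simplified sketch; not the real task runner.
>> try:
>>     task.execute(context)
>> except Exception as exc:
>>     # Worker alive: the live exception object is available to inspect.
>>     decision = task.retry_policy.evaluate(exc)
>>     # decision is RETRY (with an optional custom delay), FAIL, or DEFAULT.
>> # Segfault / OOM kill / eviction: nothing above runs. The scheduler
>> # detects the lost worker and applies the existing retry path, which is
>> # what AIP-97 extends with a separate infrastructure retry budget.
>> ```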
>> 
>> Thanks,
>> Kaxil
>> 
>> On Tue, 21 Apr 2026 at 21:22, Jens Scheffler <[email protected]> wrote:
>> 
>>> Good point from Stefan - I had also commented on the relation to AIP-97,
>>> which I would love to see AIP-105 converge with.
>>> 
>>> In this light, what would actually be the intent of the retry policy if
>>> the worker "dies" in a segfault or loses its heartbeat? Would the standard
>>> / existing scheduler-based retry kick in then?
>>> 
>>> Jens
>>> 
>>> On 21.04.26 02:19, Stefan Wang wrote:
>>>> Thanks Kaxil,
>>>> huge +1
>>>> 
>>>> This feels like a meaningful step forward.
>>>> 
>>>> Giving users a way to express retry intent and putting the policy on the
>>>> operator is something we've needed for a while. The current options
>>>> aren't great: wrap everything in try/except and raise
>>>> AirflowFailException, or live with retries=3 as a blunt instrument.
>>>> Both are compromises.
>>>> 
>>>> A few things that stand out in the design:
>>>> 
>>>> 1. I think evaluating on the worker is the right call. Exceptions don't
>>>> serialize cleanly across process boundaries, and keeping the decision
>>>> close to where the exception actually happens avoids a lot of
>>>> indirection. The scheduler-side version would be simpler to ship but
>>>> harder to use.
>>>> 
>>>> 2. The flat rule list is easier to reason about and validate at parse
>>>> time than a nested structure would be. Elad's suggestion to let one rule
>>>> match multiple exception types would tighten the common case without
>>>> losing that.
>>>> 
>>>> A couple of thoughts that came up while reading:
>>>> 
>>>> 1. On Paweł's testing point: if policy.evaluate() is just a method you
>>>> can call with a synthetic exception, DAG authors can cover a lot of
>>>> ground in unit tests. Not the same as validating in production, but it
>>>> catches a decent amount before deploy.
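>>>> 
>>>> Something like this, for example (a hypothetical test; I'm assuming
>>>> evaluate() takes the exception and returns a decision with an action
>>>> field -- the final shape in the PR may differ):
>>>> 
>>>> ```python
>>>> import requests
>>>> 
>>>> def test_http_error_is_retried():
>>>>     # The same policy object the DAG would use, exercised with a
>>>>     # synthetic exception -- no worker or scheduler involved.
>>>>     policy = ExceptionRetryPolicy(
>>>>         rules=[
>>>>             RetryRule(
>>>>                 exception="requests.exceptions.HTTPError",
>>>>                 action=RetryAction.RETRY,
>>>>             ),
>>>>         ]
>>>>     )
>>>>     decision = policy.evaluate(requests.exceptions.HTTPError("429"))
>>>>     assert decision.action == RetryAction.RETRY
>>>> ```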
>>>> 
>>>> 2. On retry budgets (the separate infra retry budget) more broadly:
>>>> retries=N today can get consumed by worker evictions or heartbeat losses
>>>> before any retry policy ever runs. Pluggable policies will feel sharper
>>>> once the user-visible budget actually reflects user-domain failures. I
>>>> also have two drafts touching this area, AIP-96 (Resumable Operators)
>>>> and AIP-97 (Execution Context + separate infra retry budget), and will
>>>> post updates on both soon. Open to converging where it makes sense.
>>>> For what it's worth, we've been running two related pieces in
>>>> production at LinkedIn. One is a mixin that preserves external jobs
>>>> (Spark, Flink, and similar) when the worker gets disrupted instead of
>>>> cancelling them. The other is a separate infrastructure retry budget set
>>>> generously enough that infrastructure events don't eat into user-visible
>>>> retries. I can share anonymized failure-category data from both if it
>>>> would help ground the default rule library.
>>>> 
>>>> Looking forward to v2.
>>>> 
>>>> — Stefan
>>>> 
>>>>> On Apr 20, 2026, at 1:50 PM, Przemysław Mirowski <[email protected]>
>>>>> wrote:
>>>>> 
>>>>> Great idea! Thanks for proposing it. It will make proper
>>>>> exception-retry handling much easier than before and will open the door
>>>>> to more extensibility too.
>>>>> 
>>>>> +1 also to the questions/concerns which Elad mentioned. I'm not sure
>>>>> though about the changes to Priority Weight (maybe part of AIP-100). As
>>>>> for point 2, about not having full control over the exception raised:
>>>>> looking at the Airflow ecosystem, with all of the providers using
>>>>> different libraries, I think it is something we should consider.
>>>>> 
>>>>> One additional comment - as the Retry Policies will only run on
>>>>> workers (which is pretty nice from e.g. a security point of view), I
>>>>> didn't see in the AIP and PR a way to validate whether a configured
>>>>> Retry Policy will work before the moment it is actually needed. That
>>>>> can make setting up Retry Policies harder, and testing them will be
>>>>> cumbersome. I think that having a nice way (from the Dag Author's
>>>>> perspective) to test whether a defined Retry Policy will actually work
>>>>> when it is really needed would make Dag Authors' lives much easier and
>>>>> defining these rules much simpler (something connected to that could be
>>>>> testing Airflow Connections and the work on moving "Test Connection" to
>>>>> workers). Of course, LLM-based Retry Policies are rather out of scope
>>>>> here, but testing the more deterministic behaviours should be much
>>>>> easier to do.
>>>>> 
>>>>> ________________________________
>>>>> From: Vincent Beck <[email protected]>
>>>>> Sent: 20 April 2026 15:17
>>>>> To: [email protected] <[email protected]>
>>>>> Subject: Re: [DISCUSS] AIP-105: Pluggable Retry Policies
>>>>> 
>>>>> Makes a lot of sense to me!
>>>>> 
>>>>> On 2026/04/19 13:56:56 Elad Kalif wrote:
>>>>>> Great idea!
>>>>>> Love it!
>>>>>> 
>>>>>> I have some questions / comments:
>>>>>> 1. The current interface suggests rules that contain a RetryRule
>>>>>> object, but I wonder if we should change exception to exceptions and
>>>>>> accept a list:
>>>>>> 
>>>>>>     rules=[
>>>>>>         RetryRule(
>>>>>>             exceptions=[
>>>>>>                 "requests.exceptions.HTTPError",
>>>>>>                 "google.auth.exceptions.RefreshError",
>>>>>>             ],
>>>>>>             ...,
>>>>>>         ),
>>>>>>     ]
>>>>>> 
>>>>>> I'm thinking about a case where several exceptions need the same
>>>>>> behaviour and the user may not wish to define different handling for
>>>>>> each.
>>>>>> 
>>>>>> 2. Does it make sense to extend the interface to XCom values? I'm
>>>>>> thinking about a case where dag authors don't have full control over
>>>>>> the exception raised, or some upstream library changes the exception,
>>>>>> which breaks the retry logic. Maybe we should also offer the option to
>>>>>> decide on retry based on the previous attempt's XCom value?
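>>>>>> 
>>>>>> Hypothetically, something like this (the xcom_condition field is
>>>>>> invented, just to illustrate the idea):
>>>>>> 
>>>>>>     RetryRule(
>>>>>>         # Invented field: decide based on what the previous attempt
>>>>>>         # pushed to XCom, instead of on the exception type.
>>>>>>         xcom_condition=lambda xcom: xcom.get("error_kind") == "transient",
>>>>>>         action=RetryAction.RETRY,
>>>>>>     )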
>>>>>> 
>>>>>> 3. Maybe something for the longer run but still worth discussing -
>>>>>> one of the main motivations for custom weight rules
>>>>>> https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/priority-weight.html#custom-weight-rule
>>>>>> was to set priority based on the try number. I wonder if we may want
>>>>>> to somehow combine it with the retry rule. For retries, I can argue
>>>>>> that the weight of the task is a property of the retry instructions,
>>>>>> and it may very well be that the weight should change depending on the
>>>>>> exception.
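>>>>>> 
>>>>>> For example (the priority_weight field on a rule is invented here,
>>>>>> just to illustrate):
>>>>>> 
>>>>>>     RetryRule(
>>>>>>         exception="requests.exceptions.HTTPError",
>>>>>>         action=RetryAction.RETRY,
>>>>>>         # Invented: deprioritise the task while it is being
>>>>>>         # rate-limited, so healthy tasks get slots first.
>>>>>>         priority_weight=1,
>>>>>>     )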
>>>>>> 
>>>>>> On Sun, Apr 19, 2026 at 6:30 AM Shahar Epstein <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Great idea! I liked both the deterministic approach as well as the
>>>>>>> AI-integrated one.
>>>>>>> 
>>>>>>> 
>>>>>>> Shahar
>>>>>>> 
>>>>>>> On Sat, Apr 18, 2026 at 3:02 AM Kaxil Naik <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> Continuing the push to make Airflow AI-native, I have put together
>>>>>>>> AIP-105: Pluggable Retry Policies.
>>>>>>>> 
>>>>>>>> Wiki:
>>>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
>>>>>>>> PR (core): https://github.com/apache/airflow/pull/65450
>>>>>>>> PR (LLM-powered, common-ai provider):
>>>>>>>> https://github.com/apache/airflow/pull/65451
>>>>>>>> 
>>>>>>>> The problem is straightforward: Airflow retries every failure the
>>>>>>>> same way. An expired API key gets retried 3 times over 15 minutes. A
>>>>>>>> rate-limited API gets retried immediately, hitting the same 429.
>>>>>>>> Users who want smarter retries today have to wrap every task in
>>>>>>>> try/except and raise AirflowFailException manually, mixing retry
>>>>>>>> logic into business logic.
>>>>>>>> 
>>>>>>>> This AIP adds a retry_policy parameter to BaseOperator. The policy
>>>>>>>> evaluates the actual exception at failure time and returns RETRY
>>>>>>>> (with a custom delay), FAIL (skip remaining retries), or DEFAULT
>>>>>>>> (standard behaviour). It runs in the worker process, not the
>>>>>>>> scheduler.
>>>>>>>> 
>>>>>>>> Declarative example:
>>>>>>>> 
>>>>>>>> ```python
>>>>>>>> @task(
>>>>>>>>     retries=5,
>>>>>>>>     retry_policy=ExceptionRetryPolicy(
>>>>>>>>         rules=[
>>>>>>>>             RetryRule(
>>>>>>>>                 exception="requests.exceptions.HTTPError",
>>>>>>>>                 action=RetryAction.RETRY,
>>>>>>>>                 retry_delay=timedelta(minutes=5),
>>>>>>>>             ),
>>>>>>>>             RetryRule(
>>>>>>>>                 exception="google.auth.exceptions.RefreshError",
>>>>>>>>                 action=RetryAction.FAIL,
>>>>>>>>             ),
>>>>>>>>         ]
>>>>>>>>     ),
>>>>>>>> )
>>>>>>>> def call_api():
>>>>>>>>     ...
>>>>>>>> ```
>>>>>>>> 
>>>>>>>> LLM-powered example -- uses any pydantic-ai provider (OpenAI,
>>>>>>>> Anthropic, Bedrock, Ollama):
>>>>>>>> 
>>>>>>>>    @task(retries=5, retry_policy=LLMRetryPolicy(llm_conn_id="my_llm"))
>>>>>>>>    def call_flaky_api(): ...
>>>>>>>> 
>>>>>>>> The LLM version classifies errors into categories (auth,
>>>>>>>> rate_limit, network, data, transient, permanent) using structured
>>>>>>>> output with a 30-second timeout and declarative fallback rules for
>>>>>>>> when the LLM itself is down.
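>>>>>>>> 
>>>>>>>> Roughly like this (a sketch only -- the fallback parameter name
>>>>>>>> here is indicative, the PR has the exact interface):
>>>>>>>> 
>>>>>>>>    LLMRetryPolicy(
>>>>>>>>        llm_conn_id="my_llm",
>>>>>>>>        # Used when the LLM call itself fails or times out.
>>>>>>>>        fallback_rules=[
>>>>>>>>            RetryRule(
>>>>>>>>                exception="requests.exceptions.HTTPError",
>>>>>>>>                action=RetryAction.RETRY,
>>>>>>>>            ),
>>>>>>>>        ],
>>>>>>>>    )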
>>>>>>>> 
>>>>>>>> I have attached demo videos and screenshots to both PRs showing
>>>>>>>> both policies running end-to-end in Airflow -- including the LLM
>>>>>>>> correctly classifying 4 different error types via Claude Haiku.
>>>>>>>> 
>>>>>>>> Full design, done criteria, and implementation details are in the
>>>>>>>> wiki page above.
>>>>>>>> 
>>>>>>>> Feedback welcome.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Kaxil
>>>>>>>> 
