I really like this direction — it’s something I’ve been thinking about as well, 
although from a slightly different angle.
I’ve been considering starting a discussion around making retry behavior more 
dynamic based on runtime context, rather than introducing AI specifically. The 
current model is quite static: we retry blindly based on configuration, without 
considering why the failure happened or what the system state looks like at 
that moment.
What I find compelling in this AIP is the shift toward failure-aware retries. 
That aligns closely with the idea of making DAGs more resilient — not just 
retrying in the hope of eventual success, but making a more informed decision 
based on the nature of the failure.
One thing I’d be interested in exploring further is how far we can push this in 
a deterministic/runtime-driven way (e.g. exception type, response metadata, 
external signals like rate limits or downstream system health), and how that 
compares to or complements the LLM-based approach.
Overall, this feels like a strong step toward decoupling retry logic from 
business logic, which is definitely a gap today.
Very nice proposal, Kaxil; definitely +1 from me.


________________________________
From: Kaxil Naik <[email protected]>
Sent: Saturday, April 18, 2026 02:01
To: [email protected] <[email protected]>
Subject: [DISCUSS] AIP-105: Pluggable Retry Policies


Hi all,

Continuing the push to make Airflow AI-native, I have put together AIP-105:
Pluggable Retry Policies.

Wiki:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
PR (core):
https://github.com/apache/airflow/pull/65450
PR (LLM-powered, common-ai provider):
https://github.com/apache/airflow/pull/65451

The problem is straightforward: Airflow retries every failure the same way.
An expired API key gets retried 3 times over 15 minutes. A rate-limited API
gets retried immediately, hitting the same 429. Users who want smarter
retries today have to wrap every task in try/except and raise
AirflowFailException manually, mixing retry logic into business logic.

This AIP adds a retry_policy parameter to BaseOperator. The policy
evaluates the actual exception at failure time and returns RETRY (with a
custom delay), FAIL (skip remaining retries), or DEFAULT (standard
behaviour). It runs in the worker process, not the scheduler.

Declarative example:

```python
@task(
    retries=5,
    retry_policy=ExceptionRetryPolicy(
        rules=[
            RetryRule(
                exception="requests.exceptions.HTTPError",
                action=RetryAction.RETRY,
                retry_delay=timedelta(minutes=5),
            ),
            RetryRule(
                exception="google.auth.exceptions.RefreshError",
                action=RetryAction.FAIL,
            ),
        ]
    ),
)
def call_api():
    ...
```
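For anyone wondering what a hand-rolled policy might look like against the same contract, here is a minimal sketch -- the class and method names below are my guesses for illustration, not the actual AIP-105 interface:

```python
from enum import Enum

class RetryAction(Enum):
    RETRY = "retry"
    FAIL = "fail"
    DEFAULT = "default"

class FailFastOnAuthErrors:
    """Hypothetical custom policy: skip remaining retries on auth failures.

    Evaluated in the worker at failure time with the real exception in
    hand, mirroring the RETRY / FAIL / DEFAULT contract described above.
    """

    AUTH_ERRORS = ("RefreshError", "PermissionDenied", "Unauthorized")

    def evaluate(self, exception: BaseException) -> RetryAction:
        if type(exception).__name__ in self.AUTH_ERRORS:
            # An expired credential will not fix itself; fail immediately.
            return RetryAction.FAIL
        # Everything else falls back to standard retry behaviour.
        return RetryAction.DEFAULT
```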

LLM-powered example -- uses any pydantic-ai provider (OpenAI, Anthropic,
Bedrock, Ollama):

```python
@task(retries=5, retry_policy=LLMRetryPolicy(llm_conn_id="my_llm"))
def call_flaky_api(): ...
```

The LLM version classifies errors into categories (auth, rate_limit,
network, data, transient, permanent) using structured output with a
30-second timeout and declarative fallback rules for when the LLM itself is
down.
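The fallback mechanics could be as simple as a timeout-guarded call that degrades to the declarative rules. A rough sketch under my own naming -- not the actual common-ai implementation:

```python
import concurrent.futures

# Hypothetical mapping from the six categories in the proposal to actions.
CATEGORY_ACTIONS = {
    "auth": "FAIL", "permanent": "FAIL", "data": "FAIL",
    "rate_limit": "RETRY", "network": "RETRY", "transient": "RETRY",
}

def classify_with_fallback(call_llm, error_text, fallback="DEFAULT", timeout=30.0):
    """Ask the LLM to categorise the error; degrade gracefully if it is down."""
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
            # Bound the classification call so a slow LLM cannot stall the worker.
            category = pool.submit(call_llm, error_text).result(timeout=timeout)
        return CATEGORY_ACTIONS.get(category, fallback)
    except Exception:
        # LLM unreachable, timed out, or returned garbage: declarative fallback.
        return fallback
```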

I have attached demo videos and screenshots to both PRs showing both
policies running end-to-end in Airflow -- including the LLM correctly
classifying 4 different error types via Claude Haiku.

Full design, done criteria, and implementation details are in the wiki page
above.

Feedback welcome.

Thanks,
Kaxil
