I really like this direction — it's something I've been thinking about as well, although from a slightly different angle. I've been considering starting a discussion around making retry behavior more dynamic based on runtime context, rather than introducing AI specifically. The current model is quite static: we retry blindly based on configuration, without considering why the failure happened or what the system state looks like at that moment.

What I find compelling in this AIP is the shift toward failure-aware retries. That aligns closely with the idea of making DAGs more resilient — not just retrying in the hope of eventual success, but making a more informed decision based on the nature of the failure.

One thing I'd be interested in exploring further is how far we can push this in a deterministic, runtime-driven way (e.g. exception type, response metadata, external signals like rate limits or downstream system health), and how that compares to or complements the LLM-based approach.

Overall, this feels like a strong step toward decoupling retry logic from business logic, which is definitely a gap today. Very nice proposal Kaxil, so definitely +1 from me.
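To make the deterministic angle concrete, here is a rough sketch of what a purely runtime-driven decision could look like, using only the exception and any attached HTTP response metadata. All names here are illustrative and are not part of the AIP or any existing Airflow API:

```python
from datetime import timedelta

# Hypothetical decision values mirroring the RETRY / FAIL / DEFAULT actions
# described in the AIP; the real enum lives in the proposal, not here.
RETRY, FAIL, DEFAULT = "retry", "fail", "default"

def decide(exc: BaseException):
    """Map a failure to a retry decision using only runtime context."""
    # Many HTTP client exceptions carry a `response` attribute; inspect it
    # defensively so non-HTTP errors fall through cleanly.
    status = getattr(getattr(exc, "response", None), "status_code", None)
    if status == 429:
        # Honour the server's Retry-After header when present.
        retry_after = exc.response.headers.get("Retry-After", "60")
        return (RETRY, timedelta(seconds=int(retry_after)))
    if status in (401, 403):
        return (FAIL, None)  # auth errors will not heal by retrying
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return (RETRY, timedelta(minutes=1))  # likely transient
    return (DEFAULT, None)  # fall back to the configured behaviour
```

No model call is needed for any of these cases, which is why I think a deterministic layer like this could be a useful first tier even when an LLM policy is configured.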
________________________________
From: Kaxil Naik <[email protected]>
Sent: Saturday, April 18, 2026 02:01
To: [email protected] <[email protected]>
Subject: [DISCUSS] AIP-105: Pluggable Retry Policies

Hi all,

Continuing the push to make Airflow AI-native, I have put together AIP-105: Pluggable Retry Policies.

Wiki: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
PR (core): https://github.com/apache/airflow/pull/65450
PR (LLM-powered, common-ai provider): https://github.com/apache/airflow/pull/65451

The problem is straightforward: Airflow retries every failure the same way. An expired API key gets retried 3 times over 15 minutes. A rate-limited API gets retried immediately, hitting the same 429. Users who want smarter retries today have to wrap every task in try/except and raise AirflowFailException manually, mixing retry logic into business logic.

This AIP adds a retry_policy parameter to BaseOperator. The policy evaluates the actual exception at failure time and returns RETRY (with a custom delay), FAIL (skip remaining retries), or DEFAULT (standard behaviour). It runs in the worker process, not the scheduler.

Declarative example:

```python
@task(
    retries=5,
    retry_policy=ExceptionRetryPolicy(
        rules=[
            RetryRule(
                exception="requests.exceptions.HTTPError",
                action=RetryAction.RETRY,
                retry_delay=timedelta(minutes=5),
            ),
            RetryRule(
                exception="google.auth.exceptions.RefreshError",
                action=RetryAction.FAIL,
            ),
        ]
    ),
)
def call_api(): ...
```

LLM-powered example -- uses any pydantic-ai provider (OpenAI, Anthropic, Bedrock, Ollama):

```python
@task(retries=5, retry_policy=LLMRetryPolicy(llm_conn_id="my_llm"))
def call_flaky_api(): ...
```

The LLM version classifies errors into categories (auth, rate_limit, network, data, transient, permanent) using structured output with a 30-second timeout and declarative fallback rules for when the LLM itself is down.
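For readers skimming the thread, the contract described above can be sketched roughly as follows. The class and method names are illustrative only; the actual interface is defined in the wiki page and the core PR:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional

# Illustrative sketch of the pluggable policy contract (names assumed,
# not taken from the PR): a policy sees the real exception at failure
# time and returns one of three actions.

class RetryAction(Enum):
    RETRY = "retry"      # retry, optionally with a custom delay
    FAIL = "fail"        # skip remaining retries and fail the task
    DEFAULT = "default"  # fall back to the operator's standard behaviour

@dataclass
class RetryDecision:
    action: RetryAction
    retry_delay: Optional[timedelta] = None

class BaseRetryPolicy(ABC):
    @abstractmethod
    def evaluate(self, exc: BaseException) -> RetryDecision:
        """Called in the worker process after a task failure."""

class RateLimitAwarePolicy(BaseRetryPolicy):
    """Example: back off on rate limits, fail fast on auth errors."""

    def evaluate(self, exc: BaseException) -> RetryDecision:
        name = type(exc).__name__
        if "RateLimit" in name:
            return RetryDecision(RetryAction.RETRY, timedelta(minutes=10))
        if "Auth" in name or "Refresh" in name:
            return RetryDecision(RetryAction.FAIL)
        return RetryDecision(RetryAction.DEFAULT)
```

Because the policy runs in the worker, it has access to the live exception object rather than a serialized traceback, which is what makes both the declarative and the LLM-backed variants possible behind the same interface.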
I have attached demo videos and screenshots to both PRs showing both policies running end-to-end in Airflow -- including the LLM correctly classifying 4 different error types via Claude Haiku. Full design, done criteria, and implementation details are in the wiki page above.

Feedback welcome.

Thanks,
Kaxil
