[ https://issues.apache.org/jira/browse/AIRFLOW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520768#comment-16520768 ]
Alejandro Fernandez commented on AIRFLOW-2659:
----------------------------------------------

I came across https://issues.apache.org/jira/browse/AIRFLOW-1620, which was created by [~aoen], and is essentially the same ask as this Jira.

> Improve Robustness of Operators in Airflow during Infra Outages
> ---------------------------------------------------------------
>
>                 Key: AIRFLOW-2659
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2659
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: hooks
>    Affects Versions: 1.10
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>            Priority: Major
>         Attachments: AIRFLOW_2659.pdf, retry_window.png, test_rules.py
>
>
> *Problem:*
> If an infrastructure outage occurs on the Hadoop cluster, tasks that rely on
> those services will fail in Airflow, thereby causing SLA misses and
> deteriorating user confidence in Airflow (even if the outage was in another
> system). Only a fraction of tasks and DAGs have retries around certain
> operators/hooks and the retry attempts are not sufficient during an outage.
> *Goal:* Automatically retry failures that occur due to infrastructure issues.
> *High-level design:*
> * Retry decorator in the Hooks for easy annotation
> * Retry logic will be time-based (initial delay, max delay time, retry
> window, etc.)
> * Allow each Hook to determine the root-cause of the error (user, infra
> outage)
> ** User-errors will be handled the way they are today.
> ** Infra-errors will be able to retry for extended periods of time.
> * Configurable (optional, configurable per Hook)
> * Emit metrics using StatsD
> See attached [^AIRFLOW_2659.pdf]design doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
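
For illustration only, the time-based retry decorator described in the high-level design could look roughly like the sketch below. It is not taken from the attached design doc or from Airflow itself: the names InfraError, retry_on_infra_error and FakeHook are hypothetical, the doubling of the delay between attempts is an assumed backoff policy, and StatsD metrics are left out. The sleep and clock parameters exist only so the behaviour can be exercised without real waiting.

```python
import functools
import time


class InfraError(Exception):
    """Hypothetical marker for failures caused by an infrastructure outage."""


def retry_on_infra_error(initial_delay=1.0, max_delay=60.0, retry_window=600.0,
                         is_infra_error=lambda exc: isinstance(exc, InfraError),
                         sleep=time.sleep, clock=time.monotonic):
    """Retry the wrapped call while failures look like infra errors.

    initial_delay: seconds to wait before the first retry.
    max_delay:     upper bound on the wait between two attempts.
    retry_window:  total seconds during which retries are allowed.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = clock() + retry_window
            delay = initial_delay
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # User errors propagate immediately, as they do today.
                    if not is_infra_error(exc):
                        raise
                    # Give up if the next wait would end after the window.
                    if clock() + delay > deadline:
                        raise
                    sleep(delay)
                    # Assumed policy: double the delay, capped at max_delay.
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator


class FakeHook:
    """Stand-in for a Hook whose backend is down for the first few calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    @retry_on_infra_error(initial_delay=0.01, max_delay=0.04, retry_window=1.0)
    def run(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise InfraError("cluster unavailable")
        return "ok"
```

In this sketch each Hook would decide what counts as an infra error by passing its own is_infra_error predicate, which is one way to read the "configurable per Hook" bullet; any exception the predicate rejects is re-raised on the first attempt.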