[ https://issues.apache.org/jira/browse/AIRFLOW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520135#comment-16520135 ]
Ash Berlin-Taylor commented on AIRFLOW-2659:
--------------------------------------------

Tasks/operators in Airflow can already be configured to retry. Where does that fall short in a way this proposal addresses? I'm initially wary of having _two_ levels of retry in Airflow, as it sounds like a recipe for confusion.

> Improving Robustness of Operators in Airflow during Infra Outages
> -----------------------------------------------------------------
>
>                 Key: AIRFLOW-2659
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2659
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: hooks
>    Affects Versions: 1.10
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>            Priority: Major
>         Attachments: AIRFLOW_2659.pdf, test_rules.py
>
> *Problem:*
> If an infrastructure outage occurs on the Hadoop cluster, tasks that rely on those services will fail in Airflow, causing SLA misses and deteriorating user confidence in Airflow (even if the outage was in another system). Only a fraction of tasks and DAGs have retries around certain operators/hooks, and the retry attempts are not sufficient during an outage.
>
> *Goal:* Automatically retry failures that occur due to infrastructure issues.
>
> *High-level design:*
> * Retry decorator in the Hooks for easy annotation
> * Retry logic will be time-based (initial delay, max delay time, retry window, etc.)
> * Allow each Hook to determine the root cause of the error (user error, infra outage)
> ** User errors will be handled the way they are today.
> ** Infra errors will be able to retry for extended periods of time.
> * Configurable (feature toggle, configurable per Hook)
> * Emit metrics using StatsD
>
> See the attached [^AIRFLOW_2659.pdf] design doc.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
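To make the proposed design concrete, here is a minimal sketch of what a time-based retry decorator for hooks could look like. All names here (`InfraError`, `retry_on_infra_error`) are hypothetical illustrations, not part of Airflow's API; the exception-based user/infra split is one possible way to implement the root-cause distinction the ticket proposes, and StatsD emission is omitted for brevity.

```python
import functools
import time


class InfraError(Exception):
    """Hypothetical marker a hook could raise when it attributes a failure
    to an infrastructure outage rather than a user error."""


def retry_on_infra_error(initial_delay=1.0, max_delay=60.0, retry_window=300.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Time-based retry: exponential backoff with a capped delay, bounded by
    a total retry window (all in seconds).

    Only InfraError is retried; any other exception (a user error) propagates
    immediately, matching the user/infra split proposed in the ticket. The
    `sleep` and `clock` parameters exist so tests can inject fakes.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = clock()
            delay = initial_delay
            while True:
                try:
                    return func(*args, **kwargs)
                except InfraError:
                    # Give up once the next sleep would exceed the window.
                    if clock() - start + delay > retry_window:
                        raise
                    sleep(delay)
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator
```

A hook method annotated with `@retry_on_infra_error(retry_window=3600)` would then keep retrying transient outages for up to an hour, while still failing fast on user errors such as a bad query.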