[ https://issues.apache.org/jira/browse/AIRFLOW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Fernandez updated AIRFLOW-2659:
-----------------------------------------
    Attachment: test_rules.py

> Improving Robustness of Operators in Airflow during Infra Outages
> -----------------------------------------------------------------
>
>                 Key: AIRFLOW-2659
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2659
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: hooks
>    Affects Versions: 1.10
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>            Priority: Major
>         Attachments: AIRFLOW_2659.pdf, test_rules.py
>
>
> *Problem:*
> If an infrastructure outage occurs on the Hadoop cluster, Airflow tasks that 
> rely on the affected services fail, causing SLA misses and eroding user 
> confidence in Airflow (even though the outage was in another system). Only a 
> fraction of tasks and DAGs have retries around certain operators/hooks, and 
> those retry attempts are not sufficient to outlast an outage.
> *Goal:* Automatically retry failures that occur due to infrastructure issues.
> *High-level design:*
> * Retry decorator in the Hooks for easy annotation
> * Retry logic will be time-based (initial delay, max delay time, retry 
> window, etc.)
> * Allow each Hook to determine the root cause of the error (user error or 
> infra outage)
> ** User errors will be handled the way they are today.
> ** Infra errors will be retried for extended periods of time.
> * Configurable (feature toggle, configurable per Hook)
> * Emit metrics using StatsD
> See attached design doc.
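
To make the high-level design above more concrete, here is a minimal sketch of what a time-based retry decorator could look like. It is not taken from the attached design doc or from Airflow itself: the names `InfraError`, `retry_on_infra_error`, `is_infra_error`, `on_retry`, and all parameter defaults are hypothetical and chosen only for illustration. The `on_retry` callback stands in for metric emission (for example, it could forward to a StatsD counter), and `enabled` stands in for the feature toggle.

```python
import functools
import time


class InfraError(Exception):
    """Hypothetical marker for failures caused by an infrastructure outage."""


def retry_on_infra_error(initial_delay=1.0, max_delay=60.0, retry_window=3600.0,
                         enabled=True, is_infra_error=None, on_retry=None,
                         sleep=time.sleep, clock=time.monotonic):
    """Retry the wrapped call with exponential backoff until retry_window elapses.

    All names and defaults here are assumptions made for this sketch.
    """
    if is_infra_error is None:
        # Default classification: only InfraError counts as an infra outage.
        def is_infra_error(exc):
            return isinstance(exc, InfraError)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not enabled:
                # Feature toggle off: behave exactly like the undecorated call.
                return func(*args, **kwargs)
            deadline = clock() + retry_window
            delay = initial_delay
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # User errors propagate immediately; infra errors are
                    # retried until the next sleep would pass the deadline.
                    if not is_infra_error(exc) or clock() + delay > deadline:
                        raise
                    if on_retry is not None:
                        on_retry(func.__name__)  # e.g. increment a metric
                    sleep(delay)
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator
```

A Hook method would then be annotated with something like `@retry_on_infra_error(initial_delay=5, max_delay=300, retry_window=4 * 3600)`, with each Hook supplying its own `is_infra_error` function to separate user errors from outages. Injecting `sleep` and `clock` is a design choice that keeps the retry logic testable without real waiting.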



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
