[ 
https://issues.apache.org/jira/browse/AIRFLOW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520768#comment-16520768
 ] 

Alejandro Fernandez commented on AIRFLOW-2659:
----------------------------------------------

I came across https://issues.apache.org/jira/browse/AIRFLOW-1620, created by 
[~aoen], which is essentially the same request as this Jira.

> Improve Robustness of Operators in Airflow during Infra Outages
> ---------------------------------------------------------------
>
>                 Key: AIRFLOW-2659
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2659
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: hooks
>    Affects Versions: 1.10
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>            Priority: Major
>         Attachments: AIRFLOW_2659.pdf, retry_window.png, test_rules.py
>
>
> *Problem:*
>  If an infrastructure outage occurs on the Hadoop cluster, tasks that rely on 
> those services will fail in Airflow, thereby causing SLA misses and 
> deteriorating user confidence in Airflow (even if the outage was in another 
> system). Only a fraction of tasks and DAGs have retries around certain 
> operators/hooks and the retry attempts are not sufficient during an outage.
> *Goal:* Automatically retry failures that occur due to infrastructure issues.
> *High-level design:*
>  * Retry decorator in the Hooks for easy annotation
>  * Retry logic will be time-based (initial delay, max delay time, retry 
> window, etc.)
>  * Allow each Hook to determine the root-cause of the error (user, infra 
> outage)
>  ** User-errors will be handled the way they are today.
>  ** Infra-errors will be able to retry for extended periods of time.
>  * Configurable (optional, configurable per Hook)
>  * Emit metrics using StatsD
> See the attached design doc [^AIRFLOW_2659.pdf].
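
To make the high-level design quoted above more concrete, here is a minimal 
sketch of what a time-based retry decorator for a Hook method might look like. 
It is not taken from the attached design doc: the names InfraError, 
retry_on_infra_error, is_infra_error and emit, and the metric names, are all 
hypothetical. The emit callback stands in for a StatsD counter increment, and 
the delay is assumed to double on every attempt up to max_delay.

```python
import functools
import time


class InfraError(Exception):
    """Hypothetical marker for failures caused by an infrastructure outage."""


def retry_on_infra_error(initial_delay=1.0, max_delay=60.0, retry_window=600.0,
                         is_infra_error=lambda exc: isinstance(exc, InfraError),
                         emit=lambda metric: None,
                         sleep=time.sleep, clock=time.monotonic):
    """Retry the wrapped call while its failures look infrastructure-related.

    All parameters are illustrative assumptions:
      initial_delay  -- seconds to wait before the first retry
      max_delay      -- upper bound on the (doubling) delay between retries
      retry_window   -- total seconds during which retries are allowed
      is_infra_error -- per-Hook classifier: True for infra errors, False for user errors
      emit           -- metrics callback, e.g. a thin wrapper around a StatsD counter
      sleep, clock   -- injectable so the logic can be tested without waiting
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = clock() + retry_window
            delay = initial_delay
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # User errors propagate immediately, as they do today.
                    if not is_infra_error(exc):
                        raise
                    # Give up once the next sleep would overrun the retry window.
                    if clock() + delay > deadline:
                        emit("hook.retry.exhausted")
                        raise
                    emit("hook.retry.attempt")
                    sleep(delay)
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator


# Hypothetical usage on a Hook-like function: retry for up to an hour,
# starting at 5 seconds and never waiting more than 5 minutes between attempts.
@retry_on_infra_error(initial_delay=5, max_delay=300, retry_window=3600)
def run_query(sql):
    ...
```

Keeping the classifier as a per-decorator argument matches the "configurable 
per Hook" bullet: each Hook can decide for itself which exceptions count as an 
infra outage, and anything else fails the same way it does now.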



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
