[ 
https://issues.apache.org/jira/browse/AIRFLOW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520048#comment-16520048
 ] 

Alejandro Fernandez commented on AIRFLOW-2659:
----------------------------------------------

Hi [~maxime.beauche...@apache.org], [~bolke], [~saguziel], [~yrqls21],

I'm proposing this feature and would greatly appreciate feedback on the design 
doc. Happy to also share it on the dev mailing list.

Cheers,
Alejandro

> Improving Robustness of Operators in Airflow during Infra Outages
> -----------------------------------------------------------------
>
>                 Key: AIRFLOW-2659
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2659
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: hooks
>    Affects Versions: 1.10
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>            Priority: Major
>         Attachments: AIRFLOW_2659.pdf, test_rules.py
>
>
> *Problem:*
>  When an infrastructure outage hits the Hadoop cluster, Airflow tasks that 
> depend on the affected services fail. This causes SLA misses and erodes 
> user confidence in Airflow, even though the outage originated in another 
> system. Only a fraction of tasks and DAGs configure retries around the 
> relevant operators/hooks, and those retry attempts are too few to outlast 
> an outage.
> *Goal:* Automatically retry failures that occur due to infrastructure issues.
> *High-level design:*
>  * Retry decorator in the Hooks for easy annotation
>  * Retry logic will be time-based (initial delay, max delay time, retry 
> window, etc.)
>  * Allow each Hook to determine the root cause of the error (user error or 
> infra outage)
>  ** User errors will be handled the way they are today.
>  ** Infra errors will be retried for extended periods of time.
>  * Configurable (feature toggle, configurable per Hook)
>  * Emit metrics using StatsD
> See the attached design doc: [^AIRFLOW_2659.pdf]
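
The high-level design above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation from the attached design doc: the names `retry_on_infra_error`, `InfraError`, `UserError`, and the `classify` callback are hypothetical, the doubling backoff is one possible reading of "initial delay, max delay time, retry window", and the feature toggle and StatsD metrics are left out.

```python
import functools
import time


class InfraError(Exception):
    """Hypothetical marker for failures caused by an infrastructure outage."""


class UserError(Exception):
    """Hypothetical marker for failures caused by the user's own input."""


def retry_on_infra_error(initial_delay=1.0, max_delay=60.0, retry_window=600.0,
                         classify=None, sleep=time.sleep, clock=time.monotonic):
    """Retry the wrapped call while its failures are classified as infra errors.

    The delay doubles from initial_delay up to max_delay. Retrying stops once
    the next sleep would go past retry_window seconds since the first attempt.
    """
    if classify is None:
        # Assumption: by default only InfraError counts as an infra failure.
        def classify(exc):
            return isinstance(exc, InfraError)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = clock()
            delay = initial_delay
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # User errors propagate immediately, as they do today.
                    if not classify(exc):
                        raise
                    # Give up when the next sleep would exceed the window.
                    if clock() - start + delay > retry_window:
                        raise
                    sleep(delay)
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator
```

A Hook method would then be annotated with something like `@retry_on_infra_error(initial_delay=30, max_delay=600, retry_window=4 * 3600)`, with a per-Hook `classify` function deciding whether a given exception points to an outage or to a user mistake. The `sleep` and `clock` parameters exist only so the behaviour can be exercised in tests without real waiting.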



