[ https://issues.apache.org/jira/browse/AIRFLOW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520135#comment-16520135 ]
Ash Berlin-Taylor commented on AIRFLOW-2659:
--------------------------------------------

Tasks/operators in Airflow can already be configured to retry. Where does that fall short in a way this proposal addresses? I'm initially wary of having _two_ levels of retry in Airflow, as it sounds like a recipe for confusion.

> Improving Robustness of Operators in Airflow during Infra Outages
> -----------------------------------------------------------------
>
>                 Key: AIRFLOW-2659
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2659
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: hooks
>    Affects Versions: 1.10
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>            Priority: Major
>         Attachments: AIRFLOW_2659.pdf, test_rules.py
>
> *Problem:*
> If an infrastructure outage occurs on the Hadoop cluster, tasks that rely on those services will fail in Airflow, causing SLA misses and deteriorating user confidence in Airflow (even if the outage was in another system). Only a fraction of tasks and DAGs have retries around certain operators/hooks, and the retry attempts are not sufficient during an outage.
>
> *Goal:* Automatically retry failures that occur due to infrastructure issues.
>
> *High-level design:*
> * Retry decorator in the Hooks for easy annotation
> * Retry logic will be time-based (initial delay, max delay time, retry window, etc.)
> * Allow each Hook to determine the root cause of the error (user error, infra outage)
> ** User errors will be handled the way they are today.
> ** Infra errors will be able to retry for extended periods of time.
> * Configurable (feature toggle, configurable per Hook)
> * Emit metrics using StatsD
>
> See the attached [^AIRFLOW_2659.pdf] design doc.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
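To make the proposed design concrete, here is a minimal sketch of what a time-based retry decorator for hooks could look like. All names here (`InfraError`, `retry_on_infra_error`) are hypothetical illustrations, not part of Airflow's API; the exception-based user/infra split is one possible way to implement the root-cause distinction the ticket proposes, and StatsD emission is omitted for brevity.

```python
import functools
import time


class InfraError(Exception):
    """Hypothetical marker a hook could raise when it attributes a failure
    to an infrastructure outage rather than a user error."""


def retry_on_infra_error(initial_delay=1.0, max_delay=60.0, retry_window=300.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Time-based retry: exponential backoff with a capped delay, bounded by
    a total retry window (all in seconds).

    Only InfraError is retried; any other exception (a user error) propagates
    immediately, matching the user/infra split proposed in the ticket. The
    `sleep` and `clock` parameters exist so tests can inject fakes.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = clock()
            delay = initial_delay
            while True:
                try:
                    return func(*args, **kwargs)
                except InfraError:
                    # Give up once the next sleep would exceed the window.
                    if clock() - start + delay > retry_window:
                        raise
                    sleep(delay)
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator
```

A hook method annotated with `@retry_on_infra_error(retry_window=3600)` would then keep retrying transient outages for up to an hour, while still failing fast on user errors such as a bad query.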