For completeness, there is also support for a DAG run timeout, which is yet
another mechanism. I haven't used it myself, but I believe it was
introduced in 1.7.x.
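
Roughly, it would look something like this (untested sketch on my end; the
DAG id and the 4-hour value are just placeholders):

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    'my_hourly_dag',
    schedule_interval='@hourly',
    start_date=datetime(2016, 10, 1),
    # Caps the whole DAG run, unlike the per-task execution_timeout below.
    dagrun_timeout=timedelta(hours=4))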

-s

On Fri, Oct 28, 2016 at 9:42 AM, siddharth anand <san...@apache.org> wrote:

> 1.6.x had an infinite retry problem. If you specified a retry count
> greater than 1, the tasks would get retried ad infinitum.
>
> This was fixed in 1.7.x (1.7.1.3 is the most recent release).
>
> We have been using the *execution_timeout* for over a year.
>
> build_sender_models_spark_job = BashOperator(
>     task_id='build_sender_models_spark_job',
>     execution_timeout=timedelta(hours=3),
>     pool='ep_data_pipeline_spark_tasks_only',
>     bash_command=sender_model_building_command,
>     params={'CLUSTER_IP': PLATFORM_VARS['ip'],
>             'USER': PLATFORM_VARS['ssh_user'],
>             'HOME_DIR': PLATFORM_VARS['home_dir'],
>             'SSH_KEY': SSH_KEY},
>     dag=dag)
>
>
> As an additional measure, we specify an SLA on the last step (last task)
> of our DAG. Ours is an hourly DAG, so if that last task has not completed
> within 2 hours of the run's start, we have missed our SLA. For example,
> for an execution date of *20161027T12:00:00Z*, we'd expect the run to
> start at *20161027T13:00:00Z*. If the last task has not finished by
> *20161027T15:00:00Z*, we will be notified of the SLA miss.
>
> # Operator: Send Email when flow completes successfully
> send_email_notification_flow_successful = PythonOperator(
>     task_id='send_email_notification_flow_successful',
>     execution_timeout=timedelta(minutes=15),
>     pool='ep_data_pipeline_metrics_gathering',
>     provide_context=True,
>     sla=timedelta(hours=2),
>     python_callable=send_email_notification_flow_successful,
>     dag=dag)
>
>
> SLAs have been around since 1.6.x or earlier. In 1.7.x, I added a callback
> mechanism to alert on SLA misses. At Agari, we essentially page our on-call
> engineer and write the details to Slack.
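>
> The callback itself is just a Python function. A rough sketch (the exact
> arguments the scheduler passes may differ by version, and notify_slack /
> page_oncall are just stand-ins for whatever alerting helpers you use):
>
> def sla_alert_func(dag, task_list, blocking_task_list, slas, blocking_tis):
>     message = "SLA miss on DAG %s: %s" % (dag.dag_id, task_list)
>     notify_slack(message)   # e.g. post to a Slack webhook
>     page_oncall(message)    # e.g. trigger the on-call pager
>
> It then gets wired into the DAG via sla_miss_callback: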
>
> default_args = {
>     'owner': 'sanand',
>     'depends_on_past': True,
>     'pool': 'ep_data_pipeline',
>     'start_date': START_DATE,
>     'email': [import_ep_pipeline_alert_email_dl],
>     'email_on_failure': import_airflow_enable_notifications,
>     'email_on_retry': import_airflow_enable_notifications,
>     'retries': 10,
>     'retry_delay': timedelta(seconds=30),
>     'priority_weight': import_airflow_priority_weight}
>
> dag = DAG(DAG_NAME,
>           schedule_interval='@hourly',
>           default_args=default_args,
>           sla_miss_callback=sla_alert_func)
>
>
> You can use SLAs as an alternative approach to achieve your goals, or in
> tandem with retries, as we do.
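>
> For example, a single task (in the same DAG file as above) can combine
> all of these; the values here are just placeholders:
>
> guarded_task = BashOperator(
>     task_id='guarded_task',
>     bash_command='echo "placeholder command"',
>     retries=3,
>     retry_delay=timedelta(minutes=5),
>     execution_timeout=timedelta(hours=1),
>     sla=timedelta(hours=2),
>     dag=dag)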
> -s
>
>
> On Fri, Oct 28, 2016 at 8:11 AM, Adam Gutcheon <
> adam_gutch...@monitor-360.com> wrote:
>
>> Hello,
>>
>> I'm having a big, showstopping problem on my Airflow installation. When
>> a task reaches its execution_timeout, I can see the error message in
>> the task's log, but it never actually fails the task, leaving it in a
>> running state forever. This is true of any task that has an
>> execution_timeout set in any dag. I am using the CeleryExecutor. Are
>> there hidden pitfalls to timeouts I should know about?
>>
>> Thanks,
>> Adam G.
>>
>>
>
