For completeness, there is also support for a DAG run timeout, which is yet another mechanism. I haven't used it myself, though I believe it was introduced in 1.7.x.
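In case it's useful, here is a minimal sketch of what that might look like; it assumes the DAG constructor's dagrun_timeout parameter and 1.8-style import paths, and the DAG/task names are just placeholders:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Placeholder DAG: names, dates, and schedule are made up for illustration.
dag = DAG(
    'example_hourly_pipeline',
    start_date=datetime(2016, 10, 1),
    schedule_interval='@hourly',
    # Ask the scheduler to time out (fail) a DAG run that exceeds 2 hours.
    dagrun_timeout=timedelta(hours=2))

extract = BashOperator(
    task_id='extract_data',
    bash_command='echo "extracting..."',
    # Per-task limit, independent of the DAG-run-level timeout above.
    execution_timeout=timedelta(minutes=30),
    dag=dag)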
-s

On Fri, Oct 28, 2016 at 9:42 AM, siddharth anand <san...@apache.org> wrote:

> 1.6.x had an infinite retry problem. If you specified a retry count
> greater than 1, the tasks would get retried ad infinitum.
>
> This was fixed in 1.7.x (1.7.1.3 is the most recent release).
>
> We use and have been using the *execution_timeout* for over a year.
>
> build_sender_models_spark_job = BashOperator(
>     task_id='build_sender_models_spark_job',
>     execution_timeout=timedelta(hours=3),
>     pool='ep_data_pipeline_spark_tasks_only',
>     bash_command=sender_model_building_command,
>     params={'CLUSTER_IP': PLATFORM_VARS['ip'],
>             'USER': PLATFORM_VARS['ssh_user'],
>             'HOME_DIR': PLATFORM_VARS['home_dir'],
>             'SSH_KEY': SSH_KEY},
>     dag=dag)
>
> As an additional measure, we specify an SLA timeout on the last step
> (last task) of our DAG. We have an hourly DAG, so if the last task for an
> hourly DAG run exceeds 2 hours, we have missed our SLA. For example, for an
> execution date of *20161027T12:00:00Z*, we'd expect the run to start at
> *20161027T13:00:00Z*. By *20161027T15:00:00Z*, we will be notified of the
> SLA miss.
>
> # Operator: Send Email when flow completes successfully
> send_email_notification_flow_successful = PythonOperator(
>     task_id='send_email_notification_flow_successful',
>     execution_timeout=timedelta(minutes=15),
>     pool='ep_data_pipeline_metrics_gathering',
>     provide_context=True,
>     sla=timedelta(hours=2),
>     python_callable=send_email_notification_flow_successful,
>     dag=dag)
>
> SLAs have been around since 1.6.x or earlier. In 1.7.x, I added a callback
> mechanism to alert on SLA misses. At Agari, we essentially page our on-call
> engineer and write info to Slack.
>
> default_args = {
>     'owner': 'sanand',
>     'depends_on_past': True,
>     'pool': 'ep_data_pipeline',
>     'start_date': START_DATE,
>     'email': [import_ep_pipeline_alert_email_dl],
>     'email_on_failure': import_airflow_enable_notifications,
>     'email_on_retry': import_airflow_enable_notifications,
>     'retries': 10,
>     'retry_delay': timedelta(seconds=30),
>     'priority_weight': import_airflow_priority_weight}
>
> dag = DAG(DAG_NAME,
>           schedule_interval='@hourly',
>           default_args=default_args,
>           sla_miss_callback=sla_alert_func)
>
> You can use SLAs as an alternative approach to achieve your goals, or in
> tandem with retries, as we do.
>
> -s
>
> On Fri, Oct 28, 2016 at 8:11 AM, Adam Gutcheon <
> adam_gutch...@monitor-360.com> wrote:
>
>> Hello,
>>
>> I'm having a big, showstopping problem with my Airflow installation. When
>> a task reaches its execution_timeout, I can see the error message in
>> the task's log, but it never actually fails the task, leaving it in a
>> running state forever. This is true of any task that has an
>> execution_timeout set in any DAG. I am using the CeleryExecutor. Are
>> there hidden pitfalls to timeouts I should know about?
>>
>> Thanks,
>> Adam G.
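In case it helps to see what sla_alert_func looks like in practice, a stripped-down sketch is below. The argument list shown is what newer releases pass the callback (it has changed across versions, so check your install), and the logging call is just a stand-in for a real paging/Slack integration:

import logging

log = logging.getLogger(__name__)

def sla_alert_func(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Called by the scheduler when SLAs are missed on this DAG.

    The argument list here matches newer Airflow releases; older versions
    may pass something slightly different, so verify against your install.
    """
    message = 'SLA miss on DAG {dag_id}. Late tasks: {tasks}'.format(
        dag_id=dag.dag_id, tasks=task_list)
    # Stand-in for a real integration (pager, Slack webhook, email, ...).
    log.error(message)

It gets wired up exactly as in the snippet quoted above, via sla_miss_callback=sla_alert_func on the DAG.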