Golokesh Patra created AIRFLOW-6648:
---------------------------------------
             Summary: Timeout Feature - Provided statistical solution to long running/stuck jobs and take appropriate actions
                 Key: AIRFLOW-6648
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6648
             Project: Apache Airflow
          Issue Type: Improvement
          Components: aws, DAG, database, operators
    Affects Versions: 1.10.0
         Environment: AWS Linux AMI - Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-1027-aws x86_64)
            Reporter: Golokesh Patra
            Assignee: Golokesh Patra
         Attachments: image-2020-01-27-17-07-51-822.png, image-2020-01-27-17-08-09-867.png, image-2020-01-27-17-08-33-088.png, image-2020-01-27-17-22-07-433.png, image2019-3-25_12-33-57.png

Sometimes, across different types of tasks/jobs, Airflow jobs/tasks get stuck while they are in the running state. A stuck task stalls the pipeline for no apparent reason and blocks other jobs/tasks, which is a disaster when it happens in production.

This improvement aims not only to improve on the TIMEOUT logic already in Airflow, but also to make it more functional and automated.

*Diagrammatic Explanation of the Solution -*

!image-2020-01-27-17-22-07-433.png!

*Detailed Theoretical Explanation -*

As the data volume and complexity of tasks/jobs increase, along with the load, memory leaks, stuck jobs, and infrastructure issues become more likely and can produce unwanted results. On some days there may simply be more data, which causes a steep jump in a job's duration; otherwise, the growth in duration is expected to be gradual. Sometimes jobs get stuck for various reasons and often require termination followed by a restart.
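
As a rough illustration of the statistical idea, and not the implementation proposed in this issue, the sketch below flags a running job whose elapsed time exceeds the mean of its past durations plus k sample standard deviations. The function names, the factor k = 3, and the sample durations are all assumptions made for illustration; none of them come from Airflow.

```python
from statistics import mean, stdev


def duration_threshold(history, k=3.0):
    """Return mean + k * sample standard deviation of past run durations.

    `history` is a hypothetical list of past durations (same unit as the
    elapsed time being checked); k = 3.0 is an assumed default.
    """
    if len(history) < 2:
        raise ValueError("need at least two past durations")
    return mean(history) + k * stdev(history)


def looks_stuck(history, elapsed, k=3.0):
    """Return True if the running job's elapsed time exceeds the threshold."""
    return elapsed > duration_threshold(history, k)


# Made-up durations in minutes for one job.
past_minutes = [30, 32, 31, 33, 34]
print(looks_stuck(past_minutes, 36))  # False: within normal variation
print(looks_stuck(past_minutes, 90))  # True: far beyond past durations
```

Computing the threshold over a recent window of runs, rather than the full history, would let it follow the gradual growth in duration described above.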

So, we are trying to build logic that will automatically decide whether to:
* _Terminate the job_
* _Terminate and restart_
* _Terminate and mark as a failure so that downstream jobs don't get triggered_
* _Take no action and inform DevOps about the issue (manual action)_

So, I just want to know what would be an effective statistical way to achieve the above outcomes.

Let's consider 2 jobs, X & Y.

Job-related info -

!image-2020-01-27-17-07-51-822.png!

!image-2020-01-27-17-08-09-867.png!

I was then thinking of having a new table, structured as follows -

+Derived table-+

!image-2020-01-27-17-08-33-088.png!

(The above example is theoretical and the actual implementation might differ.)

*LIMITATIONS -*
# For now, we have only tested the above on EMR (personal use case).
# Testing is pending for Databricks (personal use case).

Please suggest any other services where this is needed or can be used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)