bharatk-meesho opened a new issue, #26587:
URL: https://github.com/apache/airflow/issues/26587
### Official Helm Chart version
1.6.0 (latest released)
### Apache Airflow version
2.3.2
### Kubernetes Version
4.5.7
### Helm Chart configuration
```
celery:
  ## if celery worker Pods are gracefully terminated
  ## - consider defining a `workers.podDisruptionBudget` to prevent there not being
  ##   enough available workers during graceful termination waiting periods
  ##
  ## graceful termination process:
  ##  1. prevent worker accepting new tasks
  ##  2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
  ##  3. send SIGTERM to worker
  ##  4. wait AT MOST `workers.terminationPeriod` for kill to finish
  ##  5. send SIGKILL to worker
  ##
  gracefullTermination: true

  ## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
  ##
  gracefullTerminationPeriod: 180

## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
## - [WARNING] tasks that are still running during SIGKILL will be orphaned, this is important
##   to understand with KubernetesPodOperator(), as Pods may continue running
##
terminationPeriod: 120
```
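For context on the `workers.podDisruptionBudget` suggestion in the comments above: a PodDisruptionBudget is a standard Kubernetes object that limits how many matching pods may be evicted at once. A minimal sketch (the name and labels here are assumptions, not taken from the chart):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: airflow-worker-pdb   # hypothetical name
spec:
  minAvailable: 1            # keep at least one worker up during voluntary disruptions
  selector:
    matchLabels:
      component: worker      # assumed label on the worker pods
```

Note that a PDB only guards against voluntary evictions (e.g. node drains); it does not by itself stop an autoscaler from scaling the worker Deployment down.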
### Docker Image customisations
_No response_
### What happened
I am running an Airflow cluster on EKS on AWS, with autoscaling configured for the workers: if CPU/memory usage exceeds 70%, a new worker pod is spun up. The problem appears when these worker pods scale down: they terminate within a few minutes regardless of any tasks still running.
Is there a way to configure a worker pod to terminate only once the tasks running on it have finished? Tasks in my DAGs can run anywhere from a few minutes to a few hours, so I don't want to set a very large value for `gracefullTerminationPeriod`.
Typically the long-running task is a PythonOperator that runs either a Presto SQL query or a Databricks job, via PrestoHook or DatabricksOperator respectively, and I don't want these to receive SIGTERM before they finish just because the worker pod is scaling down.
### What you think should happen instead
One of the following two things should happen:
1) An option for a worker pod not to terminate until all tasks running on that particular worker have completed.
2) The task is terminated gracefully and restarted on another worker.
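Option 2 can be approximated today with task retries: if the worker kills a task during scale-down, a retry re-queues it and it may land on a surviving worker (though the task restarts from the beginning). A minimal sketch, with illustrative values:

```python
from datetime import timedelta

# Illustrative retry settings; pass as DAG(default_args=default_args, ...) so a
# task killed during worker scale-down is re-queued instead of failing outright.
default_args = {
    "retries": 3,                         # re-attempt up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}
```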
### How to reproduce
It can be reproduced by running multiple DAGs with different execution times, so that the workers first scale up and then scale down. One simple way is to run several copies of a DAG containing a PythonOperator that sleeps for a random duration.
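The sleeping callable for that reproduction can be as simple as the sketch below (a hypothetical helper, not part of Airflow); wiring several PythonOperator tasks to it across DAG copies forces the workers to scale up and later scale down mid-task:

```python
import random
import time

def random_sleep(min_seconds=60, max_seconds=3600, sleep=time.sleep):
    """Body for a PythonOperator task: sleep for a random duration.

    `sleep` is injectable so the function can be exercised without actually
    waiting. Returns the chosen duration in seconds.
    """
    duration = random.uniform(min_seconds, max_seconds)
    sleep(duration)
    return duration
```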
### Anything else
Looking for any solution that does not mark a task as failed when its worker pod scales down.
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)