bharatk-meesho opened a new issue, #26587:
URL: https://github.com/apache/airflow/issues/26587
### Official Helm Chart version
1.6.0 (latest released)
### Apache Airflow version
2.3.2
### Kubernetes Version
4.5.7
### Helm Chart configuration
```
celery:
  ## if celery worker Pods are gracefully terminated
  ## - consider defining a `workers.podDisruptionBudget` to prevent there not being
  ##   enough available workers during graceful termination waiting periods
  ##
  ## graceful termination process:
  ##  1. prevent worker accepting new tasks
  ##  2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
  ##  3. send SIGTERM to worker
  ##  4. wait AT MOST `workers.terminationPeriod` for kill to finish
  ##  5. send SIGKILL to worker
  ##
  gracefullTermination: true

  ## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
  ##
  gracefullTerminationPeriod: 180

## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
## - [WARNING] tasks that are still running during SIGKILL will be orphaned, this is important
##   to understand with KubernetesPodOperator(), as Pods may continue running
##
terminationPeriod: 120
```
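For context on the `workers.podDisruptionBudget` suggestion in the comments above: a PodDisruptionBudget is a standard Kubernetes object that limits how many matching pods may be evicted at once. A minimal sketch (the name and labels here are assumptions, not taken from the chart):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: airflow-worker-pdb   # hypothetical name
spec:
  minAvailable: 1            # keep at least one worker up during voluntary disruptions
  selector:
    matchLabels:
      component: worker      # assumed label on the worker pods
```

Note that a PDB only guards against voluntary evictions (e.g. node drains); it does not by itself stop an autoscaler from scaling the worker Deployment down.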
### Docker Image customisations
_No response_
### What happened
I am running an Airflow cluster on EKS on AWS, with autoscaling configured for the workers: if CPU/memory usage exceeds 70%, a new worker pod is spun up. The problem appears when these worker pods scale down: they terminate within a few minutes regardless of any tasks still running.
Is there a way to configure a worker pod to terminate only once the tasks running on it have finished? Tasks in my DAGs can run anywhere from a few minutes to a few hours, so I don't want to set a very large value for `gracefullTerminationPeriod`.
Typically the long-running task is a PythonOperator that runs either a Presto SQL query or a Databricks job, via PrestoHook or DatabricksOperator respectively, and I don't want these to receive SIGTERM before they finish just because the worker pod is scaling down.
### What you think should happen instead
One of the following two things should happen:
1) An option for a worker pod not to terminate until all tasks running on that particular worker have completed.
2) The task is terminated gracefully and restarted on another worker.
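Option 2 can be approximated today with task retries: if the worker kills a task during scale-down, a retry re-queues it and it may land on a surviving worker (though the task restarts from the beginning). A minimal sketch, with illustrative values:

```python
from datetime import timedelta

# Illustrative retry settings; pass as DAG(default_args=default_args, ...) so a
# task killed during worker scale-down is re-queued instead of failing outright.
default_args = {
    "retries": 3,                         # re-attempt up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}
```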
### How to reproduce
It can be reproduced by running multiple DAGs with different execution times, so that the workers first scale up and then scale down. One simple way is to run several copies of a DAG containing a PythonOperator that sleeps for a random duration.
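The sleeping callable for that reproduction can be as simple as the sketch below (a hypothetical helper, not part of Airflow); wiring several PythonOperator tasks to it across DAG copies forces the workers to scale up and later scale down mid-task:

```python
import random
import time

def random_sleep(min_seconds=60, max_seconds=3600, sleep=time.sleep):
    """Body for a PythonOperator task: sleep for a random duration.

    `sleep` is injectable so the function can be exercised without actually
    waiting. Returns the chosen duration in seconds.
    """
    duration = random.uniform(min_seconds, max_seconds)
    sleep(duration)
    return duration
```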
### Anything else
Looking for any solution that does not mark a task as failed when its worker pod scales down.
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)