GitHub user tanvn added a comment to the discussion: Scheduler sends second try 
before first is complete

We have been hitting the same issue recently: the Airflow scheduler retries a 
task while the first attempt is still running.

Our Airflow setup is as follows:

- Version: 2.10.5
- Running on Kubernetes
- Components
  - single scheduler
  - single webserver
- Use MySQL 8.0
- Use Helm chart for Deployment

After some investigation, I found that the issue happens under a fairly rare 
condition: a new deployment takes place while the current scheduler is 
executing `adopt_or_reset_orphaned_tasks`. If the scheduler is terminated after 
some pods have had their label (airflow-worker=job_id) updated, but before the 
matching queued_by_job_id values are updated in the DB, the state becomes 
inconsistent:
- in the DB, a task instance still has the old queued_by_job_id, for example: 
queued_by_job_id=1
- its running pod now has the new label: airflow-worker=2 (updated from 1 -> 2)
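
To make the mismatch concrete, here is a minimal sketch of a check that compares the two sides. It is plain Python over in-memory records, not Airflow or Kubernetes API code: `TaskInstanceRow`, `WorkerPod`, and `find_label_mismatches` are hypothetical names, and only the `airflow-worker` label and `queued_by_job_id` column come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstanceRow:
    """Hypothetical stand-in for a task instance row in the DB."""
    key: str
    queued_by_job_id: int


@dataclass
class WorkerPod:
    """Hypothetical stand-in for a running worker pod."""
    task_key: str
    labels: dict = field(default_factory=dict)


def find_label_mismatches(task_instances, pods):
    """Return (key, db_job_id, pod_label) for pods whose airflow-worker
    label disagrees with queued_by_job_id in the DB."""
    pods_by_key = {pod.task_key: pod for pod in pods}
    mismatches = []
    for ti in task_instances:
        pod = pods_by_key.get(ti.key)
        if pod is None:
            continue  # no running pod for this task instance
        label = pod.labels.get("airflow-worker")
        if label != str(ti.queued_by_job_id):
            mismatches.append((ti.key, ti.queued_by_job_id, label))
    return mismatches


if __name__ == "__main__":
    tis = [TaskInstanceRow("example_task", queued_by_job_id=1)]
    pods = [WorkerPod("example_task", {"airflow-worker": "2"})]
    print(find_label_mismatches(tis, pods))
```

In a real cluster the two inputs would have to come from the metadata DB and from the Kubernetes API; how to fetch them is left out here on purpose.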

In this state, when the new deployment brings up a new scheduler (with job_id 
3, for example) and it executes `adopt_or_reset_orphaned_tasks`, the task 
instance with queued_by_job_id=1 cannot be matched to a pod, because its pod is 
now labeled airflow-worker=2. The task instance is reset, but its pod keeps 
running.
A new worker pod is launched later for the same task, which causes this error.
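
The sequence can be reproduced with a simplified model of the adopt-or-reset step. This is only a sketch under assumptions: `adopt_or_reset` is a hypothetical function over plain dictionaries, it assumes adoption looks up pods by the job id stored in the DB, and clearing `state` stands in for whatever the real reset does.

```python
def adopt_or_reset(task_instances, pods, new_job_id):
    """Simplified model: adopt a task instance only if a pod is labeled
    with the job id recorded in the DB; otherwise reset it."""
    adopted, reset = [], []
    for ti in task_instances:
        wanted_label = str(ti["queued_by_job_id"])
        matching = [
            pod for pod in pods
            if pod["task_key"] == ti["key"]
            and pod["labels"].get("airflow-worker") == wanted_label
        ]
        if matching:
            # Adoption: relabel the pod, then record the new owner in the DB.
            for pod in matching:
                pod["labels"]["airflow-worker"] = str(new_job_id)
            ti["queued_by_job_id"] = new_job_id
            adopted.append(ti["key"])
        else:
            # No pod found under the old label: the task instance is reset
            # (modeled here by clearing its state), the pod is left untouched.
            ti["state"] = None
            reset.append(ti["key"])
    return adopted, reset


if __name__ == "__main__":
    # DB still says job 1, but the pod was already relabeled to job 2.
    ti = {"key": "example_task", "queued_by_job_id": 1, "state": "running"}
    pod = {"task_key": "example_task", "labels": {"airflow-worker": "2"}}
    print(adopt_or_reset([ti], [pod], new_job_id=3))
    print(pod["labels"])
```

In the model, the pod still carries airflow-worker=2 after the call and nothing stops it, which matches the duplicate-run symptom described above.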


Related source code:

- Scheduler
  - adopt_or_reset_orphaned_tasks when the scheduler is up 
https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L1081-L1082
  - 
https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L1943
  - get list of TIs linked with non-running old scheduler jobs: 
https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L1975-L1991
  - reset TIs (ones that could not be adopted): 
https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L1999-L2002
  - update queued_by_job_id for adopted TIs: 
https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L2004-L2005
- K8s executor
  - try adopt task instances: 
https://github.com/apache/airflow/blob/providers-cncf-kubernetes/10.1.0/providers/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py#L577
  - find pods whose labels match the target scheduler job ids: 
https://github.com/apache/airflow/blob/providers-cncf-kubernetes/10.1.0/providers/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py#L591-L598

GitHub link: 
https://github.com/apache/airflow/discussions/22554#discussioncomment-14678157
