GitHub user tanvn added a comment to the discussion: Scheduler sends second try before first is complete
We have been hitting the same issue recently: the Airflow scheduler retries a task while the first attempt is still running.

Our Airflow setup is as follows:
- Version: 2.10.5
- Running on Kubernetes
- Components
  - single scheduler
  - single webserver
- MySQL 8.0
- Deployed with the Helm chart

After some investigation, I found that the issue happens under a fairly rare condition: a new deployment takes place while the current scheduler is executing `adopt_or_reset_orphaned_tasks`. The scheduler is terminated after some pods have already had their label (airflow-worker=job_id) updated, which leaves an inconsistent state:
- in the DB, a task instance still has the old queued_by_job_id, for example queued_by_job_id=1
- its running pod now has a new label: airflow-worker=2 (updated from 1 -> 2)

In this situation, when the new deployment takes place, a new scheduler comes up (its job_id is now 3, for example) and executes `adopt_or_reset_orphaned_tasks`. The task instance with queued_by_job_id=1 is then reset, but its pod is still running. A new worker pod is launched later, which causes this error.
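To make the sequence above easier to follow, here is a minimal toy model in plain Python. It is not Airflow's code: `TaskInstance`, `Pod`, `adopt_or_reset`, `simulate` and the `die_after_relabel` flag are hypothetical names made up for illustration, and the model assumes that pods are matched by comparing their airflow-worker label with the queued_by_job_id of the orphaned task instances, as described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskInstance:
    key: str
    queued_by_job_id: int
    state: Optional[str] = "running"


@dataclass
class Pod:
    ti_key: str
    worker_label: str  # stands in for the pod label airflow-worker=<job_id>
    alive: bool = True


def adopt_or_reset(new_job_id, dead_job_ids, tis, pods, die_after_relabel=False):
    """Toy adopt-or-reset pass. Returns the keys of task instances that were reset."""
    # Task instances still linked to scheduler jobs that are no longer running.
    orphans = [ti for ti in tis if ti.queued_by_job_id in dead_job_ids]
    target_labels = {str(ti.queued_by_job_id) for ti in orphans}

    # Executor side: relabel matching pods to the new scheduler's job id.
    adopted = set()
    for pod in pods:
        if pod.alive and pod.worker_label in target_labels:
            pod.worker_label = str(new_job_id)
            adopted.add(pod.ti_key)

    if die_after_relabel:
        # Scheduler is terminated here: labels changed, DB rows untouched.
        return []

    # Scheduler side: update adopted rows, reset the ones that were not adopted.
    reset = []
    for ti in orphans:
        if ti.key in adopted:
            ti.queued_by_job_id = new_job_id
        else:
            ti.state = None  # reset, so the task will be scheduled again
            reset.append(ti.key)
    return reset


def simulate(interrupted):
    """Scheduler 1 is gone; scheduler 2 runs (maybe interrupted), then scheduler 3."""
    ti = TaskInstance(key="task_a", queued_by_job_id=1)
    pod = Pod(ti_key="task_a", worker_label="1")
    adopt_or_reset(2, {1}, [ti], [pod], die_after_relabel=interrupted)
    reset = adopt_or_reset(3, {1, 2}, [ti], [pod])
    return ti, pod, reset
```

Calling `simulate(interrupted=True)` leaves the task instance reset while its pod is still alive with label 2, which is the duplicate-run situation. With `simulate(interrupted=False)` the pod and the DB row stay in sync and nothing is reset.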
Related source code:
- Scheduler
  - adopt_or_reset_orphaned_tasks when the scheduler is up
    - https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L1081-L1082
    - https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L1943
  - get list of TIs linked with non-running old scheduler jobs: https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L1975-L1991
  - reset TIs (ones that could not be adopted): https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L1999-L2002
  - update queued_by_job_id for adopted TIs: https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L2004-L2005
- K8s executor
  - try to adopt task instances: https://github.com/apache/airflow/blob/providers-cncf-kubernetes/10.1.0/providers/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py#L577
  - find pods whose labels match the target scheduler job ids: https://github.com/apache/airflow/blob/providers-cncf-kubernetes/10.1.0/providers/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py#L591-L598

GitHub link: https://github.com/apache/airflow/discussions/22554#discussioncomment-14678157
