RikHeijdens opened a new issue #13151:
URL: https://github.com/apache/airflow/issues/13151


   **Apache Airflow version**: 2.0.0
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   **Environment**:
   
   - **OS** (e.g. from /etc/os-release): Debian GNU/Linux 10 (buster)
   - **Kernel** (e.g. `uname -a`): Linux 6ae65b86e112 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 GNU/Linux
   - **Others**: Python 3.8
   
   **What happened**:
   
   After migrating one of our development Airflow instances from 1.10.14 to 
2.0.0, the scheduler started to refuse to schedule tasks for a DAG that did not 
actually exceed its `max_active_runs`.
   
   When this happened, the following error was logged:
   
   ```
   DAG <dag_name> already has 577 active runs, not queuing any tasks for run 2020-12-17 08:05:00+00:00
   ```
   
   A bit of digging revealed that this DAG had task instances associated with 
it that are in the `removed` state. As soon as I forced the task instances that 
are in the `removed` state into the `failed` state, the tasks would be 
scheduled.
   
   I believe the root cause of the issue is that
   [this filter](https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1506)
   does not filter out tasks that are in the `removed` state.
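
To illustrate the suspected behaviour, here is a minimal toy model in plain Python. It is not Airflow's code: the function `count_active_runs`, the `FINISHED_STATES` set and the sample data are all hypothetical. It assumes the active-run count is taken from the distinct execution dates of task instances whose state is not in a set of finished states, which is what the linked filter appears to do.

```python
# Toy model only: every name here is hypothetical and not part of Airflow's API.
# Assumption: states treated as "finished" do not include "removed".
FINISHED_STATES = frozenset({"success", "failed", "skipped"})


def count_active_runs(task_instances, dag_id, inactive_states=FINISHED_STATES):
    """Count distinct execution dates of one DAG that still have a task
    instance whose state is not listed in inactive_states."""
    active_dates = {
        ti["execution_date"]
        for ti in task_instances
        if ti["dag_id"] == dag_id and ti["state"] not in inactive_states
    }
    return len(active_dates)


task_instances = [
    {"dag_id": "example", "execution_date": "2020-12-15", "state": "success"},
    {"dag_id": "example", "execution_date": "2020-12-16", "state": "removed"},
    {"dag_id": "example", "execution_date": "2020-12-17", "state": "scheduled"},
]

# The historical "removed" task instance is counted as an active run.
as_reported = count_active_runs(task_instances, "example")

# Treating "removed" as inactive leaves only the run that is really in progress.
with_removed_excluded = count_active_runs(
    task_instances, "example", FINISHED_STATES | {"removed"}
)

print(as_reported, with_removed_excluded)
```

In this model a DAG with `max_active_runs` set to 1 would look saturated in the first case, even though only one run is actually in progress.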
   
   **What you expected to happen**:
   
   I expected the task instances in the DAG to be scheduled, because the DAG 
did not actually exceed the number of `max_active_runs`.
   
   **How to reproduce it**:
   
   I think the best approach to reproduce it is as follows:
   1. Create a DAG and set `max_active_runs` to 1.
   2. Ensure the DAG has run successfully a number of times, such that it has 
some history associated with it.
   3. Set one historical task instance to the `removed` state (either by 
directly updating it in the DB, or deleting a task from a DAG before it has 
been able to execute).
   
   **Anything else we need to know**:
   
   The Airflow instance that I ran into this issue with contains about 3 years 
of task history, which means that we actually had quite a few task instances 
that are in the `removed` state, but there is no easy way to delete those from 
the Web UI.
   
   A workaround is to set the tasks to `failed`, which will allow the 
scheduler to proceed.
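
A rough sketch of that workaround, run against an in-memory SQLite table rather than a real metadata database. The `task_instance` table here is a simplified assumption containing only the columns the example needs, and `fail_removed_task_instances` is a hypothetical helper, not an Airflow function.

```python
import sqlite3

# Toy stand-in for the metadata database (assumption: simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    "task_id TEXT, dag_id TEXT, execution_date TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
    [
        ("extract", "example", "2020-12-15", "success"),
        ("old_task", "example", "2020-12-15", "removed"),
        ("old_task", "example", "2020-12-16", "removed"),
        ("old_task", "other_dag", "2020-12-16", "removed"),
    ],
)


def fail_removed_task_instances(conn, dag_id):
    """Move every 'removed' task instance of one DAG to 'failed'.

    Returns the number of rows that were changed.
    """
    cursor = conn.execute(
        "UPDATE task_instance SET state = 'failed' "
        "WHERE dag_id = ? AND state = 'removed'",
        (dag_id,),
    )
    conn.commit()
    return cursor.rowcount


changed = fail_removed_task_instances(conn, "example")

# Task instances of other DAGs are left untouched.
remaining = conn.execute(
    "SELECT COUNT(*) FROM task_instance WHERE state = 'removed'"
).fetchone()[0]

print(changed, remaining)
```

Scoping the update to a single `dag_id` keeps the change limited to the DAG that the scheduler refuses to run.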

