Nataneljpwd commented on PR #54103: URL: https://github.com/apache/airflow/pull/54103#issuecomment-3300201564
Hello @xBis7, what is the goal of the PR? it seems to solve the issue that someone on the mailing list had with the starvation, however, even here it does not fully do it. If the goal is to increase throughput, why not delegate ALL the work to the sql query? It seems weird to only delegate 1 check to sql, which is used equally as much as `max_active_tis_per_dagrun` and `dag_max_active_tasks`, no? It just looks like you are trying to solve a single issue that someone in the mailing list was experiencing, which happens to increase the scheduler throughput, and it seems like a sub problem of the greater problem of general starvation in airflow. Why shouldn't we add the other checks? Why did you decide specifically to add the `max_active_tasks` check only? it seems weird to me that you chose one limit, did you try other limits first? or did you ever try to do it on multiple limits together? And it seems to increase scheduler throughput, up until the point you have mapped tasks, or a lot of tasks, or tasks which belong to a starved pool, where you will still choose multiple tasks that will be dropped in the loop, and cause more iterations, however, this time the query takes twice the amount of time. An example is as follows: Task A goes to pool A, 128 mapped tasks, short running Task B goes to pool B 50 mapped tasks Pool A 80 free slots Pool B 1 free slot Same dag, and dagrun, max_active_tasks set to 10 Here, the tasks will sort in an alternating pattern, lets suppose Task A is first for whatever reason, does not really matter. We will get 10 tasks, 5 from A, 5 from B, run only one from B, and suppose, Task A falls because of mapped task count limit, and we only schedule 1 task. Next run, another task (or two) from Pool B is freed, the same thing happens again, we schedule only 1 or 2 tasks, and drop the other selected tasks. This issue can occur for any other concurrency limit such as `max_active_tis_per_dag`, or pool slot count. I just don't think I fully understood the goal of the PR, and so, if you coud clarify, it would be appreciated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
