Nataneljpwd commented on PR #54103:
URL: https://github.com/apache/airflow/pull/54103#issuecomment-3300201564

   Hello @xBis7, what is the goal of the PR? it seems to solve the issue that 
someone on the mailing list had with the starvation, however, even here it does 
not fully do it.
   If the goal is to increase throughput, why not delegate ALL the work to the 
sql query?
   It seems weird to only delegate 1 check to sql, which is used equally as 
much as `max_active_tis_per_dagrun` and `dag_max_active_tasks`, no?
   
   It just looks like you are trying to solve a single issue that someone in 
the mailing list was experiencing, which happens to increase the scheduler 
throughput, and it seems like a sub problem of the greater problem of general 
starvation in airflow.
   
   Why shouldn't we add the other checks? 
   Why did you decide specifically to add the `max_active_tasks` check only? it 
seems weird to me that you chose one limit, did you try other limits first? or 
did you ever try to do it on multiple limits together?
   
   And it seems to increase scheduler throughput, up until the point you have 
mapped tasks, or a lot of tasks, or tasks which belong to a starved pool, where 
you will still choose multiple tasks that will be dropped in the loop, and 
cause more iterations, however, this time the query takes twice the amount of 
time.
   
   An example is as follows:
   Task A goes to pool A, 128 mapped tasks, short running
   Task B goes to pool B 50 mapped tasks
   Pool A 80 free slots
   Pool B 1 free slot
   Same dag, and dagrun, max_active_tasks set to 10
   
   Here, the tasks will sort in an alternating pattern, lets suppose Task A is 
first for whatever reason, does not really matter.
   We will get 10 tasks, 5 from A, 5 from B, run only one from B, and suppose, 
Task A falls because of mapped task count limit, and we only schedule 1 task.
   Next run, another task (or two) from Pool B is freed, the same thing happens 
again, we schedule only 1 or 2 tasks, and drop the other selected tasks.
   
   This issue can occur for any other concurrency limit such as 
`max_active_tis_per_dag`, or pool slot count.
   
   I just don't think I fully understood the goal of the PR, and so, if you 
coud clarify, it would be appreciated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to