Re: [PR] Include the max_active_tasks limit in the query fetching TIs to be queued [airflow]

via GitHub Wed, 17 Sep 2025 15:43:56 -0700


Asquator commented on PR #54103:
URL: https://github.com/apache/airflow/pull/54103#issuecomment-3304768333


   Hey @xBis7,
   This is definitely a starvation problem: it causes the scheduler to queue 
fewer than `max_tis` tasks in every iteration. The starvation shows up in edge 
cases where you have so many tasks that this inefficiency becomes a complete 
hell: tasks are created faster than they can be queued, simply because the 
scheduler picks only 1-2 tasks in every cycle (priorities and similar sorting 
patterns may be involved).
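
   To make the starvation concrete, here is a minimal sketch in plain Python. 
It is not the scheduler's actual code: `Task`, `select_then_filter`, 
`filter_then_select`, and the slot numbers are all hypothetical, and a single 
per-DAG slot count stands in for whichever limit applies. It only shows why 
taking the top `max_tis` rows first and checking limits afterwards can leave 
most of the batch unused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    dag_id: str
    priority: int


def select_then_filter(candidates, max_tis, free_slots):
    """Take the top max_tis by priority first, then drop tasks over the limit."""
    window = sorted(candidates, key=lambda t: -t.priority)[:max_tis]
    remaining = dict(free_slots)
    queued = []
    for task in window:
        if remaining.get(task.dag_id, 0) > 0:
            remaining[task.dag_id] -= 1
            queued.append(task)
    return queued


def filter_then_select(candidates, max_tis, free_slots):
    """Check the limit while scanning; stop once max_tis tasks are queued."""
    remaining = dict(free_slots)
    queued = []
    for task in sorted(candidates, key=lambda t: -t.priority):
        if len(queued) == max_tis:
            break
        if remaining.get(task.dag_id, 0) > 0:
            remaining[task.dag_id] -= 1
            queued.append(task)
    return queued


# A busy DAG with many high-priority tasks but only 2 free slots, plus a
# quiet DAG with lower-priority tasks and plenty of free slots.
candidates = [Task("busy", 10) for _ in range(100)]
candidates += [Task("quiet", 1) for _ in range(100)]
free_slots = {"busy": 2, "quiet": 50}

starved = select_then_filter(candidates, max_tis=32, free_slots=free_slots)
full = filter_then_select(candidates, max_tis=32, free_slots=free_slots)
print(len(starved), len(full))  # 2 32
```

   In this toy setup the first strategy queues 2 tasks per cycle even though 
32 could run, because the whole batch is filled with tasks from the DAG that 
is already at its limit.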
   
   We originally experienced this issue with pools: 
https://github.com/apache/airflow/issues/45636
   It's conceptually the same problem you presented, but one that stems from 
the pool slots limit rather than `max_active_tasks`. We wanted to open a very 
similar PR to address our case, but soon realized that the problem is a bit 
wider and that our fix wouldn't have solved your problem, for instance.
   
   We started digging deeper and found out that starvation may come from all 
kinds of limits:
   1. pool slots (our case)
   2. `max_active_tasks` (your case)
   3. `max_active_tis_per_dag`
   4. `max_active_tis_per_dagrun`
   5. _executor slots_ (slightly different problem, local per scheduler)
   
   So we gave up the idea of solving one specific case with pools, simply 
because it would benefit only part of the users. We wanted a global solution. 
https://github.com/apache/airflow/pull/53492 was born to do the thing you've 
done here, but for ALL the limits, using window functions/lateral joins. It 
turned out in the end that applying this logic to multiple limits that are 
orthogonal to each other is kind of impossible in SQL. For instance, 
`max_active_tis_per_dagrun` is a sub-limit of `max_active_tis_per_dag` and 
`max_active_tasks`, but pool slots is a limit orthogonal to all the others, 
meaning there's no logical nesting between them. I won't bring up all the 
reasoning here, but we tried VERY hard to make it work. There are still cases 
where this logic fails, in addition to SQL being incapable of optimizing 
multiple window functions so that the query runs fast.
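
   As an illustration only, here is one way independent per-limit rankings 
can under-select when a per-DAG limit and a pool limit cut across each other. 
This is a toy model in plain Python, not the query from either PR and not 
necessarily one of the failing cases we hit: `row_numbers` just mimics 
`ROW_NUMBER() OVER (PARTITION BY ...)`, and the task names, limits, and helper 
functions are made up for the example.

```python
from collections import defaultdict

# (task_id, dag_id, pool), already sorted by priority (hypothetical data).
tasks = [("t1", "A", "P"), ("t2", "A", "P"), ("t3", "A", "Q")]
dag_limit = {"A": 2}           # free per-DAG slots
pool_limit = {"P": 1, "Q": 1}  # free pool slots


def row_numbers(rows, key):
    """Mimic ROW_NUMBER() OVER (PARTITION BY key ORDER BY priority)."""
    counters = defaultdict(int)
    ranks = {}
    for row in rows:
        counters[key(row)] += 1
        ranks[row[0]] = counters[key(row)]
    return ranks


def independent_ranks(rows):
    """Rank per DAG and per pool separately, keep rows within both limits."""
    dag_rank = row_numbers(rows, key=lambda r: r[1])
    pool_rank = row_numbers(rows, key=lambda r: r[2])
    return [
        task_id
        for task_id, dag_id, pool in rows
        if dag_rank[task_id] <= dag_limit[dag_id]
        and pool_rank[task_id] <= pool_limit[pool]
    ]


def sequential_scan(rows):
    """Walk the rows in order, consuming slots only for tasks actually picked."""
    dag_left = dict(dag_limit)
    pool_left = dict(pool_limit)
    picked = []
    for task_id, dag_id, pool in rows:
        if dag_left[dag_id] > 0 and pool_left[pool] > 0:
            dag_left[dag_id] -= 1
            pool_left[pool] -= 1
            picked.append(task_id)
    return picked


print(independent_ranks(tasks))  # ['t1']
print(sequential_scan(tasks))    # ['t1', 't3']
```

   Here `t2` takes the second per-DAG rank but is rejected by the pool limit, 
so `t3` is pushed past the DAG limit even though a slot is really free. With 
nested limits the rankings can be applied one inside the other; with 
orthogonal ones each ranking has no idea what the other rejected.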
   
   That is how https://github.com/apache/airflow/pull/55537 was born, and I'll 
announce it on the mailing list soon (I need some time to make it robust and 
ready for review). It seems to solve all of cases 1-4, but has some drawbacks 
that have to be addressed.
   
   Could you please kindly share the exact workloads used for benchmarks? I 
mean the source code.
   It would be great to agree on some "generic" workloads so we can test 
different proposals using the same DAGs.
   Numbers will be different due to hardware differences, but we only care 
about the ratios here.
   
   


