james-seymour-cubiko opened a new issue, #29416:
URL: https://github.com/apache/airflow/issues/29416

   ### Description
   
   Optionally allow a task pool to count tasks in the 'deferred' state as 
occupying slots in that pool. I am not sure what the best way of implementing 
this is, but my current, very hacky solution is to patch the 
`airflow.models.pool.Pool.slots_stats` method so that it counts deferred tasks 
as running in each pool.
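   
   To make the requested behaviour concrete, here is a minimal, self-contained 
sketch of the slot accounting. It is a toy model and does not use or reproduce 
Airflow's code: `ToyPool`, `include_deferred` and `open_slots` are hypothetical 
names chosen for illustration, and the set of states counted as occupying a 
slot is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


# Toy model only: ToyPool, include_deferred and open_slots are hypothetical
# names for illustration and are not part of Airflow's API.
@dataclass
class ToyPool:
    name: str
    slots: int
    include_deferred: bool = False  # hypothetical opt-in flag

    def open_slots(self, task_states: list[str]) -> int:
        """Return the free slots given the states of tasks assigned to this pool."""
        counts = Counter(task_states)
        # Assumption: running and queued tasks always occupy a slot.
        occupied = counts["running"] + counts["queued"]
        if self.include_deferred:
            # Requested behaviour: deferred tasks keep holding their slot.
            occupied += counts["deferred"]
        return max(self.slots - occupied, 0)


states = ["running"] * 2 + ["deferred"] * 3
default_pool = ToyPool("db_queries", slots=5)
strict_pool = ToyPool("db_queries", slots=5, include_deferred=True)

print(default_pool.open_slots(states))  # deferred tasks ignored
print(strict_pool.open_slots(states))   # deferred tasks counted
```

   With the flag off, the three deferred tasks free their slots, so more tasks 
can start; with it on, the pool stays full until the deferred work finishes.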
   
   ### Use case/motivation
   
   The prototypical use case here is using Airflow to limit the number of 
concurrent queries executing against a database while keeping the benefit of 
waiting for those queries to complete on a triggerer (where a proxy is used to 
execute queries instead of a direct connection to the database).
   
   In our case, we use Airflow to orchestrate an Azure Data Factory that 
executes queries against a database and moves the resulting data. 
   
   We have an Airflow task trigger a single pipeline run in that data factory. 
The task then defers and waits for that pipeline run to complete in the 
triggerer (for efficiency) before continuing the DAG run. 
   
   However, we have ~100 tasks that all execute a pipeline run on the same 
factory - ideally we would execute all of these pipelines concurrently, but the 
database is quickly overwhelmed by that many queries at the same time, 
resulting in timeouts. Therefore, the next best option is to limit the 
concurrency of those queries with a task pool in Airflow.
   
   This _can_ currently be achieved with Airflow's task pools, but only if we 
keep each of those tasks in the running state while waiting for each query to 
complete (as deferred tasks do not occupy slots in the task pool). If we 
instead defer the tasks while waiting, we lose the concurrency limits of the 
pool, as all ~100 tasks are free to defer at the same time. So it's currently 
an either/or solution.
   
   I am aware that, in this specific case, ADF does support a maximum 
pipeline run concurrency setting, which is a much quicker way to solve this 
problem. However, we have other extraction tools that we can't rely on to 
limit concurrency in this way, and I thought I would throw this idea out here 
anyway in case others might find it helpful :)
   
   ### Related issues
   
   Somewhat related - https://github.com/apache/airflow/issues/15082
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

