[LAZY CONSENSUS] Evaluate DAG.max_active_tasks at scope of dag run

2024-10-04 Thread Daniel Standish
Following the admittedly brief discussion here , I am calling for a very lazy vote of the following proposal: DAG.max_active_tasks should be evaluated per-dag-run The status quo is that it is evaluated across all running dag runs.

Re: [DISCUSS] Remove `max_active_tasks_per_dag`? Or at least the default

2024-10-04 Thread Jarek Potiuk
So, I propose, DAG.max_active_tasks should be evaluated per-dag-run. And we can change the name accordingly if folks on board. Agree. And possibly we name it DAG.max_active_tasks_per_run J. On Fri, Oct 4, 2024 at 3:30 PM Ryan Hatter wrote: > I think I agree with this: > > I feel it should be

Re: [PROPOSAL] Add streaming support to PartialOperator

2024-10-04 Thread Jarek Potiuk
>From the earlier discussions with David - this is also (and mainly) about optimisation. Those operators do very little, and when you add total overhead that Airflow adds for scheduling and running every task, then it turns out that looping such operator's execute in a single interpreter is many, m

Re: [DISCUSS] Remove `max_active_tasks_per_dag`? Or at least the default

2024-10-04 Thread Ryan Hatter
I think I agree with this: I feel it should be applied at the dag *run* scope > and not across all dag runs. > Just a thought: If someone *did* want to run multiple DAG runs at the same time and limit the max active tasks per DAG, they could create a pool for that DAG and pass the pool in default

Re: [DISCUSS] Remove `max_active_tasks_per_dag`? Or at least the default

2024-10-04 Thread Daniel Standish
Ok, sorry, these concurrency settings are confusing. Let me clarify. `max_active_tasks_per_dag` is a core airflow setting and it provides the default for DAG.max_active_tasks. DAG.max_active_tasks I think is a reasonable config to have but the problem in my view is the scope. I feel it should b

[DISCUSS] Remove `max_active_tasks_per_dag`? Or at least the default

2024-10-04 Thread Daniel Standish
The setting max_active_tasks_per_dag seems mostly useless to me / and footgunish. Why? Because you already have a setting for max active dag runs. If you don't want to run more tasks, don't create the extra dag runs. We also already have a mechanism (param on base operator) for limiting indivi

Re: [PROPOSAL] Add streaming support to PartialOperator

2024-10-04 Thread Daniel Standish
Well, it looks like we do have concurrency control for mapped tasks after all. See max_active_tis_per_dagrun which was added in https://github.com/apache/airflow/pull/29094. So this would allow you to map over your 3000 users in a single run, but process only one at a time (or 5 or 10 at a time).

Re: [DISCUSS] Backfill concurrency re AIP-78

2024-10-04 Thread Daniel Standish
Ok want to focus the discussion on one question. Proposal: DAG.max_active_runs will apply to non-backfill dag runs and will *not* apply to backfill dag runs. Backfill.max_active_runs is a required param for backfills and is evaluated entirely separately from DAG.max_active_runs. In other words, DA

Re: [DISCUSS] Backfill concurrency re AIP-78

2024-10-04 Thread Daniel Standish
Thanks Jarek. Let me respond to your comments. Then I'll follow up with another message trying to focus the discussion on my question. > I think it should be controllable at the moment when you start backfill. Current state is, when you create a backfill you set max_active_runs for it. And t

Re: [PROPOSAL] Add streaming support to PartialOperator

2024-10-04 Thread Daniel Standish
One thing, it would have to be 3.0 since no new features are going into 2.x anymore AFAIK. Do I understand correctly that essentially what you want to be able to do is limit parallelism in mapped task? E.g. is it correct that you essentially want to do task mapping, but with parallelism=1? Would

RE: [PROPOSAL] Add streaming support to PartialOperator

2024-10-04 Thread Blain David
We cleary want to position ourselves as the second group. Not that we don't write python code, as a fact we do, but once we see something could be solved in a generic way and be contributed back to Airflow, we will also propose that. I understand the use of hooks principle, we also do it, when