We are going to go round and round on the "I think I know what you mean, 
but..." carousel 😄

I suspect you are thinking in terms of Airflow Queues like the Celery Queue 
where you can direct tasks to a specific Celery worker.  I was using "callback 
queue" to refer to an internal data structure within the executor, not a 
user-facing Queue.  I could have easily called it a FIFO collection or list or 
whatever, and maybe I should have.

Internally, the executor has two collections:  one stores the callbacks in FIFO 
order, and one stores the prioritized tasks.  When a slot opens up, it takes 
the top of the callback list if any are available, then fills the remaining 
slots from the list of tasks.  From the user's point of view it's one 
prioritized queue where all callbacks get top priority.
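To make that concrete, here is a rough Python sketch of the idea, assuming 
executor parallelism is the only limit.  The names (`SlotFiller`, `fill_slots`) 
are mine for illustration and don't exist in the actual executor code:

```python
from collections import deque
import heapq


class SlotFiller:
    """Illustrative sketch of the two internal collections described above:
    callbacks in FIFO order, task instances in a priority heap, and open
    slots filled from the callback collection first."""

    def __init__(self, parallelism: int):
        self.parallelism = parallelism
        self.callbacks = deque()  # FIFO: deadline callbacks, arrival order
        self.tasks = []           # heap, ordered by priority_weight
        self.running = 0

    def add_callback(self, callback):
        self.callbacks.append(callback)

    def add_task(self, priority_weight: int, task):
        # heapq is a min-heap, so negate the weight to pop highest first
        heapq.heappush(self.tasks, (-priority_weight, task))

    def fill_slots(self):
        """Return the work items to launch for the currently open slots."""
        to_run = []
        open_slots = self.parallelism - self.running
        while open_slots > 0 and self.callbacks:
            to_run.append(self.callbacks.popleft())  # callbacks first, FIFO
            open_slots -= 1
        while open_slots > 0 and self.tasks:
            _, task = heapq.heappop(self.tasks)      # then tasks by weight
            to_run.append(task)
            open_slots -= 1
        self.running += len(to_run)
        return to_run
```

The only ordering rules are the two pops: callbacks leave in arrival order, 
and tasks leave in descending priority_weight order, so from the outside it 
behaves like one prioritized queue where every callback outranks every task.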

I suspect there is confusion because I've somewhat overloaded the term Queue.  
I'm not sure there is any configuration to update.

- ferruzzi
________________________________
From: Vikram Koka via dev <[email protected]>
Sent: Tuesday, March 3, 2026 8:54 AM
To: [email protected] <[email protected]>
Cc: Vikram Koka <[email protected]>
Subject: RE: [EXT] [AIP-86] Deadline Callback queuing and priority

Dennis,

Thank you for writing this up.

I completely missed this concept of “callback queue” as a parallel FIFO to
the task queue during the AIP discussions and in the AIP document itself.
So, this definitely helps clarify that part for me.

I understand the comment about honoring the executor-level parallelism config 
while not honoring the DAG-level parallelism configurations.

What I am not clear about is how this “callback queue” maps to the
existing queues already configured within Airflow deployments when
multiple queues are configured. I can guess, but would prefer a
definitive answer based on your thinking and implementation.
This then leads to the updated configuration needed to take advantage of
this feature.

Best regards,
Vikram

On Fri, Feb 27, 2026 at 3:27 PM Ferruzzi, Dennis <[email protected]>
wrote:

> I think a lot of this has been discussed in dev calls but maybe never got
> documented anywhere for folks who aren't/weren't on the call.  I haven't
> been very good at all about keeping the AIP up to date after it was
> approved, or keeping community decisions and discussions easily
> discoverable, and that's on me.  Now that folks can see the feature
> as-implemented, Amogh raised some good questions that are worth addressing
> here so the rationale is captured somewhere persistent. I also want to flag
> a couple of ideas that came out of that discussion as potential follow-up
> improvements.
>
> The plan has been that we treat a deadline callback as "finishing old
> work" and therefore all callbacks get priority over new tasks.  The core
> design principle here is that deadline callbacks need to be timely — if a
> deadline fires, the user expects to know about it promptly. A deadline
> callback that gets stuck behind the same bottleneck it's trying to alert
> you about isn't useful.
>
> To that end, what I implemented was effectively two queues:
> task_instances get queued by priority_weight (as they "always" have), and
> callbacks get queued in FIFO order.  When the scheduler looks for new work
> it will always fill slots from the callback queue first.  Parallelism is
> honored, but pools and max_active_tasks_per_dag are ignored.  The reason
> for that is twofold.  Let's say a user has a Dag with a deadline at
> 10 minutes.  There are two cases where that deadline can be triggered:
>
> Case 1) This Dag is still running at the deadline, the callback triggers -
> maybe a Slack message that the report is still being generated and may be
> delayed today - while the Dag still runs.
>
> In this case, `pools` and `tasks_per_dag` may make sense as the dag is
> still active, but I'd argue that in this case the deadline should "break
> through" that pool/task limit and execute regardless.  For example, if the
> issue is that the tasks stalled because the worker is hitting resource
> constraints then tacking the deadline callback on behind that roadblock and
> waiting for them to finish before alerting the user that there is a problem
> defeats the intent of the Deadline.  It is still bound by the
> executor-level parallelism so it's not allowed to just run rampant, but it
> isn't bound by the dag-level constraints.
>
> Case 2) The Dag failed at 2 minutes and the callback is triggered at the
> 2-minute point.  At this point there is no `pool` or `max_tasks` to
> consider.  The task instances should have released their slots and aren't
> being counted toward the active tasks.  Even if other Dags have started up
> and are claiming pool slots, this callback isn't part of that Dag and
> shouldn't be lumped with its tasks.
>
> A `max_executor_callbacks` setting that parallels `max_executor_tasks` is
> one idea that has come up.  It's not a bad idea; I guess if the whole
> building is burning down, you don't need to know that each floor is
> burning. It feels a bit against the intention of the callbacks getting
> prioritized over tasks, but if it's a user-defined option then that's on
> the user. It might be worth considering as a follow-up improvement if
> anyone thinks it's something we really need.
>
> The other decision that seems to be contentious is that callbacks are FIFO
> instead of implementing a `callback-priority-weight`.  FIFO seems fine for
> a callback queue and how I envision the feature being used, but maybe once
> the feature gets in the hands of users they'll find a need for it.  We as a
> community have been saying for a while that there are way too many
> user-facing knobs in the settings and this felt like a logical place for us
> to be opinionated in the code.  With FIFO, the feature's behavior is
> straightforward and predictable, and it's easier to add the weight later if
> there is demand than it would be to remove it later if we decide to prune
> the config options in a future version.
>
> For now, I think we're in a good place to ship with the current behavior
> and we can iterate based on real-world usage.  I'll cut some Issues to make
> sure the follow-up ideas from here and the PR are tracked so they don't get
> lost.  If anyone has concerns about the current behavior that should be
> addressed before launch, let me know.
>
>
>   *
> ferruzzi
>
