GitHub user matrach added a comment to the discussion: Scheduler performance 
with large number of mapped task instances

Jarek, thanks for your reply! I'd be glad to contribute, especially as this 
issue seems to be the only one preventing us from adopting Airflow in our 
project. Is there more specific documentation of the changes planned for 
Airflow 3? (I couldn't find it right away in the Contributor's Guide.) Or 
should I just ask my questions in the chat?

The main limiting factor in the scheduler right now seems to be that it always 
iterates over all task instances in `SCHEDULEABLE_STATES`, whereas one would 
typically restrict such a list to the TIs whose scheduling decision could 
actually have changed. IIUC, the scheduler then checks the upstream 
dependencies of each schedulable task instance separately. However, in a 
typical performant scheduler, it is the completion of an upstream task that 
triggers evaluation of the dependencies of its downstream tasks (modulo race 
conditions). That's why I was surprised by the behavior of the mini-scheduler.
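
To make the idea concrete, here is a minimal sketch as a toy model, not 
Airflow code. The class `EventDrivenScheduler` and its methods are 
hypothetical names, and it assumes every task uses an "all upstream tasks 
succeeded" rule. Each task keeps a counter of unmet upstream dependencies, so 
a completion event only touches its direct downstream tasks:

```python
from collections import defaultdict


class EventDrivenScheduler:
    """Toy model (hypothetical, not an Airflow API).

    Assumes every task becomes ready once all of its upstream tasks succeed.
    """

    def __init__(self, tasks, edges):
        # downstream[t] lists the tasks that directly depend on t.
        self.downstream = defaultdict(list)
        # pending_upstreams[t] counts upstream tasks of t that have not succeeded yet.
        self.pending_upstreams = {task: 0 for task in tasks}
        for upstream, downstream in edges:
            self.downstream[upstream].append(downstream)
            self.pending_upstreams[downstream] += 1

    def initially_ready(self):
        """Tasks without any upstream dependency can be queued right away."""
        return sorted(t for t, n in self.pending_upstreams.items() if n == 0)

    def on_success(self, task):
        """Handle one completion event and return the tasks it unblocked.

        Only the direct downstream tasks of `task` are inspected; no other
        task instance is re-evaluated.
        """
        newly_ready = []
        for down in self.downstream[task]:
            self.pending_upstreams[down] -= 1
            if self.pending_upstreams[down] == 0:
                newly_ready.append(down)
        return newly_ready


sched = EventDrivenScheduler(
    tasks=["a", "b", "c", "d"],
    edges=[("a", "c"), ("b", "c"), ("c", "d")],
)
print(sched.initially_ready())  # tasks with no upstream dependencies
print(sched.on_success("a"))    # "c" still waits for "b"
print(sched.on_success("b"))    # now "c" is unblocked
```

The work per completion event is proportional to the number of direct 
downstream tasks, not to the total number of schedulable task instances. A 
real implementation would also have to handle failures, other trigger rules, 
and concurrent schedulers, which this sketch ignores.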

Do I understand correctly that, in the upcoming release, the scheduler still 
holds an in-memory representation of the task dependencies *and* receives 
status updates from the workers? That should be enough to implement the above 
idea, even with multiple schedulers.

Regarding the stream-ish jobs, did you mean something that could involve 
spawning a task instance for each event (for instance, from a Trigger)? 
Expanding a MappedOperator into a set of task instances that stays fixed for 
the whole DagRun feels quite limiting to me, but necessary in the current 
architecture. Are there any deeper reasons why task-instance dependencies are 
not materialized in the database? Intuitively, materializing them could make 
it easier to support arbitrary task-group nesting and streaming expansion over 
MappedOperators. Furthermore, it would improve data locality in the scheduler 
by pushing some of the dependency resolution into the database query.
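
As a rough illustration of what "materialized" could mean here, the sketch 
below stores one row per task-instance dependency edge and lets a single query 
return the runnable TIs. The schema (`ti`, `ti_dep`), the state values, and 
the helper names are all hypothetical and are not Airflow's actual tables; 
SQLite is used only to keep the example self-contained:

```python
import sqlite3


def make_db():
    """Build an in-memory database with a hypothetical TI and edge table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ti (
            id    TEXT PRIMARY KEY,
            state TEXT NOT NULL DEFAULT 'pending'
        );
        CREATE TABLE ti_dep (
            upstream   TEXT NOT NULL,
            downstream TEXT NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO ti (id) VALUES (?)",
        [("extract[0]",), ("extract[1]",), ("load[0]",), ("load[1]",), ("report",)],
    )
    # One row per edge between concrete task instances.
    conn.executemany(
        "INSERT INTO ti_dep (upstream, downstream) VALUES (?, ?)",
        [
            ("extract[0]", "load[0]"),
            ("extract[1]", "load[1]"),
            ("load[0]", "report"),
            ("load[1]", "report"),
        ],
    )
    return conn


def mark_success(conn, ti_id):
    conn.execute("UPDATE ti SET state = 'success' WHERE id = ?", (ti_id,))


def ready_task_instances(conn):
    """Return pending TIs that have no upstream TI in a non-success state."""
    rows = conn.execute(
        """
        SELECT t.id
        FROM ti AS t
        WHERE t.state = 'pending'
          AND NOT EXISTS (
              SELECT 1
              FROM ti_dep AS d
              JOIN ti AS up ON up.id = d.upstream
              WHERE d.downstream = t.id
                AND up.state <> 'success'
          )
        ORDER BY t.id
        """
    ).fetchall()
    return [row[0] for row in rows]


conn = make_db()
print(ready_task_instances(conn))  # only the TIs without upstream edges
mark_success(conn, "extract[0]")
print(ready_task_instances(conn))  # load[0] becomes runnable, load[1] does not
```

With edges stored per task instance, new TIs (for example, one per incoming 
event) could in principle be added by inserting rows, and the readiness check 
stays a database query instead of a Python loop over every schedulable TI.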




GitHub link: 
https://github.com/apache/airflow/discussions/46044#discussioncomment-11999192

