GitHub user matrach added a comment to the discussion: Scheduler performance with large number of mapped task instances
Jarek, thanks for your reply!

I'd be glad to contribute, especially as this issue seems to be the only one preventing us from incorporating Airflow in our project. Is there any more specific documentation of the changes planned for Airflow 3? (I haven't found it right away in the Contributor's Guide.) Or would it be better to just ask my questions in the chat?

The main limiting factor in the scheduler right now seems to be that it always iterates over all task instances in `SCHEDULEABLE_STATES`, whereas one would typically constrain such a list to the TIs whose scheduling decision actually has a chance to change. IIUC, the scheduler then checks the upstream dependencies for each schedulable task instance separately. However, in a typical performant scheduler, it is the completion of an upstream task that triggers evaluation of the dependencies (modulo race conditions). That's why I was surprised by the behavior of the mini-scheduler.

Do I understand correctly that, in the upcoming release, the scheduler still holds an in-memory representation of the task dependencies *and* receives status updates from the workers? That should be enough to implement the above idea, even with multiple schedulers.

Regarding the stream-ish jobs, did you mean something that could involve spawning a task instance for each event (for instance, from a Trigger)? I felt that the expansion of a MappedOperator to a per-DagRun constant is quite limiting, but necessary in the current architecture.

Are there any deeper reasons why task-instance dependencies are not materialized in the database? Intuitively, that could make it easier to support arbitrary task-group nesting and streaming expansion over MappedOperators. Furthermore, it would improve data locality in the scheduler by pushing some of the dependency resolution into the database query.

GitHub link: https://github.com/apache/airflow/discussions/46044#discussioncomment-11999192
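
A minimal sketch of the completion-triggered evaluation described above, for illustration only. Nothing here is an Airflow API: `DependencyTracker`, `on_success`, and the task names are hypothetical, and the sketch assumes a single in-memory scheduler and an "all upstreams succeeded" rule. Each task keeps a counter of unfinished upstreams, and a completion event touches only the direct downstream tasks instead of rescanning every schedulable task instance.

```python
from collections import defaultdict


class DependencyTracker:
    """Toy in-memory dependency graph (hypothetical, not an Airflow class)."""

    def __init__(self, upstream_of):
        # upstream_of maps each task id to the set of task ids it waits on.
        self.remaining = {task: len(ups) for task, ups in upstream_of.items()}
        # Reverse index: for each task, the tasks that directly depend on it.
        self.downstream_of = defaultdict(list)
        for task, ups in upstream_of.items():
            for up in ups:
                self.downstream_of[up].append(task)

    def initially_ready(self):
        """Tasks with no upstream dependencies at all."""
        return sorted(t for t, n in self.remaining.items() if n == 0)

    def on_success(self, task):
        """Handle one completion event; return tasks that just became runnable."""
        newly_ready = []
        for child in self.downstream_of.get(task, []):
            self.remaining[child] -= 1
            if self.remaining[child] == 0:
                newly_ready.append(child)
        return newly_ready


# Hypothetical diamond-shaped graph used only for this example.
tracker = DependencyTracker({
    "extract": set(),
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
})
print(tracker.initially_ready())
print(tracker.on_success("extract"))
print(tracker.on_success("transform_a"))
print(tracker.on_success("transform_b"))
```

The work per event is proportional to the number of direct downstream tasks, not to the total number of waiting task instances. A real implementation would also have to handle failures, other trigger rules, and the race conditions between multiple schedulers mentioned above.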
