shahar1 opened a new pull request, #67750:
URL: https://github.com/apache/airflow/pull/67750
Replace `collections.deque` with `OrderedDict[DagFileInfo, None]` for
`DagFileProcessorManager._file_queue`.
## Problem
`x in deque` and `deque.remove(x)` are O(queue-size). The frontprio
priority path (`_queue_requested_files_for_parsing`), per-loop callback
adds (`_add_callback_to_queue`), and any re-add against a populated queue
are therefore O(N × Q) — quadratic in the steady-state file count.
The three affected paths:
- `mode="frontprio"`: `deque.remove(file)` per file — O(Q) each
- `mode="front"`: `f not in self._file_queue` per file — O(Q) each
- incremental drip: one file at a time via callback/priority adds
## Fix
Use `OrderedDict` as an ordered set (`{file: None}`). Membership is O(1),
push-front is `move_to_end(file, last=False)`, pop-front is
`popitem(last=False)`. All three modes collapse to O(1) per file.
Behavior is verified identical over 300 random operations against the old
deque semantics.
## Benchmark results
Measured on WSL2 / Python 3.12 with synthetic `DagFileInfo` objects
(no DB, no subprocess). Full benchmark scripts:
[gist: dag-processing
benchmarks](https://gist.github.com/shahar1/d2da5afe634034253e4e08fd84eff22a)
**File-queue ops (best-of-5, ms):**
Before:
| files | fill empty (linear) | front re-add | frontprio re-add |
incremental drip |
|------:|--------------------:|-------------:|-----------------:|-----------------:|
| 500 | 0.17 | 35.4 | 35.6 | 36.5
|
| 4000 | 0.31 | 2636.7 | 2537.9 | 2640.5
|
After:
| files | fill empty (linear) | front re-add | frontprio re-add |
incremental drip |
|------:|--------------------:|-------------:|-----------------:|-----------------:|
| 500 | 0.18 | 0.22 | 0.23 | 0.24
|
| 4000 | 0.31 | 2.94 | 3.82 | 3.05
|
The `ms/N²` column flips from ~142 (flat = quadratic) to ~0.1 (declining =
linear), confirming the complexity class change.
## Tests
All 116 `test_manager.py` tests pass. 19 tests required migrating
hardcoded `deque(...)` in setup/assertions to `OrderedDict.fromkeys(...)`.
---
##### Was generative AI tooling used to co-author this PR?
- [X] Yes — Claude Code (claude-sonnet-4-6)
Generated-by: Claude Code (claude-sonnet-4-6) following [the
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]