shahar1 opened a new pull request, #67750:
URL: https://github.com/apache/airflow/pull/67750

   Replace `collections.deque` with `OrderedDict[DagFileInfo, None]` for
   `DagFileProcessorManager._file_queue`.
   
   ## Problem
   
   `x in deque` and `deque.remove(x)` are O(queue-size). The frontprio
   priority path (`_queue_requested_files_for_parsing`), per-loop callback
   adds (`_add_callback_to_queue`), and any re-add against a populated queue
   are therefore O(N × Q) — quadratic in the steady-state file count.
   
   The three affected paths:
   - `mode="frontprio"`: `deque.remove(file)` per file — O(Q) each
   - `mode="front"`: `f not in self._file_queue` per file — O(Q) each
   - incremental drip: one file at a time via callback/priority adds
   
   ## Fix
   
   Use `OrderedDict` as an ordered set (`{file: None}`). Membership is O(1),
   push-front is `move_to_end(file, last=False)`, pop-front is
   `popitem(last=False)`. All three modes collapse to O(1) per file.
   
   Behavior is verified identical over 300 random operations against the old
   deque semantics.
   
   ## Benchmark results
   
   Measured on WSL2 / Python 3.12 with synthetic `DagFileInfo` objects
   (no DB, no subprocess). Full benchmark scripts:
   [gist: dag-processing 
benchmarks](https://gist.github.com/shahar1/d2da5afe634034253e4e08fd84eff22a)
   
   **File-queue ops (best-of-5, ms):**
   
   Before:
   
   | files | fill empty (linear) | front re-add | frontprio re-add | 
incremental drip |
   
|------:|--------------------:|-------------:|-----------------:|-----------------:|
   |   500 | 0.17                | 35.4         | 35.6             | 36.5       
      |
   |  4000 | 0.31                | 2636.7       | 2537.9           | 2640.5     
      |
   
   After:
   
   | files | fill empty (linear) | front re-add | frontprio re-add | 
incremental drip |
   
|------:|--------------------:|-------------:|-----------------:|-----------------:|
   |   500 | 0.18                | 0.22         | 0.23             | 0.24       
      |
   |  4000 | 0.31                | 2.94         | 3.82             | 3.05       
      |
   
   The `ms/N²` column flips from ~142 (flat = quadratic) to ~0.1 (declining = 
linear), confirming the complexity class change.
   
   ## Tests
   
   All 116 `test_manager.py` tests pass. 19 tests required migrating
   hardcoded `deque(...)` in setup/assertions to `OrderedDict.fromkeys(...)`.
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes — Claude Code (claude-sonnet-4-6)
   
   Generated-by: Claude Code (claude-sonnet-4-6) following [the 
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to