1fanwang commented on issue #66817:
URL: https://github.com/apache/airflow/issues/66817#issuecomment-4438398087

   A unit-level repro that demonstrates the leak deterministically: in 
`_do_scheduling`, capture `session.identity_map.keys()` at the boundary between 
phase 1 and phase 2. To do that, patch `DagRun.get_running_dag_runs_to_examine` 
so the first call records the keys and returns an empty result, which lets 
phase 2 exit immediately.
   
   Without `session.expunge_all()` between the phases, phase 1's `DagRun` and 
`TaskInstance` instances are still in the identity map when phase 2 starts:
   
   ```
   AssertionError: identity map leaked into phase 2: [
       (<class 'airflow.models.dagrun.DagRun'>, (UUID('...'),), None),
       (<class 'airflow.models.taskinstance.TaskInstance'>, (UUID('...'),), None),
   ]
   ```
   
   With the expunge in place, the identity map is empty when phase 2 starts. 
The two leaked entries are exactly the ones that `_schedule_all_dag_runs`' 
`session.merge(...)` call can re-dirty and pull into the final `guard.commit()`, 
which is the mechanism behind the cross-replica deadlocks reported on MySQL 
(`1213 "Deadlock found"`) and PostgreSQL (`deadlock detected`).
   
   PR with the fix plus the deterministic regression test: #66820.
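
For readers who want the shape of the capture without opening the PR, here is a minimal stdlib-only sketch of the patching technique described above. It is not the regression test from the PR and it does not use Airflow or SQLAlchemy: `ToySession`, `ToyDagRun`, `do_scheduling`, and `keys_at_phase_boundary` are hypothetical stand-ins, and the toy identity map is a plain dict keyed by `(kind, pk)` rather than SQLAlchemy's real identity keys. Only the idea is the same: patch the method that opens phase 2 so it snapshots the identity map and returns nothing.

```python
from unittest import mock


class ToySession:
    """Hypothetical stand-in for an ORM session: an identity map plus expunge_all()."""

    def __init__(self):
        self.identity_map = {}

    def load(self, kind, pk):
        # Loading an object registers it in the identity map under (kind, pk).
        self.identity_map[(kind, pk)] = object()

    def expunge_all(self):
        # Detach everything, leaving the identity map empty.
        self.identity_map.clear()


class ToyDagRun:
    """Hypothetical stand-in for the model whose query opens phase 2."""

    @classmethod
    def get_running_dag_runs_to_examine(cls, session):
        return []


def do_scheduling(session, expunge_between_phases):
    # Phase 1: loads a DagRun and a TaskInstance into the session.
    session.load("DagRun", 1)
    session.load("TaskInstance", 1)
    if expunge_between_phases:
        session.expunge_all()
    # Phase 2: begins by asking which dag runs to examine.
    return ToyDagRun.get_running_dag_runs_to_examine(session)


def keys_at_phase_boundary(expunge_between_phases):
    session = ToySession()
    captured = []

    def record_and_return_empty(session):
        # Snapshot the keys as phase 2 begins, then give phase 2 nothing to do.
        captured.append(sorted(session.identity_map.keys()))
        return []

    with mock.patch.object(
        ToyDagRun,
        "get_running_dag_runs_to_examine",
        side_effect=record_and_return_empty,
    ):
        do_scheduling(session, expunge_between_phases)
    return captured[0]


print(keys_at_phase_boundary(expunge_between_phases=False))
print(keys_at_phase_boundary(expunge_between_phases=True))
```

In this toy, the first call reports both phase 1 keys and the second reports an empty list, which mirrors the before and after behaviour described above. A real test would assert on the captured keys instead of printing them.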

