1fanwang commented on code in PR #66820:
URL: https://github.com/apache/airflow/pull/66820#discussion_r3232722270


##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -1776,6 +1776,15 @@ def _do_scheduling(self, session: Session) -> int:
             self._start_queued_dagruns(session)
             guard.commit()
 
+            # Clear DagRun objects loaded by phase 1 from the identity map so
+            # phase 2 reloads them fresh. Otherwise stale rows can be re-dirtied
+            # by flush/merge in _schedule_all_dag_runs and committed in a row-lock
+            # order that differs from what other scheduler replicas are taking
+            # for their own work, producing A-B / B-A deadlocks on dag_run and
+            # task_instance under HA scheduler deployments. See
+            # https://github.com/apache/airflow/issues/66817.
+            session.expunge_all()

Review Comment:
   Nice to meet you, Ephraim, and thanks for flagging this. I appreciate the
   directness.
   
   > I have seen a lot of PRs from you with self created issues
   
   Yes, that's accurate.
   
   > seems to step from guesses instead of issues you experienced
   
   Not quite. Most of my recent issues/PRs come from my direct experience
   running one of the largest sets of Airflow clusters out there (based on my
   discussions with folks at Airflow Summit 2025). I'm raising a bunch in a
   batch because we are actively planning the Airflow 2 → 3 migration, and
   this is a consolidation effort between our own 2.9.2 fork and 3.x.x. Hope
   this context helps.
   
   These issues and PRs come from running production Airflow at extremely large 
scale (20-30k+ DAGs per cluster, very high TI concurrency) plus actively 
planning the Airflow 2 → 3 migration. Some have hit us in production already; 
others come from defensive code review of the paths we'll lean on at cutover. 
The intent is to land the fixes upstream so the community benefits too, not 
just us.
   
   Beyond the PR/issue stream: I'm active on the dev list, gave a talk at
   Airflow Summit 2025, have accepted talks for Airflow Summit 2026 and
   ApacheCon Community Over Code Glasgow 2026, and an AIP-96 + AIP-97 refresh
   is heading to the list shortly. I'm aiming for sustained engagement with
   and contribution to the community. Hopefully that context helps :)
   
   On the technical analysis itself: some of the raw internal logs and traces 
can't be copy-pasted out due to company policy, but the pattern that works is 
to repro the issue end-to-end against the OSS code, capture before/after 
evidence, and share the result here. That's what we've already done together on 
several other PR threads (e.g., the deterministic FAILED → PASSED snippet in 
this PR body's regression test) — same plan here.
   
   Will follow up with what you asked for. To make sure I send the right
   thing, which of these would be most useful?
   
   - A sanitized scheduler log with the `1213 "Deadlock found"` (MySQL) /
     `deadlock detected` (PostgreSQL) traces against `dag_run` /
     `task_instance` UPDATEs?
   - A SQLAlchemy event-listener capture of the phase-2 commit set?
   - A `SHOW ENGINE INNODB STATUS` snapshot from a deadlock incident?
   - Or something else entirely?
   
   Whichever form you prefer, I'll put it together and follow up here.
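
   Whichever capture we use, the property to check is the same: do two
   schedulers touch the same rows in opposite order? Below is a minimal
   sketch of that check, assuming each capture has been reduced to an ordered
   list of (table, primary key) tuples. The function name, the tuple shape,
   and the sample rows are hypothetical and not part of Airflow.

```python
from itertools import combinations

# Hypothetical capture shape: (table name, primary key), in lock/UPDATE order.
Row = tuple[str, int]


def opposite_order_pairs(first: list[Row], second: list[Row]) -> list[tuple[Row, Row]]:
    """Return pairs of shared rows that the two captures touch in opposite order."""
    # Index of each row's first appearance in the second capture.
    pos: dict[Row, int] = {}
    for i, row in enumerate(second):
        pos.setdefault(row, i)

    # Rows of the first capture, deduplicated in order, restricted to shared rows.
    shared = [row for row in dict.fromkeys(first) if row in pos]

    # combinations() yields (a, b) with a before b in the first capture;
    # the pair conflicts when the second capture reaches b before a.
    return [(a, b) for a, b in combinations(shared, 2) if pos[a] > pos[b]]


# Illustrative data only, not taken from a real incident.
scheduler_a = [("dag_run", 1), ("task_instance", 7), ("dag_run", 2)]
scheduler_b = [("dag_run", 2), ("task_instance", 7), ("dag_run", 3)]

for a, b in opposite_order_pairs(scheduler_a, scheduler_b):
    print(f"opposite order: {a} vs {b}")
```

   Any pair it returns is a candidate A-B / B-A ordering worth matching
   against the database's own deadlock report. An empty list only means
   these two captures did not conflict, not that the code path is safe.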



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
