1fanwang commented on code in PR #66820:
URL: https://github.com/apache/airflow/pull/66820#discussion_r3232722270


##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -1776,6 +1776,15 @@ def _do_scheduling(self, session: Session) -> int:
             self._start_queued_dagruns(session)
             guard.commit()
 
+            # Clear DagRun objects loaded by phase 1 from the identity map so
+            # phase 2 reloads them fresh. Otherwise stale rows can be 
re-dirtied
+            # by flush/merge in _schedule_all_dag_runs and committed in a 
row-lock
+            # order that differs from what other scheduler replicas are taking
+            # for their own work, producing A-B / B-A deadlocks on dag_run and
+            # task_instance under HA scheduler deployments. See
+            # https://github.com/apache/airflow/issues/66817.
+            session.expunge_all()

Review Comment:
   Fair concern, and a reasonable one to raise directly — thanks. I don't have 
a clean production trace I can point to for this one, which is exactly the gap 
you're flagging. So rather than argue it from analysis, I went and built a live 
multi-scheduler repro — and it turned out I couldn't reproduce the root cause 
this PR assumes. Writing that up in the closing comment and stepping back on 
it. Appreciate you pushing on it.



##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -1776,6 +1776,15 @@ def _do_scheduling(self, session: Session) -> int:
             self._start_queued_dagruns(session)
             guard.commit()
 
+            # Clear DagRun objects loaded by phase 1 from the identity map so
+            # phase 2 reloads them fresh. Otherwise stale rows can be 
re-dirtied
+            # by flush/merge in _schedule_all_dag_runs and committed in a 
row-lock
+            # order that differs from what other scheduler replicas are taking
+            # for their own work, producing A-B / B-A deadlocks on dag_run and
+            # task_instance under HA scheduler deployments. See
+            # https://github.com/apache/airflow/issues/66817.
+            session.expunge_all()

Review Comment:
   Replacing the synthetic two-thread probe I'd put here: it only demonstrated 
the *shape* of an InnoDB deadlock, not that this code path produces one, so it 
wasn't real evidence.
   
   I since ran a proper live repro — 2–3 real `airflow scheduler` processes 
against MySQL 8.0 on `main` under heavy catchup load — and the `dag_run` / 
`task_instance` deadlock this targets didn't reproduce (both phases already 
fetch with `FOR UPDATE SKIP LOCKED`, and the before/after SQL is identical). 
Details in the closing comment. Stepping back to benchmark before pursuing this 
again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to