1fanwang commented on code in PR #66773:
URL: https://github.com/apache/airflow/pull/66773#discussion_r3269906184


##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -2938,7 +2944,10 @@ def _find_task_instances_without_heartbeats(self, *, 
session: Session) -> list[T
                 .join(DM, TI.dag_id == DM.dag_id)
                 .where(
                     TI.state.in_((TaskInstanceState.RUNNING, 
TaskInstanceState.RESTARTING)),
-                    TI.last_heartbeat_at < limit_dttm,
+                    or_(
+                        TI.last_heartbeat_at < limit_dttm,

Review Comment:
   Circling back on this, Actually you are right @ephraimbuddy, closing this PR 
since this is the incorrect RCA / fix for #58307
   
   Ran the live repro on a freshly-migrated A3 sqlite deployment — `airflow db 
migrate`, a real DAG, `airflow standalone`. Triggered, two TIs reached running 
through `/run`. Stopped the scheduler, set `last_heartbeat_at = NULL` on both 
rows to simulate the post-migration state, restarted with DEBUG logging.
   
   ```
   19:38:30.413671 | info | Adopting or resetting orphaned tasks for active dag 
runs (scheduler_job_runner.py:2815)
   19:38:30.445094 | info | Reset the following 2 orphaned TaskInstances 
(scheduler_job_runner.py:2894)
   ```
   
   `adopt_or_reset_orphaned_tasks` rotates both into `task_instance_history` 
within 32 ms of startup; fresh rows come up with `last_heartbeat_at` populated 
by `/run` in 4 s. The cleanup query only ever sees the fresh rows, so the `OR 
last_heartbeat_at IS NULL` predicate never fires.
   
   The repro did surface something different in #58307. vgl-grin is on 
`heartbeat_sec=0`, `heartbeat_timeout=300`, containers respawning between runs. 
Their `last_heartbeat_at` isn't NULL; `/run` sets it. But the adopt path on 
each respawn looks like it can land a fresh `last_heartbeat_at`, which would 
restart the timeout clock indefinitely. That fits the asymmetry — `timeout=1` 
or `2` terminate, `300/4/5/6/10` don't. Different code path than this PR — 
leaving #58307 open.
   
   Thanks for pushing back on this - I'll be more thorough and careful on the 
repro and RCA investigation in the future.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to