1fanwang commented on code in PR #66773:
URL: https://github.com/apache/airflow/pull/66773#discussion_r3269906184


##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -2938,7 +2944,10 @@ def _find_task_instances_without_heartbeats(self, *, 
session: Session) -> list[T
                 .join(DM, TI.dag_id == DM.dag_id)
                 .where(
                     TI.state.in_((TaskInstanceState.RUNNING, 
TaskInstanceState.RESTARTING)),
-                    TI.last_heartbeat_at < limit_dttm,
+                    or_(
+                        TI.last_heartbeat_at < limit_dttm,

Review Comment:
   Circling back — you're right, @ephraimbuddy. Closing this; the live repro 
shows this isn't the right fix for #58307.
   
   Ran the repro on a freshly-migrated A3 sqlite deployment — `airflow db 
migrate`, a real DAG, `airflow standalone`. Triggered, two TIs reached running 
through `/run`. Stopped the scheduler, set `last_heartbeat_at = NULL` on both 
rows to simulate the post-migration state, restarted with DEBUG logging.
   
   ```
   19:38:30.413671 | info | Adopting or resetting orphaned tasks for active dag 
runs (scheduler_job_runner.py:2815)
   19:38:30.445094 | info | Reset the following 2 orphaned TaskInstances 
(scheduler_job_runner.py:2894)
   ```
   
   `adopt_or_reset_orphaned_tasks` rotates both into `task_instance_history` 
within 32 ms of startup; fresh rows come up with `last_heartbeat_at` populated 
by `/run` in 4 s. The cleanup query only ever sees the fresh rows, so the `OR 
last_heartbeat_at IS NULL` predicate never fires.
   
   The repro did surface something different in #58307. vgl-grin is on 
`heartbeat_sec=0`, `heartbeat_timeout=300`, containers respawning between runs. 
Their `last_heartbeat_at` isn't NULL; `/run` sets it. But the adopt path on 
each respawn looks like it can land a fresh `last_heartbeat_at`, which would 
restart the timeout clock indefinitely. That fits the asymmetry — `timeout=1` 
or `2` terminate, `300/4/5/6/10` don't. Different code path than this PR — 
leaving #58307 open.
   
   Thanks for the pushback.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to