1fanwang opened a new pull request, #66773:
URL: https://github.com/apache/airflow/pull/66773

   Closes #58307.
   
   `_find_task_instances_without_heartbeats` filters with 
`TI.last_heartbeat_at < limit_dttm`. In SQL three-valued logic, that predicate 
evaluates to `NULL` (not `TRUE`) when the row's `last_heartbeat_at` is `NULL`, 
so the row is never returned and the TI never gets purged.
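
The three-valued-logic behaviour can be reproduced outside Airflow. The snippet below is a minimal sketch using an in-memory SQLite database; the `ti` table, its rows, and the timestamps are made up for illustration and are not Airflow's schema.

```python
import sqlite3

# Hypothetical stand-in for the task_instance table (not Airflow's real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ti (id TEXT, last_heartbeat_at TEXT)")
conn.executemany(
    "INSERT INTO ti VALUES (?, ?)",
    [
        ("stale", "2024-01-01 00:00:00"),
        ("fresh", "2024-01-01 00:09:00"),
        ("never_heartbeated", None),  # stored as SQL NULL
    ],
)

limit_dttm = "2024-01-01 00:05:00"

# Same shape as the existing filter: last_heartbeat_at < limit_dttm.
rows = conn.execute(
    "SELECT id FROM ti WHERE last_heartbeat_at < ? ORDER BY id", (limit_dttm,)
).fetchall()
matched = [row[0] for row in rows]

# Comparing NULL with anything yields NULL, which WHERE treats as "not true".
null_cmp = conn.execute("SELECT NULL < ?", (limit_dttm,)).fetchone()[0]

print(matched)   # only the row with a non-NULL, old heartbeat
print(null_cmp)  # None
```

The `never_heartbeated` row is skipped even though it is the one most in need of cleanup.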
   
   `last_heartbeat_at IS NULL` is a real state: every TI is in it briefly, 
between the QUEUED→RUNNING transition and the worker's first heartbeat. If a 
worker crashes inside that window (OOM kill, K8s eviction during pod start, 
network blip during init), the TI stays RUNNING forever. The scheduler already knows 
about this gap: `adopt_or_reset_orphaned_tasks` falls back to `utcnow()` on the 
migration path when `last_heartbeat_at IS NULL` 
(`scheduler_job_runner.py:2855`), but the heartbeat-cleanup path doesn't have a 
matching fallback.
   
   This PR extends the predicate to use `start_date` when `last_heartbeat_at IS 
NULL`. A TI that started long enough ago to be past the heartbeat-timeout, and 
has still never reported a heartbeat, is the exact stuck-forever case the 
cleanup is meant to handle.
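
One way to express the extended predicate is sketched below in plain SQL against an in-memory SQLite database. This is an illustration of the idea, not the PR's actual diff: the `ti` table, the helper name `find_without_heartbeats`, and the sample rows are assumptions, and the real query is built with SQLAlchemy against Airflow's `TaskInstance` model.

```python
import sqlite3

# Hypothetical stand-in for the task_instance table (not Airflow's real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ti (id TEXT, start_date TEXT, last_heartbeat_at TEXT)")
conn.executemany(
    "INSERT INTO ti VALUES (?, ?, ?)",
    [
        ("stale_heartbeat", "2024-01-01 00:00:00", "2024-01-01 00:01:00"),
        ("fresh_heartbeat", "2024-01-01 00:00:00", "2024-01-01 00:09:00"),
        ("null_old_start", "2024-01-01 00:00:00", None),
        ("null_fresh_start", "2024-01-01 00:08:00", None),
    ],
)


def find_without_heartbeats(conn, limit_dttm):
    """Return ids whose heartbeat is too old, or missing with an old start_date."""
    query = """
        SELECT id FROM ti
        WHERE last_heartbeat_at < :limit
           OR (last_heartbeat_at IS NULL AND start_date < :limit)
        ORDER BY id
    """
    return [row[0] for row in conn.execute(query, {"limit": limit_dttm})]


purge_candidates = find_without_heartbeats(conn, "2024-01-01 00:05:00")
print(purge_candidates)
```

With this predicate the row that never heartbeated but started before the cutoff is selected, while the one that started inside the timeout window is left alone.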
   
   ## Tests
   
   Two new cases in `tests/unit/jobs/test_scheduler_job.py`:
   
   - `test_find_and_purge_task_instances_without_heartbeats_null_last_heartbeat`: 
NULL `last_heartbeat_at` with an old `start_date` is now caught by the query 
and purged. Fails on `main`, passes with this PR.
   - `test_find_and_purge_task_instances_without_heartbeats_null_last_heartbeat_fresh_start`: 
NULL `last_heartbeat_at` with a fresh `start_date` (inside the timeout 
window) is still left alone. Guards against killing newly-started tasks that 
haven't had a chance to report their first heartbeat yet.
   
   The other 11 heartbeat-related tests in the file continue to pass.

