hkc-8010 opened a new pull request, #68008:
URL: https://github.com/apache/airflow/pull/68008

   closes: #42553
   
   ## Description
   
   When the scheduler detects a running task instance that has stopped 
heartbeating, it currently builds and enqueues a `TaskCallbackRequest`, but the 
task instance can remain in `RUNNING` until the callback is processed 
asynchronously.
   
   That leaves a window where the next scheduler heartbeat-timeout scan can 
find the same task instance attempt again and enqueue another callback for the 
same try.
   
   This changes the heartbeat-timeout purge path to enqueue the existing 
callback request, then immediately run the established task-instance failure 
transition path so the task instance leaves `RUNNING` in the same scheduler 
pass.
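
The race described above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions, not the scheduler's actual code: `ToyTaskInstance`, `ToyScheduler`, `purge_timed_out`, and the `fail_immediately` flag are hypothetical names invented for illustration, and the real failure transition path may do more than set a single state (for example, handle retries).

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    RUNNING = "running"
    FAILED = "failed"


@dataclass
class ToyTaskInstance:
    # Hypothetical stand-in for a task instance; not Airflow's model.
    task_id: str
    try_number: int
    last_heartbeat: float
    state: State = State.RUNNING


@dataclass
class ToyScheduler:
    # Hypothetical stand-in for the heartbeat-timeout purge path.
    timeout: float
    fail_immediately: bool
    callbacks: list = field(default_factory=list)

    def purge_timed_out(self, tis, now):
        for ti in tis:
            stale = now - ti.last_heartbeat > self.timeout
            if ti.state is State.RUNNING and stale:
                # Enqueue the callback request for this try.
                self.callbacks.append((ti.task_id, ti.try_number))
                if self.fail_immediately:
                    # Leave RUNNING in the same pass, so the next scan
                    # no longer matches this task instance attempt.
                    ti.state = State.FAILED


def callbacks_after_two_scans(fail_immediately):
    ti = ToyTaskInstance("t1", try_number=1, last_heartbeat=0.0)
    scheduler = ToyScheduler(timeout=300.0, fail_immediately=fail_immediately)
    scheduler.purge_timed_out([ti], now=400.0)  # first scan
    scheduler.purge_timed_out([ti], now=410.0)  # next scan, callback not yet processed
    return scheduler.callbacks
```

In this model, two scans with `fail_immediately=False` record two callbacks for the same try, while `fail_immediately=True` records one, which is the convergence the new test name refers to.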
   
   Executor cleanup via `executor.change_state(..., FAILED, 
remove_running=True)` is unchanged.
   
   Related: #66767 handles callback type metadata for heartbeat-timed-out 
retries. This PR is scoped to preventing duplicate same-try callback emission.
   
   ## Tests
   
   - `uv run --frozen --no-sync pytest --with-db-init airflow-core/tests/unit/jobs/test_scheduler_job.py::TestSchedulerJob::test_heartbeat_timeout_converges_ti_state_before_next_scan -q`
     - Failed before the fix as expected.
   - `uv run --frozen --no-sync pytest airflow-core/tests/unit/jobs/test_scheduler_job.py::TestSchedulerJob::test_heartbeat_timeout_converges_ti_state_before_next_scan -q`
   - `uv run --frozen --no-sync pytest airflow-core/tests/unit/jobs/test_scheduler_job.py -k "heartbeat_timeout" -q`
   - `uv run --frozen --no-sync pytest airflow-core/tests/unit/jobs/test_scheduler_job.py -q`
   - `prek run --files airflow-core/src/airflow/jobs/scheduler_job_runner.py airflow-core/tests/unit/jobs/test_scheduler_job.py`
   - `.venv/bin/python scripts/ci/prek/run_mypy_full_dist_local_venv_or_breeze_in_ci.py airflow-core`
   - `breeze testing core-tests --backend postgres --python 3.10 --db-reset -- airflow-core/tests/unit/jobs/test_scheduler_job.py -k "heartbeat_timeout" -q`
   - `breeze testing core-tests --backend sqlite --python 3.10 --force-lowest-dependencies --db-reset -- airflow-core/tests/unit/jobs/test_scheduler_job.py -k "heartbeat_timeout" -q`

