hkc-8010 opened a new pull request, #68008:
URL: https://github.com/apache/airflow/pull/68008
closes: #42553
## Description
When the scheduler detects a running task instance that has stopped
heartbeating, it currently builds and enqueues a `TaskCallbackRequest`, but the
task instance can remain in `RUNNING` until the callback is processed
asynchronously.
That leaves a window where the next scheduler heartbeat-timeout scan can
find the same task instance attempt again and enqueue another callback for the
same try.
This changes the heartbeat-timeout purge path to enqueue the existing
callback request, then immediately run the established task-instance failure
transition path so the task instance leaves `RUNNING` in the same scheduler
pass.
Executor cleanup via `executor.change_state(..., FAILED,
remove_running=True)` is unchanged.
Related: #66767 handles callback type metadata for heartbeat-timed-out
retries. This PR is scoped to preventing duplicate same-try callback emission.
## Tests
- `uv run --frozen --no-sync pytest --with-db-init
airflow-core/tests/unit/jobs/test_scheduler_job.py::TestSchedulerJob::test_heartbeat_timeout_converges_ti_state_before_next_scan
-q`
- Failed before the fix as expected.
- `uv run --frozen --no-sync pytest
airflow-core/tests/unit/jobs/test_scheduler_job.py::TestSchedulerJob::test_heartbeat_timeout_converges_ti_state_before_next_scan
-q`
- `uv run --frozen --no-sync pytest
airflow-core/tests/unit/jobs/test_scheduler_job.py -k "heartbeat_timeout" -q`
- `uv run --frozen --no-sync pytest
airflow-core/tests/unit/jobs/test_scheduler_job.py -q`
- `prek run --files airflow-core/src/airflow/jobs/scheduler_job_runner.py
airflow-core/tests/unit/jobs/test_scheduler_job.py`
- `.venv/bin/python
scripts/ci/prek/run_mypy_full_dist_local_venv_or_breeze_in_ci.py airflow-core`
- `breeze testing core-tests --backend postgres --python 3.10 --db-reset --
airflow-core/tests/unit/jobs/test_scheduler_job.py -k "heartbeat_timeout" -q`
- `breeze testing core-tests --backend sqlite --python 3.10
--force-lowest-dependencies --db-reset --
airflow-core/tests/unit/jobs/test_scheduler_job.py -k "heartbeat_timeout" -q`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]