Pranaykarvi opened a new pull request, #64709:
URL: https://github.com/apache/airflow/pull/64709
Closes #64658
## Problem
Long-running `GenericTransfer` tasks (>3 hours) were being killed by
the scheduler because of a false positive in heartbeat-timeout (zombie)
detection.
During the paginated transfer loop, the operator performs long blocking
work (bulk inserts via `executemany`) without emitting any heartbeat to
the Airflow metadata DB. The scheduler's
`_find_and_purge_task_instances_without_heartbeats` routine in
`scheduler_job_runner.py` periodically checks `last_heartbeat_at`. If
that timestamp is older than `task_instance_heartbeat_timeout`, the
task is treated as a zombie and terminated, even though it is actively
processing data.
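
The failure mode can be illustrated with a minimal sketch of a staleness check. This is not the scheduler's actual code; `heartbeat_is_stale` and the 300-second timeout are hypothetical, chosen only to show why a task that is busy but silent looks identical to a dead one.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(
    last_heartbeat_at: datetime, now: datetime, timeout_seconds: int
) -> bool:
    """Hypothetical check: True when the last heartbeat is older than the timeout."""
    return now - last_heartbeat_at > timedelta(seconds=timeout_seconds)


# A task that started a long bulk insert 10 minutes ago and has not
# reported a heartbeat since then.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
last_beat = now - timedelta(minutes=10)

# With an illustrative 300-second timeout, the still-running task is flagged.
flagged = heartbeat_is_stale(last_beat, now, timeout_seconds=300)
```

The check only sees the timestamp, so the remedy is either to refresh the timestamp during the blocking work or to raise the timeout.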
This affects both:
- The paginated path (`execute_complete`, called per page when deferred)
- The non-paginated multi-SQL path (`execute`, which iterates over a list
  of SQL statements)
## Fix
- Added an `_emit_transfer_heartbeat()` helper that calls `ti.heartbeat()`
  or `ti.update_heartbeat()` (first match wins via `getattr`). It runs
  after each page in `execute_complete()` and after each SQL batch in
  `execute()`
- The helper is best-effort: it no-ops cleanly if neither method exists on
  the task instance (no regression for older runtimes)
- Added a docstring note on tuning the following config values for
  long-running transfers:
  - `[scheduler] task_instance_heartbeat_timeout`
  - `[celery_broker_transport_options] visibility_timeout`
  - `[scheduler] task_instance_heartbeat_sec`
- Added `test_heartbeat_called_during_paginated_transfer` to verify
heartbeat is called once per page during a paginated transfer
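
For reviewers, the best-effort lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `emit_transfer_heartbeat`, `FakeTaskInstance`, and `transfer_pages` are hypothetical names, and the stand-in task instance exists only to make the example runnable without Airflow.

```python
from typing import Any, Iterable, Sequence


def emit_transfer_heartbeat(ti: Any) -> bool:
    """Call the first heartbeat-like method found on ``ti``, if any.

    Returns True when a method was called and False when neither
    method exists (the no-op case).
    """
    for name in ("heartbeat", "update_heartbeat"):
        method = getattr(ti, name, None)
        if callable(method):
            method()
            return True
    return False


class FakeTaskInstance:
    """Hypothetical stand-in for a task instance; counts heartbeats."""

    def __init__(self) -> None:
        self.beats = 0

    def update_heartbeat(self) -> None:
        self.beats += 1


def transfer_pages(ti: Any, pages: Iterable[Sequence[Any]]) -> int:
    """Simulate a paginated transfer that emits one heartbeat per page."""
    inserted = 0
    for rows in pages:
        inserted += len(rows)  # placeholder for the real bulk insert
        emit_transfer_heartbeat(ti)
    return inserted


ti = FakeTaskInstance()
total = transfer_pages(ti, [[1, 2], [3], [4, 5, 6]])
```

After the three pages are processed, `ti.beats` equals the page count, which is the same property the new test asserts. Passing an object with neither method makes the helper return `False` without raising.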
## Testing
```bash
uv run --project providers/common/sql pytest \
providers/common/sql/tests/unit/common/sql/operators/test_generic_transfer.py \
-xvs
```
## Related Issues
- Closes #64658
- Related to #48719
- Related to #54479