1fanwang opened a new pull request, #66808:
URL: https://github.com/apache/airflow/pull/66808
### Problem
`dag_processing.loop_duration` is aggregate. When the scheduler loop slows
down, there's no signal pointing at the executor as the cause —
`executor.heartbeat()` runs every iteration and can spike from Kubernetes API
throttling, Celery broker lag, etc., but its wall time is folded into the loop
total.
### Fix
Wrap `executor.heartbeat()` in
`stats.timer("scheduler.executor_heartbeat_duration")`, tagged with `executor:
<ClassName>` so each configured executor reports independently in
multi-executor deployments.
### Tests
`test_executor_heartbeat_emits_timer` patches `stats.timer`, runs one
scheduler loop iteration with the two-executor `mock_executors` fixture, and
asserts the timer was opened once per executor with the right metric name and
`tags={"executor": <class>}`.
### Note
I scoped this to `executor.heartbeat()` only, not
`_process_executor_events`. The two calls are structurally separate — they live
in different loops with a fresh `create_session()` block between them — and
their cost characteristics differ (heartbeat is executor I/O;
`_process_executor_events` is DB writes plus event-buffer drain). They look
like two independent signals rather than a tight pair, so a separate timer for
`_process_executor_events` makes sense as a follow-up if the same need surfaces
there. Happy to roll it into this PR if reviewers would prefer.
Closes #66803
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]