1fanwang opened a new issue, #66803:
URL: https://github.com/apache/airflow/issues/66803

   ### Description
   
   `SchedulerJobRunner._run_scheduler_loop` calls `executor.heartbeat()` + 
`_process_executor_events()` on every loop iteration. Their wall time is the 
dominant variable cost of the scheduler loop, but it's not measured anywhere. 
When the executor heartbeat slows down (k8s api throttle, celery broker lag, 
etc.), the scheduler loop slows down — and operators have no signal pointing at 
the executor as the cause.
   
   ### Use case / motivation
   
   Diagnose slow scheduler loops. The current `dag_processing.loop_duration` is 
aggregate and doesn't tell you where the time went.
   
   ### Proposal
   
   Wrap the `executor.heartbeat()` call in 
`Stats.timer("scheduler.executor_heartbeat_duration")`. Optionally include 
`_process_executor_events` in the same timer, or split them — open to either.
   
   For 3.x's multi-executor support, tag the metric by executor class name so 
each executor instance shows up distinctly.
   
   ### Are you willing to submit a PR?
   
   - [X] Yes
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's Code of Conduct
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to