dirrao opened a new pull request, #35579:
URL: https://github.com/apache/airflow/pull/35579

   Description
   
   The scheduler has housekeeping work (adopt_or_reset_orphaned_tasks, 
check_trigger_timeouts, _emit_pool_metrics, _find_zombies, 
clear_not_launched_queued_tasks and _check_worker_pods_pending_timeout) that 
runs at a certain frequency. Right now, we don't have any latency metrics for 
this housekeeping work, even though it affects the scheduler heartbeat. It's a 
good idea to capture these latency metrics so that we can identify slow 
housekeeping calls and tune the Airflow configuration.
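
   As a rough illustration of the idea, the sketch below times a housekeeping 
call and records its latency. It is not the code in this pull request: the 
`latency_timer` and `timed` helpers, the in-memory `recorded_timings` sink, the 
metric name, and the stand-in function body are all hypothetical, and a real 
change would presumably emit to Airflow's own metrics backend instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from functools import wraps

# Hypothetical in-memory sink standing in for a real metrics backend.
recorded_timings = defaultdict(list)


@contextmanager
def latency_timer(metric_name):
    """Record the wall-clock duration of the enclosed block, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the latency even if the housekeeping call raises.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        recorded_timings[metric_name].append(elapsed_ms)


def timed(metric_name):
    """Decorator form of latency_timer for periodic housekeeping functions."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            with latency_timer(metric_name):
                return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical metric name; the body is a stand-in for the real scheduler work.
@timed("scheduler.housekeeping.adopt_or_reset_orphaned_tasks")
def adopt_or_reset_orphaned_tasks():
    time.sleep(0.01)
    return 0
```

   Recording in a `finally` block means a failing housekeeping run still 
produces a data point, which matters when the slow runs are also the ones that 
fail.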
   
   Use case/motivation
   
   Running Airflow at a large scale, we found that the 
adopt_or_reset_orphaned_tasks and clear_not_launched_queued_tasks functions 
could take a few minutes before the bug fix 
(https://github.com/apache/airflow/issues/34877). This delays the scheduler 
heartbeat and leads to the scheduler instance being restarted or killed. To 
detect these latency issues, we need metrics that capture how long these 
functions take.
   
   https://github.com/apache/airflow/issues/31957


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
