dirrao opened a new pull request, #35579: URL: https://github.com/apache/airflow/pull/35579
Description

The scheduler runs several housekeeping tasks (adopt_or_reset_orphaned_tasks, check_trigger_timeouts, _emit_pool_metrics, _find_zombies, clear_not_launched_queued_tasks and _check_worker_pods_pending_timeout) at a certain frequency. Right now, we don't have any latency metrics for this housekeeping work, even though it can affect the scheduler heartbeat. It's a good idea to capture these latency metrics so that slow tasks can be identified and the Airflow configuration tuned accordingly.

Use case/motivation

As we run Airflow at a large scale, we have found that the adopt_or_reset_orphaned_tasks and clear_not_launched_queued_tasks functions could take a few minutes prior to the bug fix (https://github.com/apache/airflow/issues/34877). This delays the scheduler heartbeat and leads to the scheduler instance being restarted or killed. To detect these latency issues, we need metrics that capture these latencies.

https://github.com/apache/airflow/issues/31957
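To illustrate the idea of timing each housekeeping task, here is a minimal sketch in plain Python. It is not the code from this pull request: the `record_latency` helper, the in-memory `recorded_timings` store, the `run_housekeeping` wrapper, and the metric name pattern are all hypothetical, and the housekeeping function is a placeholder that only sleeps. In Airflow itself the measurement would presumably be emitted through the project's own metrics interface rather than a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory stand-in for a metrics backend (not Airflow's API).
recorded_timings = defaultdict(list)


@contextmanager
def record_latency(metric_name):
    """Record the wall-clock duration of the wrapped block, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the wrapped block raises.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        recorded_timings[metric_name].append(elapsed_ms)


def adopt_or_reset_orphaned_tasks():
    """Placeholder for the real housekeeping task; sleeps to simulate work."""
    time.sleep(0.01)


def run_housekeeping(job, metric_prefix="scheduler.housekeeping"):
    """Run one housekeeping callable and time it under a hypothetical metric name."""
    metric_name = f"{metric_prefix}.{job.__name__}.duration"
    with record_latency(metric_name):
        return job()


run_housekeeping(adopt_or_reset_orphaned_tasks)
for name, samples in recorded_timings.items():
    print(name, [round(ms, 1) for ms in samples])
```

Wrapping each task at the call site keeps the timing logic out of the task bodies, and the `finally` clause means a task that fails still reports how long it ran before failing.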