1fanwang opened a new issue, #66799: URL: https://github.com/apache/airflow/issues/66799
### Description

`KubernetesExecutor`'s pod create / patch / delete calls in `providers/cncf/kubernetes/.../kubernetes_executor_utils.py` go straight to the k8s API client with no metric emission. When the upstream apiserver is slow (rate-limiting, etcd contention, network), operators see "scheduler stalling" but can't tell whether the bottleneck is the Airflow scheduler loop, the executor's queue, or the k8s API itself.

### Use case / motivation

Today, troubleshooting a slow `KubernetesExecutor` deployment requires correlating Airflow scheduler logs against the apiserver's own metrics, and even then you don't see per-status-code distributions (200 vs 429 vs 503) for each operation.

### Proposal

Three timer metrics + three status-code counters around the existing K8s API call sites:

| Metric | Type | Wraps |
|---|---|---|
| `executor.pod_creation` | timer | `_create_pod` (or equivalent create call) |
| `executor.pod_deletion` | timer | `delete_pod` |
| `executor.pod_patching` | timer | `patch_namespaced_pod` |
| `executor.pod_creation_status` | counter, tagged by status code | same call as `executor.pod_creation` |
| `executor.pod_deletion_status` | counter, tagged by status code | same call as `executor.pod_deletion` |
| `executor.pod_patching_status` | counter, tagged by status code | same call as `executor.pod_patching` |

All additive. No behavioral change. Provider PR.

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's Code of Conduct
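
For illustration only, the timer-plus-status-counter pairing proposed in this issue could look roughly like the sketch below. It is not the provider's code: `InMemoryStats`, `FakeApiException`, and `observe_k8s_call` are hypothetical names made up for this example, the in-memory sink stands in for whatever metrics client the provider would actually use, and recording `200` for any call that returns without raising is a simplifying assumption.

```python
import time
from collections import Counter
from contextlib import contextmanager


class FakeApiException(Exception):
    """Hypothetical stand-in for a k8s client error that carries an HTTP status."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


class InMemoryStats:
    """Hypothetical metrics sink used only to keep this sketch self-contained."""

    def __init__(self):
        self.timings = {}          # metric name -> list of durations in seconds
        self.counters = Counter()  # (metric name, status) -> count

    def timing(self, name, seconds):
        self.timings.setdefault(name, []).append(seconds)

    def incr(self, name, status):
        self.counters[(name, status)] += 1


@contextmanager
def observe_k8s_call(stats, operation):
    """Time one API call and count its outcome, tagged by status code."""
    start = time.monotonic()
    status = "unknown"  # kept if the call fails without an HTTP status (e.g. network error)
    try:
        yield
        status = 200  # assumption: any call that returns normally is recorded as 200
    except FakeApiException as exc:
        status = exc.status
        raise  # metrics are additive; the original error still propagates
    finally:
        stats.timing(f"executor.{operation}", time.monotonic() - start)
        stats.incr(f"executor.{operation}_status", status)


stats = InMemoryStats()

# A call that succeeds (placeholder for the real create call).
with observe_k8s_call(stats, "pod_creation"):
    pass

# A call that the apiserver rate-limits.
try:
    with observe_k8s_call(stats, "pod_creation"):
        raise FakeApiException(429)
except FakeApiException:
    pass
```

Emitting both metrics in the `finally` block means a failed call is still timed and counted, and re-raising keeps the executor's existing error handling unchanged, which matches the "no behavioral change" goal stated above.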
