1fanwang opened a new pull request, #66806: URL: https://github.com/apache/airflow/pull/66806
### Problem

The `KubernetesExecutor` calls `create_namespaced_pod`, `delete_namespaced_pod`, and `patch_namespaced_pod` against the API server on every task lifecycle event, but emits no metrics around those calls. When a cluster's control plane is slow, throttling (HTTP 429), or returning 5xx, the only signal today is scheduler log noise: there is no way to alert on latency drift or error-rate spikes without scraping logs.

### Fix

Wrap each of the three pod API call sites in `kubernetes_executor_utils.py` with `Stats.timer` for latency (`kubernetes_executor.pod_creation` / `pod_deletion` / `pod_patching`) and a paired `Stats.incr` tagged by status (`pod_creation_status` / `pod_deletion_status` / `pod_patching_status`). The counter is tagged `status="200"` on success and with the `ApiException.status` value on failure, so operators can chart per-status-code rates.

The 404-is-fine branch in `delete_pod` and the swallow-on-failure branches in the two patch methods still behave as before; the only change is that they emit a counter on the way out.

The three new timers and three new counters are registered in `shared/observability/src/airflow_shared/observability/metrics/metrics_template.yaml` so they pass the metrics-registry pre-commit hook and show up in the published metrics docs.

### Tests

New unit tests in `test_kubernetes_executor.py` mock the `Stats` module and assert that the timer and the tagged counter fire on both the success path and an `ApiException(status=429)` failure path for `delete_pod`.

Closes #66799
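
The timer-plus-tagged-counter pattern described under "Fix" could look roughly like the sketch below. It is not the code from this PR: `FakeStats` is a hypothetical in-memory stand-in for Airflow's `Stats`, the local `ApiException` class only mimics the `.status` attribute of the Kubernetes client exception, and the `kubernetes_executor.` prefix on the counter name and the string form of the status tag are assumptions.

```python
import time
from collections import Counter
from contextlib import contextmanager


class ApiException(Exception):
    """Stand-in for the Kubernetes client exception (assumption: only .status is used)."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


class FakeStats:
    """Hypothetical in-memory stand-in for Airflow's Stats; records what would be emitted."""

    def __init__(self):
        self.counters = Counter()
        self.timings = {}

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration whether the wrapped call succeeded or raised.
            self.timings.setdefault(name, []).append(time.perf_counter() - start)

    def incr(self, name, tags=None):
        key = (name, tuple(sorted((tags or {}).items())))
        self.counters[key] += 1


def delete_pod(api_call, stats):
    """Time the API call and emit one status-tagged counter per attempt."""
    counter = "kubernetes_executor.pod_deletion_status"  # assumed full metric name
    try:
        with stats.timer("kubernetes_executor.pod_deletion"):
            api_call()
    except ApiException as e:
        stats.incr(counter, tags={"status": str(e.status)})
        # 404 means the pod is already gone; any other status is re-raised.
        if e.status != 404:
            raise
    else:
        stats.incr(counter, tags={"status": "200"})
```

Emitting the counter in the `except` and `else` branches, rather than in a `finally`, keeps the existing control flow intact: the 404 case is still swallowed, other API errors still propagate, and exceptions that are not API errors produce no misleading `status="200"` sample.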
