diogosilva30 opened a new issue, #68077:
URL: https://github.com/apache/airflow/issues/68077
### Apache Airflow version
3.2.2
### What happened
After upgrading to Airflow 3.2.x with the `edge3` provider (3.7.0), all
`airflow_edge_worker_*` metrics disappeared from the metrics backend
(StatsD/Prometheus). Other Airflow metrics (`scheduler.*`, `dag_processing.*`,
`api_server.*`, pool metrics) kept working.
### What you think should happen instead
Edge Worker metrics (`edge_worker.connected`, `edge_worker.num_queues`,
`edge_worker.heartbeat_count`, `edge_worker.ti.*`, etc.) should be emitted as
before.
### Root cause
The Edge Worker REST API is served by the **API server**
(`/edge_worker/v1/...`). A worker heartbeat (`PATCH /edge_worker/v1/worker/`)
runs `set_state` → `set_metrics`, which records metrics through the **Task
SDK** `Stats` singleton (`airflow.sdk._shared.observability.metrics.stats`,
resolved by the Edge provider via `airflow.providers.common.compat`).
Every other long-running component initializes that singleton:
- core stats: `jobs/scheduler_job_runner.py`,
`jobs/triggerer_job_runner.py`, `dag_processing/manager.py`,
`executors/base_executor.py`
- SDK stats: `task-sdk/.../execution_time/task_runner.py`,
`task-sdk/.../serde/__init__.py`
…all call `stats.initialize(factory=stats_utils.get_stats_factory(),
export_legacy_names=...)`.
The **API server never calls `stats.initialize(...)`**. Before #63932
(*Remove the DualStatsManager and the Stats interfaces*), `Stats` lazily
auto-initialized its backend on first use. #63932 replaced that with explicit
`Stats.initialize(...)` + a PID guard, and added the explicit call to the
components above — but **not** to the API server. As a result the Task SDK
`Stats` singleton in the API server process stays a `NoStatsLogger`, and every
Edge Worker metric is silently dropped.
This also explains the asymmetry that `api_server.*` metrics still work:
they go through the separately-initialized **core** stats path, while the Edge
metrics use the uninitialized **SDK** path. Note edge3 3.7.0 already passes
`tags={"worker_name": ...}` to `Stats.gauge`, so the metrics were missing
entirely (not merely missing the `worker_name` tag) — once `Stats` is
initialized they flow through correctly tagged.
### Minimal reproduction
1. Airflow 3.2.x, `EdgeExecutor`, `edge3` 3.7.0, metrics enabled (`[metrics]
statsd_on = True`).
2. Start an edge worker so it heartbeats against the API server.
3. Scrape the metrics backend → no `edge_worker.*` series.
4. In the API server process: `from airflow.providers.common.compat.sdk
import Stats; type(Stats.instance)` → `NoStatsLogger`.
5. Manually run, in the same process:
```python
from airflow.sdk._shared.observability.metrics import stats
from airflow.sdk.observability.metrics import stats_utils
from airflow.configuration import conf
stats.initialize(
factory=stats_utils.get_stats_factory(),
export_legacy_names=conf.getboolean("metrics", "legacy_names_on"),
)
```
→ on the next heartbeat all `edge_worker.*` series appear (verified in a
live 3.2.2 deployment: 200+ series restored, correctly tagged with
`worker_name`).
### Proposed fix
Initialize the Task SDK `Stats` singleton in the API server's FastAPI
`lifespan` (runs once per worker, post-fork), mirroring the existing init in
`serde` / `task_runner`. PR to follow.
### Relationship to #67328
Not a duplicate, and on Airflow 3.2+ the two do not overlap. #67328
reintroduces a `DualStatsManager` code path in edge3's `set_metrics`, guarded
by `try: from airflow.sdk.observability.stats import DualStatsManager`.
`DualStatsManager` was removed in #63932, so on any Airflow that includes
#63932 (i.e. 3.2+) that import fails, `DualStatsManager` is `None`, and edge3
falls back to the existing `Stats.gauge(..., tags={"worker_name": ...})` path —
making #67328 effectively a **no-op on 3.2+**. It targets cross-version
provider compatibility with **Airflow < 3.2** (where `DualStatsManager` still
exists) so legacy dotted-name StatsD mappings keep working. Neither branch of
#67328 calls `Stats.initialize()`, so it does not address the 3.2+ init gap
described here.
### Operating System / Deployment
Kubernetes, Airflow 3.2.2, `edge3` provider 3.7.0, `statsd_exporter` v0.28.0.
### Are you willing to submit PR?
Yes — PR to follow.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]