diogosilva30 opened a new issue, #68077:
URL: https://github.com/apache/airflow/issues/68077
### Apache Airflow version
3.2.2 (also affects `main`).
### What happened
After upgrading to Airflow 3.2.x with the `edge3` provider (3.7.0), all
`airflow_edge_worker_*` metrics disappeared from the metrics backend
(StatsD/Prometheus). Other Airflow metrics (`scheduler.*`, `dag_processing.*`,
`api_server.*`, pool metrics) kept working.
### What you think should happen instead
Edge Worker metrics (`edge_worker.connected`, `edge_worker.num_queues`,
`edge_worker.heartbeat_count`, `edge_worker.ti.*`, etc.) should be emitted as
before.
### Root cause
The Edge Worker REST API is served by the **API server**
(`/edge_worker/v1/...`). A worker heartbeat (`PATCH
/edge_worker/v1/worker/<name>`) runs `set_state` → `set_metrics`, which records
metrics through the **Task SDK** `Stats` singleton
(`airflow.sdk._shared.observability.metrics.stats`, resolved by the Edge
provider via `airflow.providers.common.compat`).
Every other long-running component initializes its stats singleton explicitly:
- core stats: `jobs/scheduler_job_runner.py`,
`jobs/triggerer_job_runner.py`, `dag_processing/manager.py`,
`executors/base_executor.py`
- SDK stats: `task-sdk/.../execution_time/task_runner.py`,
`task-sdk/.../serde/__init__.py`
…all call `stats.initialize(factory=stats_utils.get_stats_factory(),
export_legacy_names=...)`.
The **API server never calls `stats.initialize(...)`**. Before #63932
(*Remove the DualStatsManager and the Stats interfaces*), `Stats` lazily
auto-initialized its backend on first use. #63932 replaced that with explicit
`Stats.initialize(...)` + a PID guard, and added the explicit call to the
components above — but **not** to the API server. As a result the Task SDK
`Stats` singleton in the API server process stays a `NoStatsLogger`, and every
Edge Worker metric is silently dropped.
This also explains why `api_server.*` metrics still work: they go through the
separately initialized **core** stats path, while the Edge metrics use the
uninitialized **SDK** path.
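To make the failure mode concrete, here is a minimal, self-contained model of it. It is only a sketch: `StatsSingleton`, `RecordingLogger`, the `api_server.example` metric name and the `w1` worker name are hypothetical stand-ins, not Airflow's actual classes. It assumes nothing beyond what is described above: a no-op default backend, an explicit `initialize(...)` with a PID guard, and two independent singletons.

```python
import os


class NoStatsLogger:
    """No-op backend: accepts every call and records nothing."""

    def gauge(self, name, value, tags=None):
        pass


class RecordingLogger:
    """Stand-in for a real metrics backend; keeps emitted gauges in a list."""

    def __init__(self):
        self.sent = []

    def gauge(self, name, value, tags=None):
        self.sent.append((name, value, tags or {}))


class StatsSingleton:
    """Hypothetical model of an explicitly initialized stats singleton."""

    def __init__(self):
        self.instance = NoStatsLogger()  # default until initialize() runs
        self._initialized_pid = None

    def initialize(self, factory):
        # PID guard: build the backend at most once per process, so a forked
        # child re-initializes instead of reusing the parent's backend.
        if self._initialized_pid == os.getpid():
            return
        self.instance = factory()
        self._initialized_pid = os.getpid()

    def gauge(self, name, value, tags=None):
        self.instance.gauge(name, value, tags)


core_stats = StatsSingleton()  # models the core stats path
sdk_stats = StatsSingleton()   # models the Task SDK stats path

backend = RecordingLogger()
core_stats.initialize(factory=lambda: backend)
# The equivalent call for sdk_stats is never made, as in the API server.

core_stats.gauge("api_server.example", 1)
sdk_stats.gauge("edge_worker.connected", 1, tags={"worker_name": "w1"})

print(backend.sent)                       # only the core metric was recorded
print(type(sdk_stats.instance).__name__)  # still the no-op default
```

In this model the SDK-path gauge raises no error and leaves no trace, which matches the "silently dropped" symptom.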
### Minimal reproduction
1. Airflow 3.2.x, `EdgeExecutor`, `edge3` 3.7.0, metrics enabled (`[metrics]
statsd_on = True`).
2. Start an edge worker so it heartbeats against the API server.
3. Scrape the metrics backend → no `edge_worker.*` series.
4. In the API server process: `from airflow.providers.common.compat.sdk
import Stats; type(Stats.instance)` → `NoStatsLogger`.
5. Manually run, in the same process:
```python
from airflow.sdk._shared.observability.metrics import stats
from airflow.sdk.observability.metrics import stats_utils
from airflow.configuration import conf

# Explicitly initialize the Task SDK Stats singleton in this process.
stats.initialize(
    factory=stats_utils.get_stats_factory(),
    export_legacy_names=conf.getboolean("metrics", "legacy_names_on"),
)
```
→ on the next heartbeat all `edge_worker.*` series appear (verified in a
live 3.2.2 deployment: 200+ series restored, correctly tagged with
`worker_name`).
### Proposed fix
Initialize the Task SDK `Stats` singleton in the API server's FastAPI
`lifespan` (runs once per worker, post-fork), mirroring the existing init in
`serde` / `task_runner`. PR to follow.
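As a rough illustration of the shape of that fix (not the actual patch), a lifespan hook is an async context manager whose code before `yield` runs at startup in each server worker. The sketch below uses only the standard library. `init_sdk_stats` is a hypothetical placeholder for the real `stats.initialize(...)` call, and `simulate_worker` mimics what an ASGI server does around the app's lifetime.

```python
import asyncio
from contextlib import asynccontextmanager

calls = []


def init_sdk_stats():
    """Hypothetical stand-in for the real stats.initialize(...) call."""
    calls.append("initialized")


@asynccontextmanager
async def lifespan(app):
    # Startup: runs once in each worker process before it serves requests.
    init_sdk_stats()
    yield
    # Shutdown: nothing to clean up in this sketch.


async def simulate_worker():
    # Mimics an ASGI server entering and leaving the app's lifespan.
    async with lifespan(app=None):
        calls.append("serving")


asyncio.run(simulate_worker())
print(calls)
```

With FastAPI, a context manager like this is passed as `FastAPI(lifespan=lifespan)`. Where exactly the API server builds its app, and whether it already defines a lifespan to extend, is not covered here.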
### Relationship to #67328
#67328 (*Bring back edge worker metric compatibility with Airflow 3.2*) is
**complementary, not a duplicate**: it makes `edge3` dual-emit the legacy
dotted form so old StatsD mappings match again (a tag/naming concern). It does
**not** initialize the `Stats` singleton — empirically,
`DualStatsManager.gauge(...)` without `Stats.initialize()` is also dropped.
This issue is the root-cause init gap.
### Side note (separate, benign)
Once `Stats` is initialized, the SDK `statsd_logger` logs
`Dropping invalid tag: queues=a,b,c` for the `edge_worker.num_queues` gauge.
`set_metrics` builds a `queues` tag whose value is a comma-joined list, and
because commas delimit tags in the InfluxDB-style StatsD format, that single
tag is dropped (the metric value and `worker_name` still flow). Worth fixing
in `edge3` separately.
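The ambiguity can be shown with a small sketch. Everything in it is illustrative: `format_influx_statsd` and `is_valid_tag_value` are hypothetical helpers, not the SDK's functions. The gauge value is assumed to be the queue count, and joining with `+` is just one possible workaround, not a proposal for what `edge3` should do.

```python
def format_influx_statsd(name, value, tags):
    """Render a gauge in InfluxDB-style StatsD syntax: name,k=v,k=v:value|g"""
    tag_part = "".join(f",{k}={v}" for k, v in tags.items())
    return f"{name}{tag_part}:{value}|g"


def is_valid_tag_value(value):
    # Simplified rule for this sketch: a value must not contain the delimiter.
    return "," not in str(value)


queues = ["a", "b", "c"]
tags = {"worker_name": "w1", "queues": ",".join(queues)}

# A receiver that splits on commas cannot tell where the queues tag ends.
ambiguous = format_influx_statsd("edge_worker.num_queues", len(queues), tags)
print(ambiguous)

# Dropping the invalid tag keeps the metric value and the remaining tags.
kept = {k: v for k, v in tags.items() if is_valid_tag_value(v)}
dropped_line = format_influx_statsd("edge_worker.num_queues", len(queues), kept)
print(dropped_line)

# One possible workaround (an assumption): join with a non-delimiter character.
safe = {"worker_name": "w1", "queues": "+".join(queues)}
safe_line = format_influx_statsd("edge_worker.num_queues", len(queues), safe)
print(safe_line)
```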
### Operating System / Deployment
Kubernetes, Airflow 3.2.2, `edge3` provider 3.7.0, `statsd_exporter` v0.28.0.
### Are you willing to submit PR?
Yes — PR to follow.