diogosilva30 opened a new issue, #68077:
URL: https://github.com/apache/airflow/issues/68077

   ### Apache Airflow version
   
   3.2.2 
   
   ### What happened
   
   After upgrading to Airflow 3.2.x with the `edge3` provider (3.7.0), all 
`airflow_edge_worker_*` metrics disappeared from the metrics backend 
(StatsD/Prometheus). Other Airflow metrics (`scheduler.*`, `dag_processing.*`, 
`api_server.*`, pool metrics) kept working.
   
   ### What you think should happen instead
   
   Edge Worker metrics (`edge_worker.connected`, `edge_worker.num_queues`, 
`edge_worker.heartbeat_count`, `edge_worker.ti.*`, etc.) should be emitted as 
before.
   
   ### Root cause
   
   The Edge Worker REST API is served by the **API server** 
(`/edge_worker/v1/...`). A worker heartbeat (`PATCH /edge_worker/v1/worker/`) 
runs `set_state` → `set_metrics`, which records metrics through the **Task 
SDK** `Stats` singleton (`airflow.sdk._shared.observability.metrics.stats`, 
resolved by the Edge provider via `airflow.providers.common.compat`).
   
   Every other long-running component initializes that singleton:
   
   - core stats: `jobs/scheduler_job_runner.py`, 
`jobs/triggerer_job_runner.py`, `dag_processing/manager.py`, 
`executors/base_executor.py`
   - SDK stats: `task-sdk/.../execution_time/task_runner.py`, 
`task-sdk/.../serde/__init__.py`
   
   …all call `stats.initialize(factory=stats_utils.get_stats_factory(), 
export_legacy_names=...)`.
   
   The **API server never calls `stats.initialize(...)`**. Before #63932 
(*Remove the DualStatsManager and the Stats interfaces*), `Stats` lazily 
auto-initialized its backend on first use. #63932 replaced that with explicit 
`Stats.initialize(...)` + a PID guard, and added the explicit call to the 
components above — but **not** to the API server. As a result the Task SDK 
`Stats` singleton in the API server process stays a `NoStatsLogger`, and every 
Edge Worker metric is silently dropped.
   
   This also explains why `api_server.*` metrics still work: 
they go through the separately initialized **core** stats path, while the Edge 
metrics use the uninitialized **SDK** path. Note that edge3 3.7.0 already passes 
`tags={"worker_name": ...}` to `Stats.gauge`, so the metrics were missing 
entirely (not merely missing the `worker_name` tag). Once `Stats` is 
initialized, they flow through correctly tagged.
   
   ### Minimal reproduction
   
   1. Airflow 3.2.x, `EdgeExecutor`, `edge3` 3.7.0, metrics enabled (`[metrics] 
statsd_on = True`).
   2. Start an edge worker so it heartbeats against the API server.
   3. Scrape the metrics backend → no `edge_worker.*` series.
   4. In the API server process: `from airflow.providers.common.compat.sdk 
import Stats; type(Stats.instance)` → `NoStatsLogger`.
   5. Manually run, in the same process:
      ```python
      from airflow.configuration import conf
      from airflow.sdk._shared.observability.metrics import stats
      from airflow.sdk.observability.metrics import stats_utils

      # Same call the scheduler, triggerer, and task runner make at startup.
      stats.initialize(
          factory=stats_utils.get_stats_factory(),
          export_legacy_names=conf.getboolean("metrics", "legacy_names_on"),
      )
      ```
      → on the next heartbeat all `edge_worker.*` series appear (verified in a 
live 3.2.2 deployment: 200+ series restored, correctly tagged with 
`worker_name`).
   
   ### Proposed fix
   
   Initialize the Task SDK `Stats` singleton in the API server's FastAPI 
`lifespan` (runs once per worker, post-fork), mirroring the existing init in 
`serde` / `task_runner`. PR to follow.
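
As a rough sketch of the shape such a hook could take (not the actual patch), a lifespan function is an async context manager whose code before `yield` runs at startup. `initialize_sdk_stats` and `startup_log` are hypothetical placeholders so the sketch runs without Airflow or FastAPI installed. In the real fix the placeholder body would be the `stats.initialize(...)` call from the reproduction steps.

```python
import asyncio
from contextlib import asynccontextmanager

startup_log = []


def initialize_sdk_stats():
    # Hypothetical placeholder: the real hook would call
    # stats.initialize(factory=..., export_legacy_names=...) here.
    startup_log.append("stats initialized")


@asynccontextmanager
async def lifespan(app):
    # Code before `yield` runs once at startup, before any request is served.
    initialize_sdk_stats()
    yield
    # Shutdown logic, if any, would go after the yield.


async def simulate_server_start():
    # With FastAPI the hook is registered as FastAPI(lifespan=lifespan);
    # here the context manager is entered by hand to keep the sketch standalone.
    async with lifespan(app=None):
        startup_log.append("serving requests")


asyncio.run(simulate_server_start())
```

Placing the call in the lifespan rather than at module import time means each forked server worker runs it in its own process, which is what a per-process PID guard expects.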
   
   ### Relationship to #67328
   
   Not a duplicate, and on Airflow 3.2+ the two do not overlap. #67328 
reintroduces a `DualStatsManager` code path in edge3's `set_metrics`, guarded 
by `try: from airflow.sdk.observability.stats import DualStatsManager`. 
`DualStatsManager` was removed in #63932, so on any Airflow that includes 
#63932 (i.e. 3.2+) that import fails, `DualStatsManager` is `None`, and edge3 
falls back to the existing `Stats.gauge(..., tags={"worker_name": ...})` path — 
making #67328 effectively a **no-op on 3.2+**. It targets cross-version 
provider compatibility with **Airflow < 3.2** (where `DualStatsManager` still 
exists) so legacy dotted-name StatsD mappings keep working. Neither branch of 
#67328 calls `Stats.initialize()`, so it does not address the 3.2+ init gap 
described here.
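
The guarded-import fallback described above can be sketched as follows. This is an illustration of the pattern only, not the code from #67328: `load_optional`, `choose_metrics_path`, and the module name `hypothetical_airflow_pkg.stats` are made up for the example.

```python
import importlib


def load_optional(module_name, attr):
    """Return `attr` from `module_name`, or None when it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


def choose_metrics_path(dual_stats_manager):
    # Use the manager when it exists; otherwise fall back to the plain
    # tagged gauge path.
    if dual_stats_manager is not None:
        return "dual_stats_manager"
    return "stats_gauge_with_tags"


# Made-up module that does not exist: models an Airflow version where
# DualStatsManager has been removed.
missing = load_optional("hypothetical_airflow_pkg.stats", "DualStatsManager")

# collections.OrderedDict stands in for a version where the import succeeds.
present = load_optional("collections", "OrderedDict")

path_when_missing = choose_metrics_path(missing)
path_when_present = choose_metrics_path(present)
```

Neither branch in this sketch initializes anything, which mirrors the point above: choosing between the two emit paths does not help if the backend behind them is still a no-op.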
   
   ### Operating System / Deployment
   
   Kubernetes, Airflow 3.2.2, `edge3` provider 3.7.0, `statsd_exporter` v0.28.0.
   
   ### Are you willing to submit PR?
   
   Yes — PR to follow.

