MichaelRBlack opened a new issue, #64690:
URL: https://github.com/apache/airflow/issues/64690
### Apache Airflow version
3.1.8
### What happened
Task-level OTel metrics (e.g. `ti.finish`) are silently dropped in forked
task subprocesses. The metrics never reach the OTel collector, causing gaps in
monitoring dashboards (e.g. Grafana).
### What you think should happen instead
Metrics emitted from forked task processes should be exported to the OTel
collector, same as the parent process.
### How to reproduce
1. Configure Airflow 3.x with OTel metrics enabled (`otel_on = True`)
2. Run any DAG with KubernetesExecutor (or any executor that forks for task
execution)
3. Observe task logs show:
```
INFO - Stats instance was created in PID 7 but accessed in PID 19.
Re-initializing.
INFO - [Metric Exporter] Connecting to OpenTelemetry Collector at
http://...
WARNING - Overriding of current MeterProvider is not allowed
```
4. Check Grafana/Prometheus — `ti.finish` metrics are missing
### Root cause
`airflow/stats.py` correctly detects PID mismatches after fork and
re-initializes the Stats instance by calling `otel_logger.get_otel_logger()`.
This creates a fresh `MeterProvider` and calls `metrics.set_meter_provider()`.
However, the OTel Python SDK uses a `Once()` guard
(`opentelemetry/metrics/_internal/__init__.py`):
```python
_METER_PROVIDER_SET_ONCE = Once()

def set_meter_provider(meter_provider):
    def set_mp():
        global _METER_PROVIDER
        _METER_PROVIDER = meter_provider
        _PROXY_METER_PROVIDER.on_set_meter_provider(meter_provider)

    did_set = _METER_PROVIDER_SET_ONCE.do_once(set_mp)
    if not did_set:
        _logger.warning("Overriding of current MeterProvider is not allowed")
```
The `Once._done = True` flag from the parent process survives `fork()`, so
the child's `set_meter_provider()` silently fails. The child ends up using the
parent's stale `MeterProvider` whose `PeriodicExportingMetricReader` background
thread is dead after fork.
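The inherited flag can be demonstrated without OTel at all. A minimal sketch, assuming a POSIX system where `fork()` is available, using a stand-in `Once` class modeled on the SDK's guard:

```python
import os

class Once:
    """Stand-in for the SDK's Once guard: run a callback at most once."""
    def __init__(self):
        self._done = False

    def do_once(self, func):
        if self._done:
            return False
        func()
        self._done = True
        return True

_ONCE = Once()
assert _ONCE.do_once(lambda: None)  # parent "sets" the provider

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child inherits _done=True through fork's memory copy, so its
    # attempt to set a fresh provider is silently refused.
    os.write(w, b"1" if _ONCE.do_once(lambda: None) else b"0")
    os._exit(0)
os.waitpid(pid, 0)
child_did_set = os.read(r, 1)
assert child_did_set == b"0"  # the child's do_once() returned False
```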
The code path:
1. **`stats.py:55-64`** — detects PID mismatch, sets `cls.instance = None`,
calls factory
2. **`otel_logger.py:410`** — creates new `MeterProvider`, calls
`metrics.set_meter_provider()`
3. **OTel SDK `Once().do_once()`** — returns `False` because `_done` was
inherited from parent
4. **`otel_logger.py` returns**
`SafeOtelLogger(metrics.get_meter_provider(), ...)` — gets stale parent provider
5. **`task_runner.py:1195`** — `Stats.incr("ti.finish", ...)` → dead
exporter → metrics lost
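The PID-mismatch detection in step 1 can be sketched as a PID-checked singleton (a simplified illustration with hypothetical names, not Airflow's actual code):

```python
import os

class Stats:
    """Sketch of a PID-aware singleton in the style of airflow/stats.py."""
    instance = None
    instance_pid = None

    @classmethod
    def get(cls, factory):
        pid = os.getpid()
        if cls.instance is None or cls.instance_pid != pid:
            # PID mismatch after fork(): drop the stale instance and rebuild.
            cls.instance = factory()
            cls.instance_pid = pid
        return cls.instance

calls = []
first = Stats.get(lambda: calls.append("made") or object())
second = Stats.get(lambda: calls.append("made") or object())
assert first is second      # same process: cached instance reused
assert calls == ["made"]    # factory ran exactly once
```

The bug is that this rebuild is correct on the Airflow side; it is the SDK-level `Once` guard underneath that refuses the rebuilt provider.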
### Proposed fix
Reset the OTel SDK's provider state in `get_otel_logger()` before calling
`set_meter_provider()`. Since `stats.py` only calls the factory after detecting
a PID mismatch (i.e., we know we're in a forked child), this is safe:
```python
# otel_logger.py — before metrics.set_meter_provider(...)
import opentelemetry.metrics._internal as _metrics_internal
_metrics_internal._METER_PROVIDER_SET_ONCE._done = False
_metrics_internal._METER_PROVIDER = None
```
This is the minimal change that fixes the issue at the point of failure. The
`stats.py` fork detection already guarantees this code only runs in child
processes that need a fresh provider.
### Operating System
Linux (EKS, Kubernetes)
### Versions of Apache Airflow Providers
N/A (core issue)
### Deployment
Official Apache Airflow Helm Chart
### Anything else
Workaround: an Airflow plugin can call
`os.register_at_fork(after_in_child=...)` at import time to reset the OTel
state in forked children. This works but shouldn't be necessary; the fix
belongs in core.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]