MichaelRBlack opened a new issue, #64690:
URL: https://github.com/apache/airflow/issues/64690

   ### Apache Airflow version
   
   3.1.8
   
   ### What happened
   
   Task-level OTel metrics (e.g. `ti.finish`) are silently dropped in forked 
task subprocesses. The metrics never reach the OTel collector, causing gaps in 
monitoring dashboards (e.g. Grafana).
   
   ### What you think should happen instead
   
   Metrics emitted from forked task processes should be exported to the OTel 
collector, just like metrics emitted from the parent process.
   
   ### How to reproduce
   
   1. Configure Airflow 3.x with OTel metrics enabled (`otel_on = True`)
   2. Run any DAG with KubernetesExecutor (or any executor that forks for task 
execution)
   3. Observe that the task logs show:
      ```
      INFO - Stats instance was created in PID 7 but accessed in PID 19. 
Re-initializing.
      INFO - [Metric Exporter] Connecting to OpenTelemetry Collector at 
http://...
      WARNING - Overriding of current MeterProvider is not allowed
      ```
   4. Check Grafana/Prometheus — `ti.finish` metrics are missing
   
   ### Root cause
   
   `airflow/stats.py` correctly detects PID mismatches after fork and 
re-initializes the Stats instance by calling `otel_logger.get_otel_logger()`. 
This creates a fresh `MeterProvider` and calls `metrics.set_meter_provider()`.
   
   However, the OTel Python SDK uses a `Once()` guard 
(`opentelemetry/metrics/_internal/__init__.py`):
   
   ```python
   _METER_PROVIDER_SET_ONCE = Once()
   
   def set_meter_provider(meter_provider):
       def set_mp():
           global _METER_PROVIDER
           _METER_PROVIDER = meter_provider
           _PROXY_METER_PROVIDER.on_set_meter_provider(meter_provider)
   
       did_set = _METER_PROVIDER_SET_ONCE.do_once(set_mp)
       if not did_set:
           _logger.warning("Overriding of current MeterProvider is not allowed")
   ```
   
   The `Once._done = True` flag from the parent process survives `fork()`, so 
the child's `set_meter_provider()` call is rejected and only logs the warning 
shown above. The child ends up using the parent's stale `MeterProvider`, whose 
`PeriodicExportingMetricReader` background thread is dead after fork.
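
The two fork effects described above can be reproduced without the OTel SDK.
The following is a minimal sketch under stated assumptions: the `Once` class
is a simplified stand-in written for this example (not the SDK's class), the
background thread stands in for the exporter thread, and
`observe_child_state` is a hypothetical helper name. It requires a POSIX
system, since it calls `os.fork()`.

```python
import os
import threading
import warnings


class Once:
    """Simplified stand-in for a once-guard: the first call wins, later calls are refused."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False

    def do_once(self, func):
        with self._lock:
            if self._done:
                return False
            func()
            self._done = True
            return True


def observe_child_state():
    """Fork once and report what the child sees: (guard accepted a new set?, thread alive?)."""
    guard = Once()
    guard.do_once(lambda: None)  # the parent "sets the provider"

    stop = threading.Event()
    exporter = threading.Thread(target=stop.wait, daemon=True)  # stand-in export thread
    exporter.start()

    read_fd, write_fd = os.pipe()
    with warnings.catch_warnings():
        # Python 3.12+ warns when fork() is called from a multi-threaded process,
        # which is exactly the situation being illustrated here.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: the guard's flag was copied by fork(), the thread was not.
        accepted = guard.do_once(lambda: None)
        alive = exporter.is_alive()
        os.write(write_fd, f"{int(accepted)}{int(alive)}".encode())
        os._exit(0)

    # Parent: collect the child's two-character report.
    os.close(write_fd)
    data = os.read(read_fd, 2).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    return data[0] == "1", data[1] == "1"
```

In the child, the guard refuses the second call and the thread started by the
parent is no longer alive, so `observe_child_state()` returns `(False, False)`.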
   
   The code path:
   1. **`stats.py:55-64`** — detects PID mismatch, sets `cls.instance = None`, 
calls factory
   2. **`otel_logger.py:410`** — creates new `MeterProvider`, calls 
`metrics.set_meter_provider()`
   3. **OTel SDK `Once().do_once()`** — returns `False` because `_done` was 
inherited from parent
   4. **`otel_logger.py` returns** 
`SafeOtelLogger(metrics.get_meter_provider(), ...)` — gets stale parent provider
   5. **`task_runner.py:1195`** — `Stats.incr("ti.finish", ...)` → dead 
exporter → metrics lost
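
Steps 1 to 4 above can be sketched with plain Python stand-ins. Every name in
this block (`set_provider`, `make_logger`, the `Stats` class and its `get`
method) is hypothetical and only mirrors the shape of the flow; it is not the
Airflow or OTel SDK code. The fork is simulated by passing a different PID
while the module-level state stays the same, which is what a forked child
inherits. The PIDs 7 and 19 are taken from the log lines in the reproduction
steps.

```python
# Illustrative stand-ins only; not Airflow or OTel SDK code.
_provider = None        # plays the role of the SDK's global MeterProvider
_provider_set = False   # plays the role of the once-guard's "done" flag


def set_provider(provider):
    """Accept only the first provider, like a once-guarded setter."""
    global _provider, _provider_set
    if _provider_set:
        return False  # the real SDK logs a warning at this point
    _provider, _provider_set = provider, True
    return True


def make_logger(pid):
    """Factory: build a fresh provider, try to install it, return whatever is installed."""
    set_provider({"created_in_pid": pid})
    return _provider


class Stats:
    instance = None
    instance_pid = None

    @classmethod
    def get(cls, current_pid):
        # Re-run the factory when the stored PID no longer matches.
        if cls.instance is None or cls.instance_pid != current_pid:
            cls.instance = make_logger(current_pid)
            cls.instance_pid = current_pid
        return cls.instance


parent = Stats.get(current_pid=7)
child = Stats.get(current_pid=19)  # simulated fork: same module state, new PID

# The re-initialization ran, but the child still holds the parent's provider.
print(child["created_in_pid"])  # prints 7
```

The PID check does its job and the factory runs again, yet the once-guarded
setter discards the new provider, so the object handed back is the one created
in the parent.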
   
   ### Proposed fix
   
   Reset the OTel SDK's provider state in `get_otel_logger()` before calling 
`set_meter_provider()`. Since `stats.py` only calls the factory after detecting 
a PID mismatch (i.e., we know we're in a forked child), this is safe:
   
   ```python
   # otel_logger.py — before metrics.set_meter_provider(...)
   import opentelemetry.metrics._internal as _metrics_internal
   _metrics_internal._METER_PROVIDER_SET_ONCE._done = False
   _metrics_internal._METER_PROVIDER = None
   ```
   
   This is the minimal change that fixes the issue at the point of failure. The 
`stats.py` fork detection already guarantees this code only runs in child 
processes that need a fresh provider.
   
   ### Operating System
   
   Linux (EKS, Kubernetes)
   
   ### Versions of Apache Airflow Providers
   
   N/A (core issue)
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Anything else
   
   Workaround: an Airflow plugin using 
`os.register_at_fork(after_in_child=...)` to reset the OTel state. This works 
but shouldn't be necessary — the fix belongs in core.


-- 
This is an automated message from the Apache Git Service.