Nikita created CAMEL-23380:
------------------------------

             Summary: `TracingRoutePolicy.onExchangeBegin` opens scopes and 
spans that are never closed/ended
                 Key: CAMEL-23380
                 URL: https://issues.apache.org/jira/browse/CAMEL-23380
             Project: Camel
          Issue Type: Bug
          Components: camel-opentelemetry
    Affects Versions: 4.14.5
            Reporter: Nikita


## Summary
`Tracer$TracingRoutePolicy.onExchangeBegin()` calls `beginEventSpan()` which 
opens an OTel scope via `makeCurrent()`, but the corresponding `endEventSpan()` 
is never reached. The span is never ended (`endEpochNanos=0`) and the scope is 
never closed. On pooled Tomcat threads, this causes `ArrayBasedContext` objects 
to accumulate in `ThreadLocal` storage, growing ~8 MB/hour until OOM.
## Evidence
With `-Dio.opentelemetry.context.enableStrictContext=true` enabled, **71 `Scope 
garbage collected before being closed` errors** were captured over 15 minutes 
of normal traffic. All 71 share an identical stack trace and all 71 leaked 
spans have `endEpochNanos=0`.
**Stack trace (all 71 entries):**
```
io.opentelemetry.context.Context.makeCurrent (Context.java:231)
io.opentelemetry.context.ImplicitContextKeyed.makeCurrent 
(ImplicitContextKeyed.java:33)
org.apache.camel.opentelemetry2.OpenTelemetrySpanAdapter.makeCurrent 
(OpenTelemetrySpanAdapter.java:47)
org.apache.camel.opentelemetry2.OpenTelemetryTracer$OpentelemetrySpanLifecycleManager.activate
 (OpenTelemetryTracer.java:141)
org.apache.camel.telemetry.Tracer.beginEventSpan (Tracer.java:267)
org.apache.camel.telemetry.Tracer$TracingRoutePolicy.onExchangeBegin 
(Tracer.java:220)
org.apache.camel.impl.engine.CamelInternalProcessor$RoutePolicyAdvice.before 
(CamelInternalProcessor.java:554)
org.apache.camel.impl.engine.CamelInternalProcessor.process 
(CamelInternalProcessor.java:324)
```
**Per-exchange pattern:**
- 41 exchanges produced 71 leaks
- 26 exchanges leaked 2 scopes: `[router-route span, child-route span]`
- 15 exchanges leaked 3 scopes: `[router-route span, router-route span, 
child-route span]`
- All 71 leaks occur on the same Tomcat thread as the originating exchange (no 
async/cross-thread handoff for the HTTP side)
The double router-route leak in the 3-scope pattern may indicate 
`TracingRoutePolicy` is registered twice on some routes, but the core issue 
(spans never ended) affects all patterns.
**Additional observation — wireTap executor threads also affected:**
Heap dump analysis from Apr 29 shows `wireTapIoExecutor` threads accumulating 
OTel contexts via the same ThreadLocal mechanism. In the most leaked pod 
observed (879 MB live heap), two of the four wireTap executor threads each 
retained ~15.9 MB of OTel context chains while the other two remained at 
baseline (~147 KB). The selectively affected threads are those that have 
processed exchanges on the child handler routes (route 3 above), which contain 
multiple `.wireTap()` calls. The wireTap executor threads are not Tomcat 
servlet threads, so the leak on these threads follows a different call path 
than the HTTP side but appears to share the same root cause: 
`TracingRoutePolicy.onExchangeBegin` opening a scope that `endEventSpan` never 
closes.
**Span names from the leaking routes:**
| Leaks | Span Name |
|---|---|
| 31 | `corsair/carrier-direct-documents` (child route) |
| 18 | `carrier-direct-tracking-rest-consumer-*-router` (router route) |
| 12 | `carrier-direct-document-rest-consumer-*-router` (router route) |
| 10 | `corsair/carrier-direct-tracking` (child route) |
## Route structure
The leaking routes follow a standard REST DSL pattern with three routes per 
endpoint:
1. `rest:post:/path` (auto-generated, `routeId = X--rest`) calls 
`.to("direct:X-router")`
2. `direct:X-router` (`routeId = X-router`) dispatches via 
`toD("direct:${header.ROUTE_TO}")`
3. `direct:X-<version>` (`routeId = X::<version>`) is the handler with multiple 
`.wireTap()` calls
Spans for routes 2 and 3 leak. Route 1's span does not appear in the 
strict-context output.
## Heap dump confirmation
Production heap dumps across multiple pods and dates show linear growth of OTel 
objects retained by Tomcat `http-nio-*-exec` ThreadLocals, with 
`wireTapIoExecutor` threads also affected at higher uptimes:
| Date | Uptime | TaskThread Retained | Live Heap (MAT) | `SdkSpan` Count |
|---|---|---|---|---|
| Apr 17 | 0.2h | not measurable | ~470 MB (baseline) | — |
| Apr 17 | 10.8h | 104 MB (16.8%) | 593 MB | 23,188 |
| Apr 17 | 23.9h | 201 MB (26.2%) | 733 MB | 47,145 |
| Apr 24 | 23.9h | 209 MB (30.1%) | 697 MB | 51,444 |
| Apr 29 | high uptime | **309 MB (33.5%)** | **879 MB** | **83,138** |
Growth rate: ~8 MB/hour across HTTP threads. With an 8 GB max heap, pods OOM 
after approximately 40-50 hours. `SdkSpan` count (unended spans) grows 
proportionally and serves as the most direct indicator of leak severity.
## Configuration
```yaml
camel:
  opentelemetry2:
    enabled: true
    trace-processors: false   # only route-level spans, no per-processor spans
```
## Environment
| Dependency | Version |
|---|---|
| `camel-opentelemetry2-starter` | 4.14.5 |
| `camel-telemetry` | 4.14.5 (transitive) |
| `io.opentelemetry:opentelemetry-bom` | 1.60.1 |
| `io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom` | 2.26.1 
|
| `io.micrometer:micrometer-tracing-bridge-otel` | 1.5.10 |
| Spring Boot | 3.5.13 |
| JDK | 25.0.1+8-LTS |
Note: `CAMEL-21302` (async producer context leak, fixed in 4.4.5) does not 
cover this case. Our leak is synchronous, same-thread, and caused by 
`beginEventSpan`/`endEventSpan` lifecycle mismatch rather than cross-thread 
scope closure.
## Expected behavior
Every `beginEventSpan()` call in `TracingRoutePolicy.onExchangeBegin` should 
have a matching `endEventSpan()` that ends the span and closes the scope when 
the exchange completes on that route.
## Workaround
A servlet filter calling `Context.root().makeCurrent().close()` after each 
request caps the heap impact on Tomcat HTTP threads but does not fix the 
underlying issue (spans still go unended, trace data is lost). It also has no 
effect on `wireTapIoExecutor` threads or other non-servlet thread pools, which 
accumulate OTel contexts via the same mechanism but are not covered by the 
filter chain.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to