[ 
https://issues.apache.org/jira/browse/CAMEL-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikita updated CAMEL-23380:
---------------------------
    Description: 
*Summary*
`Tracer$TracingRoutePolicy.onExchangeBegin()` calls `beginEventSpan()` which 
opens an OTel scope via `makeCurrent()`, but the corresponding `endEventSpan()` 
is never reached. The span is never ended (`endEpochNanos=0`) and the scope is 
never closed. On pooled Tomcat threads, this causes `ArrayBasedContext` objects 
to accumulate in `ThreadLocal` storage, growing ~8 MB/hour until OOM.

*Evidence*
With `-Dio.opentelemetry.context.enableStrictContext=true` enabled, *{*}71 
`Scope garbage collected before being closed` errors{*}* were captured over 15 
minutes of normal traffic. All 71 share an identical stack trace and all 71 
leaked spans have `endEpochNanos=0`.
*{*}Stack trace (all 71 entries):{*}*
```
io.opentelemetry.context.Context.makeCurrent (Context.java:231)
io.opentelemetry.context.ImplicitContextKeyed.makeCurrent 
(ImplicitContextKeyed.java:33)
org.apache.camel.opentelemetry2.OpenTelemetrySpanAdapter.makeCurrent 
(OpenTelemetrySpanAdapter.java:47)
org.apache.camel.opentelemetry2.OpenTelemetryTracer$OpentelemetrySpanLifecycleManager.activate
 (OpenTelemetryTracer.java:141)
org.apache.camel.telemetry.Tracer.beginEventSpan (Tracer.java:267)
org.apache.camel.telemetry.Tracer$TracingRoutePolicy.onExchangeBegin 
(Tracer.java:220)
org.apache.camel.impl.engine.CamelInternalProcessor$RoutePolicyAdvice.before 
(CamelInternalProcessor.java:554)
org.apache.camel.impl.engine.CamelInternalProcessor.process 
(CamelInternalProcessor.java:324)
```
*{*}Per-exchange pattern:{*}*
 - 41 exchanges produced 71 leaks
 - 26 exchanges leaked 2 scopes: `[router-route span, child-route span]`
 - 15 exchanges leaked 3 scopes: `[router-route span, router-route span, 
child-route span]`

All 71 leaks occur on the same Tomcat thread as the originating exchange (no 
async/cross-thread handoff for the HTTP side)
The double router-route leak in the 3-scope pattern may indicate 
`TracingRoutePolicy` is registered twice on some routes, but the core issue 
(spans never ended) affects all patterns.
*{*}Additional observation — wireTap executor threads also affected:{*}*
Heap dump analysis from Apr 29 shows `wireTapIoExecutor` threads accumulating 
OTel contexts via the same ThreadLocal mechanism. In the most leaked pod 
observed (879 MB live heap), two of the four wireTap executor threads each 
retained ~15.9 MB of OTel context chains while the other two remained at 
baseline (~147 KB). The selectively affected threads are those that have 
processed exchanges on the child handler routes (route 3 above), which contain 
multiple `.wireTap()` calls. The wireTap executor threads are not Tomcat 
servlet threads, so the leak on these threads follows a different call path 
than the HTTP side but appears to share the same root cause: 
`TracingRoutePolicy.onExchangeBegin` opening a scope that `endEventSpan` never 
closes.
*{*}Span names from the leaking routes:{*}*
|Leaks|Span Name|
|—|—|
|31|`corsair/carrier-direct-documents` (child route)|
|18|`carrier-direct-tracking-rest-consumer-*-router` (router route)|
|12|`carrier-direct-document-rest-consumer-*-router` (router route)|
|10|`corsair/carrier-direct-tracking` (child route)|

*Route structure*
The leaking routes follow a standard REST DSL pattern with three routes per 
endpoint:
1. `rest:post:/path` (auto-generated, `routeId = X--rest`) calls 
`.to("direct:X-router")`
2. `direct:X-router` (`routeId = X-router`) dispatches via 
`toD("direct:${header.ROUTE_TO}")`
3. `direct:X-<version>` (`routeId = X::<version>`) is the handler with multiple 
`.wireTap()` calls
Spans for routes 2 and 3 leak. Route 1's span does not appear in the 
strict-context output.

*Heap dump confirmation*
Production heap dumps across multiple pods and dates show linear growth of OTel 
objects retained by Tomcat `http-nio-*-exec` ThreadLocals, with 
`wireTapIoExecutor` threads also affected at higher uptimes:
|Date|Uptime|TaskThread Retained|Live Heap (MAT)|`SdkSpan` Count|
|—|—|—|—|—|
|Apr 17|0.2h|not measurable|~470 MB (baseline)|—|
|Apr 17|10.8h|104 MB (16.8%)|593 MB|23,188|
|Apr 17|23.9h|201 MB (26.2%)|733 MB|47,145|
|Apr 24|23.9h|209 MB (30.1%)|697 MB|51,444|
|Apr 29|high uptime|*{*}309 MB (33.5%){*}*|*{*}879 MB{*}*|*{*}83,138{*}*|

Growth rate: ~8 MB/hour across HTTP threads. With an 8 GB max heap, pods OOM 
after approximately 40-50 hours. `SdkSpan` count (unended spans) grows 
proportionally and serves as the most direct indicator of leak severity.

*Configuration*
```yaml
camel:
  opentelemetry2:
    enabled: true
    trace-processors: false   # only route-level spans, no per-processor spans
```

*Environment*
|Dependency|Version|
|—|—|
|`camel-opentelemetry2-starter`|4.14.5|
|`camel-telemetry`|4.14.5 (transitive)|
|`io.opentelemetry:opentelemetry-bom`|1.60.1|
|`io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom`|2.26.1|
|`io.micrometer:micrometer-tracing-bridge-otel`|1.5.10|
|Spring Boot|3.5.13|
|JDK|25.0.1+8-LTS|

Note: `CAMEL-21302` (async producer context leak, fixed in 4.4.5) does not 
cover this case. Our leak is synchronous, same-thread, and caused by 
`beginEventSpan`/`endEventSpan` lifecycle mismatch rather than cross-thread 
scope closure.

*Expected behavior*
Every `beginEventSpan()` call in `TracingRoutePolicy.onExchangeBegin` should 
have a matching `endEventSpan()` that ends the span and closes the scope when 
the exchange completes on that route.

*Workaround*
A servlet filter calling `Context.root().makeCurrent().close()` after each 
request caps the heap impact on Tomcat HTTP threads but does not fix the 
underlying issue (spans still go unended, trace data is lost). It also has no 
effect on `wireTapIoExecutor` threads or other non-servlet thread pools, which 
accumulate OTel contexts via the same mechanism but are not covered by the 
filter chain.

  was:
## Summary
`Tracer$TracingRoutePolicy.onExchangeBegin()` calls `beginEventSpan()` which 
opens an OTel scope via `makeCurrent()`, but the corresponding `endEventSpan()` 
is never reached. The span is never ended (`endEpochNanos=0`) and the scope is 
never closed. On pooled Tomcat threads, this causes `ArrayBasedContext` objects 
to accumulate in `ThreadLocal` storage, growing ~8 MB/hour until OOM.
## Evidence
With `-Dio.opentelemetry.context.enableStrictContext=true` enabled, **71 `Scope 
garbage collected before being closed` errors** were captured over 15 minutes 
of normal traffic. All 71 share an identical stack trace and all 71 leaked 
spans have `endEpochNanos=0`.
**Stack trace (all 71 entries):**
```
io.opentelemetry.context.Context.makeCurrent (Context.java:231)
io.opentelemetry.context.ImplicitContextKeyed.makeCurrent 
(ImplicitContextKeyed.java:33)
org.apache.camel.opentelemetry2.OpenTelemetrySpanAdapter.makeCurrent 
(OpenTelemetrySpanAdapter.java:47)
org.apache.camel.opentelemetry2.OpenTelemetryTracer$OpentelemetrySpanLifecycleManager.activate
 (OpenTelemetryTracer.java:141)
org.apache.camel.telemetry.Tracer.beginEventSpan (Tracer.java:267)
org.apache.camel.telemetry.Tracer$TracingRoutePolicy.onExchangeBegin 
(Tracer.java:220)
org.apache.camel.impl.engine.CamelInternalProcessor$RoutePolicyAdvice.before 
(CamelInternalProcessor.java:554)
org.apache.camel.impl.engine.CamelInternalProcessor.process 
(CamelInternalProcessor.java:324)
```
**Per-exchange pattern:**
- 41 exchanges produced 71 leaks
- 26 exchanges leaked 2 scopes: `[router-route span, child-route span]`
- 15 exchanges leaked 3 scopes: `[router-route span, router-route span, 
child-route span]`
- All 71 leaks occur on the same Tomcat thread as the originating exchange (no 
async/cross-thread handoff for the HTTP side)
The double router-route leak in the 3-scope pattern may indicate 
`TracingRoutePolicy` is registered twice on some routes, but the core issue 
(spans never ended) affects all patterns.
**Additional observation — wireTap executor threads also affected:**
Heap dump analysis from Apr 29 shows `wireTapIoExecutor` threads accumulating 
OTel contexts via the same ThreadLocal mechanism. In the most leaked pod 
observed (879 MB live heap), two of the four wireTap executor threads each 
retained ~15.9 MB of OTel context chains while the other two remained at 
baseline (~147 KB). The selectively affected threads are those that have 
processed exchanges on the child handler routes (route 3 above), which contain 
multiple `.wireTap()` calls. The wireTap executor threads are not Tomcat 
servlet threads, so the leak on these threads follows a different call path 
than the HTTP side but appears to share the same root cause: 
`TracingRoutePolicy.onExchangeBegin` opening a scope that `endEventSpan` never 
closes.
**Span names from the leaking routes:**
| Leaks | Span Name |
|---|---|
| 31 | `corsair/carrier-direct-documents` (child route) |
| 18 | `carrier-direct-tracking-rest-consumer-*-router` (router route) |
| 12 | `carrier-direct-document-rest-consumer-*-router` (router route) |
| 10 | `corsair/carrier-direct-tracking` (child route) |
## Route structure
The leaking routes follow a standard REST DSL pattern with three routes per 
endpoint:
1. `rest:post:/path` (auto-generated, `routeId = X--rest`) calls 
`.to("direct:X-router")`
2. `direct:X-router` (`routeId = X-router`) dispatches via 
`toD("direct:${header.ROUTE_TO}")`
3. `direct:X-<version>` (`routeId = X::<version>`) is the handler with multiple 
`.wireTap()` calls
Spans for routes 2 and 3 leak. Route 1's span does not appear in the 
strict-context output.
## Heap dump confirmation
Production heap dumps across multiple pods and dates show linear growth of OTel 
objects retained by Tomcat `http-nio-*-exec` ThreadLocals, with 
`wireTapIoExecutor` threads also affected at higher uptimes:
| Date | Uptime | TaskThread Retained | Live Heap (MAT) | `SdkSpan` Count |
|---|---|---|---|---|
| Apr 17 | 0.2h | not measurable | ~470 MB (baseline) | — |
| Apr 17 | 10.8h | 104 MB (16.8%) | 593 MB | 23,188 |
| Apr 17 | 23.9h | 201 MB (26.2%) | 733 MB | 47,145 |
| Apr 24 | 23.9h | 209 MB (30.1%) | 697 MB | 51,444 |
| Apr 29 | high uptime | **309 MB (33.5%)** | **879 MB** | **83,138** |
Growth rate: ~8 MB/hour across HTTP threads. With an 8 GB max heap, pods OOM 
after approximately 40-50 hours. `SdkSpan` count (unended spans) grows 
proportionally and serves as the most direct indicator of leak severity.
## Configuration
```yaml
camel:
  opentelemetry2:
    enabled: true
    trace-processors: false   # only route-level spans, no per-processor spans
```
## Environment
| Dependency | Version |
|---|---|
| `camel-opentelemetry2-starter` | 4.14.5 |
| `camel-telemetry` | 4.14.5 (transitive) |
| `io.opentelemetry:opentelemetry-bom` | 1.60.1 |
| `io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom` | 2.26.1 
|
| `io.micrometer:micrometer-tracing-bridge-otel` | 1.5.10 |
| Spring Boot | 3.5.13 |
| JDK | 25.0.1+8-LTS |
Note: `CAMEL-21302` (async producer context leak, fixed in 4.4.5) does not 
cover this case. Our leak is synchronous, same-thread, and caused by 
`beginEventSpan`/`endEventSpan` lifecycle mismatch rather than cross-thread 
scope closure.
## Expected behavior
Every `beginEventSpan()` call in `TracingRoutePolicy.onExchangeBegin` should 
have a matching `endEventSpan()` that ends the span and closes the scope when 
the exchange completes on that route.
## Workaround
A servlet filter calling `Context.root().makeCurrent().close()` after each 
request caps the heap impact on Tomcat HTTP threads but does not fix the 
underlying issue (spans still go unended, trace data is lost). It also has no 
effect on `wireTapIoExecutor` threads or other non-servlet thread pools, which 
accumulate OTel contexts via the same mechanism but are not covered by the 
filter chain.


> `TracingRoutePolicy.onExchangeBegin` opens scopes and spans that are never 
> closed/ended
> ---------------------------------------------------------------------------------------
>
>                 Key: CAMEL-23380
>                 URL: https://issues.apache.org/jira/browse/CAMEL-23380
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-opentelemetry
>    Affects Versions: 4.14.5
>            Reporter: Nikita
>            Priority: Critical
>
> *Summary*
> `Tracer$TracingRoutePolicy.onExchangeBegin()` calls `beginEventSpan()` which 
> opens an OTel scope via `makeCurrent()`, but the corresponding 
> `endEventSpan()` is never reached. The span is never ended 
> (`endEpochNanos=0`) and the scope is never closed. On pooled Tomcat threads, 
> this causes `ArrayBasedContext` objects to accumulate in `ThreadLocal` 
> storage, growing ~8 MB/hour until OOM.
> *Evidence*
> With `-Dio.opentelemetry.context.enableStrictContext=true` enabled, *{*}71 
> `Scope garbage collected before being closed` errors{*}* were captured over 
> 15 minutes of normal traffic. All 71 share an identical stack trace and all 
> 71 leaked spans have `endEpochNanos=0`.
> *{*}Stack trace (all 71 entries):{*}*
> ```
> io.opentelemetry.context.Context.makeCurrent (Context.java:231)
> io.opentelemetry.context.ImplicitContextKeyed.makeCurrent 
> (ImplicitContextKeyed.java:33)
> org.apache.camel.opentelemetry2.OpenTelemetrySpanAdapter.makeCurrent 
> (OpenTelemetrySpanAdapter.java:47)
> org.apache.camel.opentelemetry2.OpenTelemetryTracer$OpentelemetrySpanLifecycleManager.activate
>  (OpenTelemetryTracer.java:141)
> org.apache.camel.telemetry.Tracer.beginEventSpan (Tracer.java:267)
> org.apache.camel.telemetry.Tracer$TracingRoutePolicy.onExchangeBegin 
> (Tracer.java:220)
> org.apache.camel.impl.engine.CamelInternalProcessor$RoutePolicyAdvice.before 
> (CamelInternalProcessor.java:554)
> org.apache.camel.impl.engine.CamelInternalProcessor.process 
> (CamelInternalProcessor.java:324)
> ```
> *{*}Per-exchange pattern:{*}*
>  - 41 exchanges produced 71 leaks
>  - 26 exchanges leaked 2 scopes: `[router-route span, child-route span]`
>  - 15 exchanges leaked 3 scopes: `[router-route span, router-route span, 
> child-route span]`
> All 71 leaks occur on the same Tomcat thread as the originating exchange (no 
> async/cross-thread handoff for the HTTP side)
> The double router-route leak in the 3-scope pattern may indicate 
> `TracingRoutePolicy` is registered twice on some routes, but the core issue 
> (spans never ended) affects all patterns.
> *{*}Additional observation — wireTap executor threads also affected:{*}*
> Heap dump analysis from Apr 29 shows `wireTapIoExecutor` threads accumulating 
> OTel contexts via the same ThreadLocal mechanism. In the most leaked pod 
> observed (879 MB live heap), two of the four wireTap executor threads each 
> retained ~15.9 MB of OTel context chains while the other two remained at 
> baseline (~147 KB). The selectively affected threads are those that have 
> processed exchanges on the child handler routes (route 3 above), which 
> contain multiple `.wireTap()` calls. The wireTap executor threads are not 
> Tomcat servlet threads, so the leak on these threads follows a different call 
> path than the HTTP side but appears to share the same root cause: 
> `TracingRoutePolicy.onExchangeBegin` opening a scope that `endEventSpan` 
> never closes.
> *{*}Span names from the leaking routes:{*}*
> |Leaks|Span Name|
> |—|—|
> |31|`corsair/carrier-direct-documents` (child route)|
> |18|`carrier-direct-tracking-rest-consumer-*-router` (router route)|
> |12|`carrier-direct-document-rest-consumer-*-router` (router route)|
> |10|`corsair/carrier-direct-tracking` (child route)|
> *Route structure*
> The leaking routes follow a standard REST DSL pattern with three routes per 
> endpoint:
> 1. `rest:post:/path` (auto-generated, `routeId = X--rest`) calls 
> `.to("direct:X-router")`
> 2. `direct:X-router` (`routeId = X-router`) dispatches via 
> `toD("direct:${header.ROUTE_TO}")`
> 3. `direct:X-<version>` (`routeId = X::<version>`) is the handler with 
> multiple `.wireTap()` calls
> Spans for routes 2 and 3 leak. Route 1's span does not appear in the 
> strict-context output.
> *Heap dump confirmation*
> Production heap dumps across multiple pods and dates show linear growth of 
> OTel objects retained by Tomcat `http-nio-*-exec` ThreadLocals, with 
> `wireTapIoExecutor` threads also affected at higher uptimes:
> |Date|Uptime|TaskThread Retained|Live Heap (MAT)|`SdkSpan` Count|
> |—|—|—|—|—|
> |Apr 17|0.2h|not measurable|~470 MB (baseline)|—|
> |Apr 17|10.8h|104 MB (16.8%)|593 MB|23,188|
> |Apr 17|23.9h|201 MB (26.2%)|733 MB|47,145|
> |Apr 24|23.9h|209 MB (30.1%)|697 MB|51,444|
> |Apr 29|high uptime|*{*}309 MB (33.5%){*}*|*{*}879 MB{*}*|*{*}83,138{*}*|
> Growth rate: ~8 MB/hour across HTTP threads. With an 8 GB max heap, pods OOM 
> after approximately 40-50 hours. `SdkSpan` count (unended spans) grows 
> proportionally and serves as the most direct indicator of leak severity.
> *Configuration*
> ```yaml
> camel:
>   opentelemetry2:
>     enabled: true
>     trace-processors: false   # only route-level spans, no per-processor spans
> ```
> *Environment*
> |Dependency|Version|
> |—|—|
> |`camel-opentelemetry2-starter`|4.14.5|
> |`camel-telemetry`|4.14.5 (transitive)|
> |`io.opentelemetry:opentelemetry-bom`|1.60.1|
> |`io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom`|2.26.1|
> |`io.micrometer:micrometer-tracing-bridge-otel`|1.5.10|
> |Spring Boot|3.5.13|
> |JDK|25.0.1+8-LTS|
> Note: `CAMEL-21302` (async producer context leak, fixed in 4.4.5) does not 
> cover this case. Our leak is synchronous, same-thread, and caused by 
> `beginEventSpan`/`endEventSpan` lifecycle mismatch rather than cross-thread 
> scope closure.
> *Expected behavior*
> Every `beginEventSpan()` call in `TracingRoutePolicy.onExchangeBegin` should 
> have a matching `endEventSpan()` that ends the span and closes the scope when 
> the exchange completes on that route.
> *Workaround*
> A servlet filter calling `Context.root().makeCurrent().close()` after each 
> request caps the heap impact on Tomcat HTTP threads but does not fix the 
> underlying issue (spans still go unended, trace data is lost). It also has no 
> effect on `wireTapIoExecutor` threads or other non-servlet thread pools, 
> which accumulate OTel contexts via the same mechanism but are not covered by 
> the filter chain.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to