[
https://issues.apache.org/jira/browse/RATIS-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated RATIS-2426:
-------------------------------
Attachment: dn.20260308.1948.1748116.heap_Leak_Suspects.zip
dn.20260308.1948.1748116.heap_System_Overview.zip
dn.20260308.1948.1748116.heap_Top_Components.zip
> Fix memory leak in ServerRequestStreamObserver
> ----------------------------------------------
>
> Key: RATIS-2426
> URL: https://issues.apache.org/jira/browse/RATIS-2426
> Project: Ratis
> Issue Type: Bug
> Components: gRPC
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: dn.20260308.1948.1748116.heap_Leak_Suspects.zip,
> dn.20260308.1948.1748116.heap_System_Overview.zip,
> dn.20260308.1948.1748116.heap_Top_Components.zip
>
>
> We encountered an issue where an Ozone datanode's heap memory increased
> suddenly, causing high GC pressure and leaving the DN stuck. Analysis of
> the heap dump suggests memory leaks caused by the following issues:
> # StreamObservers.stop() doesn't close gRPC stream
> # onCompleted()/onError() don't clear previousOnNext and requestFuture
> (Edit: technically this is not strictly necessary, since once issue 1 is
> resolved the follower's ServerRequestStreamObserver becomes GC-eligible,
> including its previousOnNext and requestFuture references)
> # StreamObservers#onCompleted is not called if the GrpcLogAppender is
> interrupted (e.g. due to Thread.interrupt in LogAppenderDaemon#tryToClose)
> AI analysis for issue 1
> {quote}Root cause confirmed: `GrpcLogAppender.StreamObservers.stop()` does
> not close gRPC streams.
> The leak path is:
> 1. `resetClient()` (line 203) is called on error/timeout/inconsistency with
> a follower
> 2. It calls `appendLogRequestObserver.stop()` — which only sets running =
> false
> 3. It then sets appendLogRequestObserver = null — dropping the Java
> reference
> 4. But the underlying gRPC `CallStreamObserver` is never closed — no
> onCompleted(), no onError(), no RST_STREAM sent
> 5. On the server side (follower datanode), the ServerRequestStreamObserver
> stays alive, holding AppendEntriesRequestProto → LogEntryProto → ByteString
> (4MB chunk data) via the previousOnNext reference
> 6. The HTTP/2 stream stays open in DefaultHttp2Connection$DefaultStream —
> this is exactly what MAT showed retaining 99.08% of the heap
> Compare with the clean shutdown at line 267, which correctly calls
> StreamObservers.onCompleted() → appendLog.onCompleted() → properly closes
> both streams.
> Each resetClient() leaks 1-2 HTTP/2 streams. With frequent leader changes,
> timeouts, and retries across many pipelines, this accumulated to 112K leaked
> streams / 52.2 GB.
> The fix is to modify StreamObservers.stop() to also call
> onError(Status.CANCELLED) on the CallStreamObserver to send RST_STREAM and
> release server-side resources immediately.
> {quote}
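> The proposed fix could look roughly like the following. This is a hedged
> sketch, not the actual Ratis patch: the StreamObserver interface and the
> StreamObservers class below are simplified stand-ins for the real
> io.grpc.stub.StreamObserver and GrpcLogAppender.StreamObservers, so the
> example runs without a gRPC dependency (the real fix would call
> onError(Status.CANCELLED.asException()) on the gRPC stub).

```java
import java.util.concurrent.CancellationException;

// Simplified stand-in for io.grpc.stub.StreamObserver.
interface StreamObserver<T> {
    void onNext(T value);
    void onError(Throwable t);
    void onCompleted();
}

// Records whether the stream was actually terminated.
class RecordingObserver implements StreamObserver<String> {
    boolean closed = false;
    Throwable error = null;
    @Override public void onNext(String value) { }
    @Override public void onError(Throwable t) { closed = true; error = t; }
    @Override public void onCompleted() { closed = true; }
}

// Sketch of the proposed stop(): besides flipping the running flag, it
// cancels the underlying stream so the server side can release the
// HTTP/2 stream and the retained request payloads.
class StreamObservers {
    private final StreamObserver<String> appendLog;
    private volatile boolean running = true;

    StreamObservers(StreamObserver<String> appendLog) {
        this.appendLog = appendLog;
    }

    void stop() {
        running = false;
        // Old code stopped here, leaving the stream open on the server.
        appendLog.onError(new CancellationException("stream stopped"));
    }
}

public class StopSketch {
    public static void main(String[] args) {
        RecordingObserver observer = new RecordingObserver();
        new StreamObservers(observer).stop();
        System.out.println("closed=" + observer.closed); // closed=true
    }
}
```

> With the old stop(), only running would be set to false and the observer
> would stay open, mirroring the leaked server-side stream described above.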
> AI analysis for issue 2
> {quote}Neither onCompleted() nor onError() clears previousOnNext. This is a
> secondary bug that makes the leak worse even when streams are technically
> "closed":
> 1. previousOnNext holds a PendingServerRequest<AppendEntriesRequestProto>,
> which holds the last AppendEntriesRequestProto — containing LogEntryProto
> with 4MB ByteString chunk data.
> 2. Even after onCompleted()/onError() sets isClosed = true, the
> previousOnNext reference keeps the last request's payload alive for as long
> as the ServerRequestStreamObserver object exists.
> 3. In onCompleted(), requestFuture.get().thenAccept(...) creates a
> CompletableFuture chain. If the chain hasn't completed yet, the lambda
> captures the observer, preventing GC until all chained futures resolve.
> During this time, the 4MB payload is still pinned.
> 4. Similarly, requestFuture itself chains all processing via
> composeRequest():
> {quote}
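> Clearing those references on close could be sketched as below. This is a
> simplified stand-in, not the real ServerRequestStreamObserver (whose field
> types differ); it only illustrates dropping previousOnNext and the
> requestFuture chain so the last payload becomes GC-eligible even while the
> observer object itself is still referenced.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in: on close, drop the previousOnNext payload and the
// requestFuture chain so the last request's chunk data (up to 4MB) can be
// collected even if the observer outlives the stream.
class ServerObserverSketch {
    final AtomicReference<byte[]> previousOnNext = new AtomicReference<>();
    volatile CompletableFuture<Void> requestFuture =
        CompletableFuture.completedFuture(null);
    volatile boolean isClosed = false;

    void onNext(byte[] request) {
        previousOnNext.set(request); // retains the last payload
    }

    void close() { // shared by onCompleted() and onError()
        isClosed = true;
        previousOnNext.set(null); // unpin the last payload
        requestFuture = null;     // break the future-chain reference
    }
}

public class ClearRefsSketch {
    public static void main(String[] args) {
        ServerObserverSketch observer = new ServerObserverSketch();
        observer.onNext(new byte[4 * 1024 * 1024]);
        observer.close();
        System.out.println("pinned=" + (observer.previousOnNext.get() != null));
        // prints pinned=false
    }
}
```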
> AI analysis for issue 3
> {quote}Root cause of the test failure: The run() method's stream cleanup on
> line 284 was not in a finally block. When the appender thread was interrupted
> during onNext() (visible in the logs as InterruptedIOException: Interrupted
> appendLog), the exception propagated past the cleanup code, leaving the
> server-side observer open forever. This is the exact leak we were trying to
> detect.
> The fix: wrapping the loop in a try \{ ... } finally \{ onCompleted(); }
> block ensures the server-side stream observer is always closed, even when
> the appender exits via an exception (e.g., InterruptedIOException from
> thread interruption during a restart).
> {quote}
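> The try/finally pattern above can be sketched as follows (simplified; the
> real GrpcLogAppender run loop does more than this, and the class and method
> names here are illustrative only):

```java
// Sketch of the fix for issue 3: close the stream observer in a finally
// block so an exception thrown mid-loop (e.g. an InterruptedIOException
// during a restart) cannot skip the cleanup.
class AppenderLoopSketch {
    boolean completedCalled = false;

    void onCompleted() {
        completedCalled = true;
    }

    void runAppender(Runnable loopBody) {
        try {
            loopBody.run(); // may throw when the thread is interrupted
        } finally {
            onCompleted(); // always close the server-side observer
        }
    }
}

public class FinallySketch {
    public static void main(String[] args) {
        AppenderLoopSketch appender = new AppenderLoopSketch();
        try {
            appender.runAppender(() -> {
                throw new RuntimeException("Interrupted appendLog");
            });
        } catch (RuntimeException expected) {
            // the appender exited via exception, but cleanup still ran
        }
        System.out.println("completed=" + appender.completedCalled);
        // prints completed=true
    }
}
```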
--
This message was sent by Atlassian Jira
(v8.20.10#820010)