[
https://issues.apache.org/jira/browse/RATIS-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated RATIS-2426:
-------------------------------
Component/s: gRPC
> Fix memory leak in ServerRequestStreamObserver
> ----------------------------------------------
>
> Key: RATIS-2426
> URL: https://issues.apache.org/jira/browse/RATIS-2426
> Project: Ratis
> Issue Type: Bug
> Components: gRPC
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> We encountered Ozone datanodes whose heap usage increased suddenly, causing
> heavy GC activity and performance degradation. Analysis of the heap dump
> suggests two causes:
> # StreamObservers.stop() does not close the gRPC stream
> # onCompleted()/onError() do not clear previousOnNext
> AI analysis for issue 1:
> {quote}Root cause confirmed: `GrpcLogAppender.StreamObservers.stop()` does
> not close gRPC streams.
> The leak path is:
> 1. `resetClient()` (line 203) is called on error/timeout/inconsistency with
> a follower
> 2. It calls `appendLogRequestObserver.stop()` — which only sets running =
> false
> 3. It then sets appendLogRequestObserver = null — dropping the Java
> reference
> 4. But the underlying gRPC `CallStreamObserver` is never closed — no
> onCompleted(), no onError(), no RST_STREAM sent
> 5. On the server side (follower datanode), the ServerRequestStreamObserver
> stays alive, holding AppendEntriesRequestProto → LogEntryProto → ByteString
> (4MB chunk data) via the previousOnNext reference
> 6. The HTTP/2 stream stays open in DefaultHttp2Connection$DefaultStream —
> this is exactly what MAT showed retaining 99.08% of the heap
> Compare with the clean shutdown at line 267, which correctly calls
> StreamObservers.onCompleted() → appendLog.onCompleted() → properly closes
> both streams.
> Each resetClient() leaks 1-2 HTTP/2 streams. With frequent leader changes,
> timeouts, and retries across many pipelines, this accumulated to 112K leaked
> streams / 52.2 GB.
> The fix is to modify StreamObservers.stop() so that it also calls onError()
> with a Status.CANCELLED exception on the CallStreamObserver, which sends
> RST_STREAM and releases server-side resources immediately. I've updated the
> markdown file with the detailed fix code.
> {quote}
>
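The two issues in the description can be illustrated with a small model. This is
a minimal sketch under stated assumptions, not the Ratis code or the actual
patch: Observer is a hypothetical stand-in for io.grpc.stub.StreamObserver,
ServerObserver and ClientStream are illustrative names, a byte[] stands in for
AppendEntriesRequestProto, CancellationException stands in for a
Status.CANCELLED exception, and the HTTP/2 transport is elided so the client
calls the server-side observer directly.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

final class StreamLeakSketch {
  /** Hypothetical stand-in for a gRPC stream observer (same three callbacks). */
  interface Observer<T> {
    void onNext(T value);
    void onError(Throwable t);
    void onCompleted();
  }

  /** Models the follower side: it keeps a reference to the last request seen. */
  static final class ServerObserver implements Observer<byte[]> {
    private final AtomicReference<byte[]> previousOnNext = new AtomicReference<>();
    private volatile boolean closed;

    @Override public void onNext(byte[] request) { previousOnNext.set(request); }
    @Override public void onError(Throwable t) { release(); }
    @Override public void onCompleted() { release(); }

    /** Issue 2: a terminal callback drops the retained request. */
    private void release() {
      closed = true;
      previousOnNext.set(null);
    }

    boolean isClosed() { return closed; }
    boolean retainsRequest() { return previousOnNext.get() != null; }
  }

  /** Models the leader-side wrapper around the request stream. */
  static final class ClientStream {
    private final Observer<byte[]> call;
    private final AtomicBoolean running = new AtomicBoolean(true);

    ClientStream(Observer<byte[]> call) { this.call = call; }

    void send(byte[] request) {
      if (running.get()) {
        call.onNext(request);
      }
    }

    /** Leaky variant: only flips the flag, so the remote side is never notified. */
    void stopWithoutClosing() { running.set(false); }

    /** Issue 1: flip the flag and terminate the call, at most once. */
    void stop() {
      if (running.compareAndSet(true, false)) {
        call.onError(new CancellationException("stream stopped"));
      }
    }
  }

  public static void main(String[] args) {
    ServerObserver leaked = new ServerObserver();
    ClientStream first = new ClientStream(leaked);
    first.send(new byte[1024]);
    first.stopWithoutClosing();
    System.out.println("without close: retained=" + leaked.retainsRequest()); // true

    ServerObserver released = new ServerObserver();
    ClientStream second = new ClientStream(released);
    second.send(new byte[1024]);
    second.stop();
    System.out.println("with close: retained=" + released.retainsRequest()); // false
  }
}
```

In the model, the compareAndSet guard keeps stop() idempotent, so a repeated
reset does not signal the same call twice. Whether the real stop() needs the
same guard depends on how resetClient() is invoked, which the description does
not state.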
--
This message was sent by Atlassian Jira
(v8.20.10#820010)