Ivan Andika created RATIS-2426:
----------------------------------
Summary: Fix memory leak in ServerRequestStreamObserver
Key: RATIS-2426
URL: https://issues.apache.org/jira/browse/RATIS-2426
Project: Ratis
Issue Type: Bug
Reporter: Ivan Andika
Assignee: Ivan Andika
We encountered cases where the heap memory of Ozone datanodes increased suddenly,
causing heavy GC activity and performance degradation. Analysis of the memory dump
suggests two causes:
# StreamObservers.stop() doesn't close gRPC stream
# onCompleted()/onError() don't clear previousOnNext
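To illustrate the second item, here is a minimal sketch of an observer that drops its previousOnNext reference when the stream ends. It is a simplified stand-in, not the actual Ratis ServerRequestStreamObserver: the class name RequestObserverSketch, the clearOnClose flag, and the use of byte[] as the request type are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Simplified stand-in for a server-side request stream observer (hypothetical, not Ratis code). */
class RequestObserverSketch {
  // Reference to the most recent request; in the real server this would pin the request payload.
  private final AtomicReference<byte[]> previousOnNext = new AtomicReference<>();
  // Hypothetical switch so both behaviors can be compared side by side.
  private final boolean clearOnClose;

  RequestObserverSketch(boolean clearOnClose) {
    this.clearOnClose = clearOnClose;
  }

  void onNext(byte[] request) {
    previousOnNext.set(request);
  }

  void onCompleted() {
    release();
  }

  void onError(Throwable t) {
    release();
  }

  // With clearOnClose, the last request becomes unreachable as soon as the stream ends.
  private void release() {
    if (clearOnClose) {
      previousOnNext.set(null);
    }
  }

  boolean retainsPayload() {
    return previousOnNext.get() != null;
  }

  public static void main(String[] args) {
    RequestObserverSketch leaky = new RequestObserverSketch(false);
    leaky.onNext(new byte[4 * 1024 * 1024]);
    leaky.onCompleted();
    System.out.println("without clearing, payload retained: " + leaky.retainsPayload());

    RequestObserverSketch fixed = new RequestObserverSketch(true);
    fixed.onNext(new byte[4 * 1024 * 1024]);
    fixed.onCompleted();
    System.out.println("with clearing, payload retained: " + fixed.retainsPayload());
  }
}
```

Clearing the reference matters only while something else still holds the observer itself (for example an open stream); once the observer is unreachable, its payload is collectable either way.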
AI analysis for issue 1:
{quote}Root cause confirmed: `GrpcLogAppender.StreamObservers.stop()` does not
close gRPC streams.
The leak path is:
1. `resetClient()` (line 203) is called on error/timeout/inconsistency with a
follower
2. It calls `appendLogRequestObserver.stop()` — which only sets running =
false
3. It then sets appendLogRequestObserver = null — dropping the Java reference
4. But the underlying gRPC `CallStreamObserver` is never closed — no
onCompleted(), no onError(), no RST_STREAM sent
5. On the server side (follower datanode), the ServerRequestStreamObserver
stays alive, holding AppendEntriesRequestProto → LogEntryProto → ByteString
(4MB chunk data) via the previousOnNext reference
6. The HTTP/2 stream stays open in DefaultHttp2Connection$DefaultStream —
this is exactly what MAT showed retaining 99.08% of the heap
Compare with the clean shutdown at line 267, which correctly calls
StreamObservers.onCompleted() → appendLog.onCompleted() → properly closes both
streams.
Each resetClient() leaks 1-2 HTTP/2 streams. With frequent leader changes,
timeouts, and retries across many pipelines, this accumulated to 112K leaked
streams / 52.2 GB.
The fix is to modify StreamObservers.stop() to also call
onError(Status.CANCELLED) on the CallStreamObserver to send RST_STREAM and
release server-side resources immediately. I've updated the markdown file with
the detailed fix code.
{quote}
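The leak path and the proposed fix described in the quote can be modeled with a small, dependency-free sketch. It does not use gRPC or Ratis classes: FakeServer, FakeStream, stopWithoutClose(), stopAndCancel(), and leakedAfterResets() are hypothetical names, and a plain RuntimeException stands in for the cancellation status. In real gRPC Java, onError takes a Throwable, so the actual change would presumably pass something like Status.CANCELLED.asException() rather than the Status itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the stop() leak; all types here are hypothetical stand-ins. */
class StopLeakSketch {
  /** Stand-in for the follower: tracks streams that were opened and never closed. */
  static class FakeServer {
    final List<FakeStream> openStreams = new ArrayList<>();

    FakeStream open() {
      FakeStream stream = new FakeStream(this);
      openStreams.add(stream);
      return stream;
    }
  }

  /** Stand-in for one request stream that keeps the last chunk it received. */
  static class FakeStream {
    private final FakeServer server;
    private byte[] lastChunk;

    FakeStream(FakeServer server) {
      this.server = server;
    }

    void onNext(byte[] chunk) {
      lastChunk = chunk;
    }

    // Closing the stream releases the chunk and removes the stream from the server.
    void onError(Throwable t) {
      lastChunk = null;
      server.openStreams.remove(this);
    }
  }

  /** Stand-in for the leader-side observer wrapper. */
  static class StreamObservers {
    private final FakeStream appendLog;
    private volatile boolean running = true;

    StreamObservers(FakeStream appendLog) {
      this.appendLog = appendLog;
    }

    void onNext(byte[] chunk) {
      if (running) {
        appendLog.onNext(chunk);
      }
    }

    // Behavior described in the report: only the flag changes, the stream stays open.
    void stopWithoutClose() {
      running = false;
    }

    // Proposed behavior: also signal an error so the remote side can release the stream.
    void stopAndCancel() {
      running = false;
      appendLog.onError(new RuntimeException("CANCELLED"));
    }
  }

  /** Simulates repeated resets and returns how many streams the server still holds. */
  static int leakedAfterResets(int resets, boolean cancelOnStop) {
    FakeServer server = new FakeServer();
    for (int i = 0; i < resets; i++) {
      StreamObservers observers = new StreamObservers(server.open());
      observers.onNext(new byte[1024]);
      if (cancelOnStop) {
        observers.stopAndCancel();
      } else {
        observers.stopWithoutClose();
      }
      // The local reference is dropped here, mirroring appendLogRequestObserver = null.
    }
    return server.openStreams.size();
  }

  public static void main(String[] args) {
    System.out.println("stop() only sets the flag: " + leakedAfterResets(100, false) + " open streams");
    System.out.println("stop() also cancels: " + leakedAfterResets(100, true) + " open streams");
  }
}
```

In this model, dropping the leader-side reference has no effect on the follower, which is why the open-stream count grows with every reset unless stop() actively closes the stream.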