Ivan Andika created RATIS-2426:
----------------------------------

             Summary: Fix memory leak in ServerRequestStreamObserver
                 Key: RATIS-2426
                 URL: https://issues.apache.org/jira/browse/RATIS-2426
             Project: Ratis
          Issue Type: Bug
            Reporter: Ivan Andika
            Assignee: Ivan Andika


We observed sudden heap growth on Ozone datanodes, which caused heavy GC 
activity and performance degradation. Analysis of a heap dump points to two 
causes:
 # StreamObservers.stop() doesn't close gRPC stream
 # onCompleted()/onError() don't clear previousOnNext
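
To illustrate item 2, here is a minimal sketch of an observer that drops its reference to the last request once the stream terminates. It is not the Ratis class: RequestObserverSketch, releasePrevious() and holdsRequest() are hypothetical names, and a byte[] stands in for the request proto that previousOnNext would otherwise keep reachable.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a server-side request observer; not the Ratis class. */
class RequestObserverSketch {
  // Reference to the most recently received request (a byte[] stands in for the proto).
  private final AtomicReference<byte[]> previousOnNext = new AtomicReference<>();

  void onNext(byte[] request) {
    // Remember the latest request; the previous one becomes unreachable from here.
    previousOnNext.set(request);
  }

  void onCompleted() {
    releasePrevious();
  }

  void onError(Throwable t) {
    releasePrevious();
  }

  /** Clears the reference so the last request can be garbage collected. */
  private void releasePrevious() {
    previousOnNext.set(null);
  }

  /** Returns true while a request is still retained. */
  boolean holdsRequest() {
    return previousOnNext.get() != null;
  }
}
```

Without the releasePrevious() calls, the last request stays reachable for as long as the observer itself is reachable, even after the stream has ended.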

AI analysis for issue 1:
{quote}Root cause confirmed: `GrpcLogAppender.StreamObservers.stop()` does not 
close gRPC streams.
  The leak path is:
  1. `resetClient()` (line 203) is called on error/timeout/inconsistency with a 
follower
  2. It calls `appendLogRequestObserver.stop()` — which only sets running = 
false
  3. It then sets appendLogRequestObserver = null — dropping the Java reference
  4. But the underlying gRPC `CallStreamObserver` is never closed — no 
onCompleted(), no onError(), no RST_STREAM sent
  5. On the server side (follower datanode), the ServerRequestStreamObserver 
stays alive, holding AppendEntriesRequestProto → LogEntryProto → ByteString 
(4MB chunk data) via the previousOnNext reference
  6. The HTTP/2 stream stays open in DefaultHttp2Connection$DefaultStream — 
this is exactly what MAT showed retaining 99.08% of the heap

  Compare with the clean shutdown at line 267, which correctly calls 
StreamObservers.onCompleted() → appendLog.onCompleted() → properly closes both 
streams.
  Each resetClient() leaks 1-2 HTTP/2 streams. With frequent leader changes, 
timeouts, and retries across many pipelines, this accumulated to 112K leaked 
streams / 52.2 GB.
  The fix is to modify StreamObservers.stop() to also call 
onError(Status.CANCELLED) on the CallStreamObserver to send RST_STREAM and 
release server-side resources immediately. I've updated the markdown file with 
the detailed fix code.
{quote}
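
The fix proposed in the quote can be sketched as follows. This is a simplified model rather than the actual patch: RequestStream, RecordingStream and StreamObserversSketch are hypothetical stand-ins for the gRPC and Ratis types, the optional heartbeat stream is an assumption, and a CancellationException takes the place of the gRPC cancellation status. With real gRPC the cause passed to onError() would presumably be built from Status.CANCELLED, for example via asException().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CancellationException;

/** Hypothetical, simplified stand-in for the client half of a gRPC request stream. */
interface RequestStream {
  void onCompleted();
  void onError(Throwable t);
}

/** Test double that records which terminal signal the stream received. */
class RecordingStream implements RequestStream {
  final List<String> events = new ArrayList<>();

  @Override
  public void onCompleted() {
    events.add("completed");
  }

  @Override
  public void onError(Throwable t) {
    events.add("error: " + t.getMessage());
  }
}

/** Sketch of a StreamObservers-like holder; not the Ratis implementation. */
class StreamObserversSketch {
  private final RequestStream appendLog;
  private final RequestStream heartbeat; // assumed optional, may be null
  private volatile boolean running = true;

  StreamObserversSketch(RequestStream appendLog, RequestStream heartbeat) {
    this.appendLog = appendLog;
    this.heartbeat = heartbeat;
  }

  boolean isRunning() {
    return running;
  }

  /** Stops sending and also terminates the streams so the peer can release them. */
  void stop() {
    running = false;
    final CancellationException cancelled = new CancellationException("stopped by resetClient");
    cancel(appendLog, cancelled);
    cancel(heartbeat, cancelled);
  }

  private static void cancel(RequestStream stream, Throwable cause) {
    if (stream == null) {
      return;
    }
    try {
      stream.onError(cause);
    } catch (RuntimeException e) {
      // The stream may already be closed; stopping is best effort.
    }
  }
}
```

The point of the sketch is that stop() sends a terminal signal on every open stream before the caller drops its reference, so the remote side is told the stream has ended instead of waiting on it indefinitely.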
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
