[ 
https://issues.apache.org/jira/browse/RATIS-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated RATIS-2426:
-------------------------------
    Component/s: gRPC

> Fix memory leak in ServerRequestStreamObserver
> ----------------------------------------------
>
>                 Key: RATIS-2426
>                 URL: https://issues.apache.org/jira/browse/RATIS-2426
>             Project: Ratis
>          Issue Type: Bug
>          Components: gRPC
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> We observed heap memory on Ozone datanodes increasing suddenly, causing 
> high GC activity and performance degradation. Analysis of the memory dump 
> suggests that this is due to two issues:
>  # StreamObservers.stop() does not close the gRPC stream
>  # onCompleted()/onError() do not clear previousOnNext
> AI analysis of issue 1:
> {quote}Root cause confirmed: `GrpcLogAppender.StreamObservers.stop()` does 
> not close gRPC streams.
>   The leak path is:
>   1. `resetClient()` (line 203) is called on error/timeout/inconsistency with 
> a follower
>   2. It calls `appendLogRequestObserver.stop()` — which only sets running = 
> false
>   3. It then sets appendLogRequestObserver = null — dropping the Java 
> reference
>   4. But the underlying gRPC `CallStreamObserver` is never closed — no 
> onCompleted(), no onError(), no RST_STREAM sent
>   5. On the server side (follower datanode), the ServerRequestStreamObserver 
> stays alive, holding AppendEntriesRequestProto → LogEntryProto → ByteString 
> (4MB chunk data) via the previousOnNext reference
>   6. The HTTP/2 stream stays open in DefaultHttp2Connection$DefaultStream — 
> this is exactly what MAT showed retaining 99.08% of the heap
>   Compare with the clean shutdown at line 267, which correctly calls 
> StreamObservers.onCompleted() → appendLog.onCompleted() → properly closes 
> both streams.
>   Each resetClient() leaks 1-2 HTTP/2 streams. With frequent leader changes, 
> timeouts, and retries across many pipelines, this accumulated to 112K leaked 
> streams / 52.2 GB.
>   The fix is to modify StreamObservers.stop() to also call 
> onError(Status.CANCELLED) on the CallStreamObserver to send RST_STREAM and 
> release server-side resources immediately. I've updated the markdown file 
> with the detailed fix code.
> {quote}
>  
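
The leak path and the proposed fix described above can be illustrated with a minimal, dependency-free sketch. It does not use the actual Ratis or gRPC classes: SketchObserver, SketchServerObserver, and SketchStreamObservers are hypothetical stand-ins that only mirror the shape of io.grpc.stub.StreamObserver, ServerRequestStreamObserver, and GrpcLogAppender.StreamObservers, and the real patch may look different. In grpc-java the cancellation would presumably be signalled with something like onError(Status.CANCELLED.asRuntimeException()); a plain CancellationException is used here so that the example compiles on its own.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for io.grpc.stub.StreamObserver (not the gRPC API). */
interface SketchObserver<T> {
  void onNext(T value);
  void onError(Throwable t);
  void onCompleted();
}

/** Assumed model of the server side: it keeps the last request, like previousOnNext. */
class SketchServerObserver implements SketchObserver<byte[]> {
  private final AtomicReference<byte[]> previousOnNext = new AtomicReference<>();

  @Override
  public void onNext(byte[] request) {
    previousOnNext.set(request);
  }

  // Issue 2: both terminal callbacks drop the retained request.
  @Override
  public void onError(Throwable t) {
    previousOnNext.set(null);
  }

  @Override
  public void onCompleted() {
    previousOnNext.set(null);
  }

  boolean retainsRequest() {
    return previousOnNext.get() != null;
  }
}

/** Assumed model of the client-side wrapper around the append-log stream. */
class SketchStreamObservers {
  private final SketchObserver<byte[]> appendLog;
  private volatile boolean running = true;

  SketchStreamObservers(SketchObserver<byte[]> appendLog) {
    this.appendLog = appendLog;
  }

  void onNext(byte[] request) {
    if (running) {
      appendLog.onNext(request);
    }
  }

  /** Behaviour described in the report: only the flag changes, the stream stays open. */
  void stopWithoutClosing() {
    running = false;
  }

  /** Issue 1, proposed behaviour: also terminate the stream so the peer can release state. */
  void stop() {
    running = false;
    appendLog.onError(new CancellationException("stream reset by client"));
  }
}
```

In this model, a server observer whose client wrapper was stopped with stopWithoutClosing() still reports retainsRequest() as true, while one stopped with stop() does not. In a real deployment the terminal signal travels over the HTTP/2 stream rather than through a direct method call, so the sketch only shows the ownership logic, not the transport behaviour.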



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
