Chi-Hsuan Huang created HDDS-15670:
--------------------------------------

             Summary: Ratis graceful channel shutdown floods logs under high 
client concurrency
                 Key: HDDS-15670
                 URL: https://issues.apache.org/jira/browse/HDDS-15670
             Project: Apache Ozone
          Issue Type: Bug
    Affects Versions: 2.3.0
            Reporter: Chi-Hsuan Huang


Running {{ozone freon dfsg}} with a high thread count produces many WARN lines 
like:

{code}
WARN grpc.GrpcUtil: Timed out gracefully shutting down connection:
ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=..., 
target=10.15.25.x:9858}}.
{code}

Reproduce:

{code}
ozone freon dfsg -s 268435456 --prefix beg0i3ghm2 \
  --path ofs://ozone/ratis-vol/andrey -n10000 -t160 \
  --buffer=1048576 --copy-buffer=1048576
{code}

h3. Source of the message

The line is logged by {{org.apache.ratis.grpc.GrpcUtil}} (log4j abbreviates 
the logger to {{grpc.GrpcUtil}}), not by grpc-core's orphan detector. 
{{target=...:9858}} is the datanode Ratis IPC port 
({{HDDS_CONTAINER_RATIS_IPC_PORT_DEFAULT}}). The connections are 
{{XceiverClientRatis}} / RaftClient channels.

h3. What actually happens

These are not leaked channels. They are closed on the normal path: 
{{FileSystem.close -> OzoneClient.close -> RpcClient.close -> 
XceiverClientManager.close}} shuts down each Ratis client, and Ratis attempts a 
graceful gRPC channel shutdown. Under high concurrency ({{-t160}}, with each 
thread owning its own FileSystem because {{fs..impl.disable.cache=true}}), 
many Ratis clients close almost simultaneously. The graceful drain then exceeds 
the Ratis grace window, so Ratis logs the WARN and falls back to a forceful 
shutdown. Only {{gracefully}} lines appear (no {{forcefully}} ones), which 
indicates that the forceful shutdown completes and the channels do terminate.
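
The graceful-then-forceful sequence described above can be sketched with a standard-library stand-in. This is a minimal illustration, not the Ratis implementation: {{ExecutorService}} is used only because it exposes the same {{shutdown}} / {{awaitTermination}} / {{shutdownNow}} shape as a gRPC {{ManagedChannel}}. The class name, the {{Outcome}} enum, the timeouts and the log text are all assumptions made for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a two-phase channel shutdown; not Ratis code. */
final class GracefulShutdownSketch {

  /** How the shutdown ended (names are illustrative). */
  enum Outcome { GRACEFUL, FORCED, STUCK, INTERRUPTED }

  static Outcome shutdown(ExecutorService resource, long graceMillis, long forceMillis) {
    try {
      // Phase 1: stop accepting new work and let in-flight work drain.
      resource.shutdown();
      if (resource.awaitTermination(graceMillis, TimeUnit.MILLISECONDS)) {
        return Outcome.GRACEFUL;
      }
      // The drain outlived the grace window: this is where a WARN would be logged.
      System.out.println("WARN timed out gracefully shutting down");

      // Phase 2: interrupt whatever is still running.
      resource.shutdownNow();
      if (resource.awaitTermination(forceMillis, TimeUnit.MILLISECONDS)) {
        return Outcome.FORCED;
      }
      System.out.println("WARN timed out forcefully shutting down");
      return Outcome.STUCK;
    } catch (InterruptedException e) {
      // The caller was interrupted while waiting: force the shutdown and
      // restore the interrupt flag for the caller.
      resource.shutdownNow();
      Thread.currentThread().interrupt();
      return Outcome.INTERRUPTED;
    }
  }

  public static void main(String[] args) {
    ExecutorService busy = Executors.newSingleThreadExecutor();
    busy.execute(() -> {
      try {
        Thread.sleep(5_000); // simulates a drain slower than the grace window
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // the forced shutdown lands here
      }
    });
    System.out.println(shutdown(busy, 100, 1_000));
  }
}
```

In this sketch a run that prints only the first WARN and returns {{FORCED}} corresponds to the situation reported here: the graceful phase timed out, but the resource still terminated.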

h3. Impact

Log noise and slower shutdown under high client concurrency. No functional 
failure, no descriptor leak. Likely affects any high-concurrency FileSystem 
consumer, not just Freon.

h3. Open question for triage

Whether this warrants a code change (tuning the Ratis client shutdown grace 
period, or bounding/parallelizing client shutdown) or should be treated as 
expected behaviour and handled via log level.
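
To make the "bounding/parallelizing client shutdown" option concrete, here is a rough sketch of closing a batch of clients through a fixed-size pool. It is an illustration only: {{BoundedCloseSketch}}, {{closeAll}} and {{maxParallel}} are hypothetical names, the clients are plain {{AutoCloseable}} objects rather than Ozone or Ratis classes, and whether this would actually reduce the WARN lines has not been verified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: closes clients with a cap on concurrent closes. */
final class BoundedCloseSketch {

  /** Closes every client using at most maxParallel threads; returns the number of failed closes. */
  static int closeAll(List<? extends AutoCloseable> clients, int maxParallel) {
    ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
    List<Future<Void>> pending = new ArrayList<>();
    for (AutoCloseable client : clients) {
      Callable<Void> task = () -> {
        client.close();
        return null;
      };
      pending.add(pool.submit(task));
    }
    pool.shutdown(); // no new tasks; already-submitted closes still run

    int failures = 0;
    for (Future<Void> f : pending) {
      try {
        f.get(); // wait for this close to finish
      } catch (ExecutionException e) {
        failures++; // close() threw; keep going so the other clients still close
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        failures++;
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    List<AutoCloseable> clients = new ArrayList<>();
    for (int i = 0; i < 8; i++) {
      clients.add(() -> Thread.sleep(50)); // simulated slow close
    }
    System.out.println("failed closes: " + closeAll(clients, 4));
  }
}
```

The cap trades total shutdown time against the number of channels draining at once; the right value, if any, would need measurement.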

Filed separately from HDDS-14474 because the root cause and component differ. 
Relates to HDDS-14474.



