Chi-Hsuan Huang created HDDS-15670:
--------------------------------------
Summary: Ratis graceful channel shutdown floods logs under high
client concurrency
Key: HDDS-15670
URL: https://issues.apache.org/jira/browse/HDDS-15670
Project: Apache Ozone
Issue Type: Bug
Affects Versions: 2.3.0
Reporter: Chi-Hsuan Huang
Running {{ozone freon dfsg}} with a high thread count produces many WARN lines
like:
{code}
WARN grpc.GrpcUtil: Timed out gracefully shutting down connection:
ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=...,
target=10.15.25.x:9858}}.
{code}
Reproduce:
{code}
ozone freon dfsg -s 268435456 --prefix beg0i3ghm2 --path \
ofs://ozone/ratis-vol/andrey -n10000 -t160 --buffer=1048576 \
--copy-buffer=1048576
{code}
h3. Source of the message
The line is logged by {{org.apache.ratis.grpc.GrpcUtil}} (log4j abbreviates
the logger to {{grpc.GrpcUtil}}), not by grpc-core's orphan detector.
{{target=...:9858}} is the datanode Ratis IPC port
({{HDDS_CONTAINER_RATIS_IPC_PORT_DEFAULT}}). The connections are
{{XceiverClientRatis}} / RaftClient channels.
h3. What actually happens
These are not leaked channels. They are closed on the normal path:
{{FileSystem.close -> OzoneClient.close -> RpcClient.close ->
XceiverClientManager.close}} shuts down each Ratis client, and Ratis attempts a
graceful gRPC channel shutdown. Under high concurrency ({{-t160}}, with each
thread owning its own FileSystem because {{fs..impl.disable.cache=true}}),
many Ratis clients close almost simultaneously and the graceful drain exceeds
the Ratis grace window, so Ratis logs the WARN and falls back to a forceful
shutdown. Only the {{gracefully}} lines appear (no {{forcefully}} ones), which
indicates that the forceful shutdown completes in time and the channels do
terminate.
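The graceful-then-forceful sequence described above can be sketched as follows. This is a minimal illustration, not the Ratis implementation: {{ExecutorService}} stands in for {{io.grpc.ManagedChannel}} (both expose {{shutdown()}}, {{shutdownNow()}} and {{awaitTermination()}}), and the class name, the {{Outcome}} enum, the timeout values and the message text are all hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ChannelShutdownSketch {
  enum Outcome { GRACEFUL, FORCED, STUCK }

  // Waits for termination; on interrupt, restores the flag and reports "not terminated".
  private static boolean await(ExecutorService channel, long millis) {
    try {
      return channel.awaitTermination(millis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  // ExecutorService is a stand-in for a gRPC ManagedChannel (assumption for this sketch).
  static Outcome shutdown(ExecutorService channel, long graceMillis, long forceMillis) {
    channel.shutdown();                       // graceful: let in-flight work drain
    if (await(channel, graceMillis)) {
      return Outcome.GRACEFUL;
    }
    // Illustrative message only; it mimics the WARN quoted in this report.
    System.err.println("WARN timed out gracefully shutting down " + channel);

    channel.shutdownNow();                    // forceful: cancel whatever is left
    if (await(channel, forceMillis)) {
      return Outcome.FORCED;                  // only the "gracefully" line was printed
    }
    System.err.println("WARN timed out forcefully shutting down " + channel);
    return Outcome.STUCK;
  }

  // Hypothetical helper: a "channel" that is still busy for workMillis.
  static ExecutorService busyChannel(long workMillis) {
    ExecutorService channel = Executors.newSingleThreadExecutor();
    channel.submit(() -> {
      try {
        Thread.sleep(workMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();   // shutdownNow() interrupts the sleep
      }
    });
    return channel;
  }

  public static void main(String[] args) {
    // Work outlasts the grace window, so the forceful path is taken.
    System.out.println(shutdown(busyChannel(5_000), 100, 2_000));
  }
}
```

In this sketch a busy channel produces one "gracefully" warning and then terminates on the forceful path, which matches the pattern seen in the Freon logs: a "gracefully" line with no "forcefully" line after it.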
h3. Impact
Log noise and slower shutdown under high client concurrency. No functional
failure, no descriptor leak. Likely affects any high-concurrency FileSystem
consumer, not just Freon.
h3. Open question for triage
Whether this warrants a code change (tuning the Ratis client shutdown grace
period, or bounding/parallelizing client shutdown) or should be treated as
expected behaviour and handled via log level.
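To make the "bounding client shutdown" option concrete, here is one possible shape for it: a process-wide gate that limits how many clients may be closing at once, so that 160 threads do not all start draining channels in the same instant. This is a sketch under stated assumptions, not a proposed patch: {{CloseGateSketch}}, its methods and the limit of 4 are hypothetical, and a sleeping {{AutoCloseable}} stands in for a Ratis client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class CloseGateSketch {
  private final Semaphore permits;
  private final AtomicInteger inFlight = new AtomicInteger();
  private final AtomicInteger peak = new AtomicInteger();

  CloseGateSketch(int maxConcurrentCloses) {
    this.permits = new Semaphore(maxConcurrentCloses);
  }

  // Closes the client, but only while holding one of the limited permits.
  void close(AutoCloseable client) throws Exception {
    permits.acquire();
    try {
      // Record the highest number of simultaneous closes, for observation only.
      peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
      client.close();
    } finally {
      inFlight.decrementAndGet();
      permits.release();
    }
  }

  int peakConcurrency() {
    return peak.get();
  }

  // Hypothetical driver: many threads close their own "client" through one shared gate.
  static int demo(int threads, int limit, long closeMillis) {
    CloseGateSketch gate = new CloseGateSketch(limit);
    List<Thread> workers = new ArrayList<>();
    for (int i = 0; i < threads; i++) {
      Thread t = new Thread(() -> {
        try {
          gate.close(() -> Thread.sleep(closeMillis)); // stand-in for a slow client close
        } catch (Exception e) {
          Thread.currentThread().interrupt();
        }
      });
      workers.add(t);
      t.start();
    }
    for (Thread t : workers) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return gate.peakConcurrency();
  }

  public static void main(String[] args) {
    // 16 closers, at most 4 closing at any moment.
    System.out.println("peak concurrent closes: " + demo(16, 4, 20));
  }
}
```

The trade-off is that a gate like this trades log noise for a longer total shutdown time, so it only helps if the grace-window timeouts are caused by simultaneous closes rather than by slow individual drains.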
Filed separately from HDDS-14474 because the root cause and component differ.
Relates to HDDS-14474.