[
https://issues.apache.org/jira/browse/RATIS-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze updated RATIS-2529:
------------------------------
Attachment: 1466_review.patch
> gRPC worker threads permanently inflate to availableProcessors×2 after
> follower restart catch-up
> ------------------------------------------------------------------------------------------------
>
> Key: RATIS-2529
> URL: https://issues.apache.org/jira/browse/RATIS-2529
> Project: Ratis
> Issue Type: Improvement
> Components: gRPC
> Affects Versions: 3.2.2
> Reporter: Yongzao Dan
> Assignee: Yongzao Dan
> Priority: Critical
> Attachments: 1466_review.patch
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> When a Raft follower node restarts and rejoins the cluster, the catch-up
> phase creates a burst of concurrent gRPC streams (log replication, snapshot
> transfer, voting, etc.). This burst activates nearly all Netty EventLoop
> threads in the EventLoopGroup (default size: availableProcessors * 2). After
> catch-up completes, these threads never exit — they remain in
> EpollEventLoop.run() doing empty epollWait indefinitely, even though only 2–3
> of them carry any real I/O.
>
> In contrast, a follower that has been running since initial cluster startup
> only has 2–6 EventLoop threads active, because connections were established
> gradually over time and only activated a few EventLoops.
> *Environment*
> - Ratis version: 3.2.2
> - Cluster: 5 Raft nodes, 18-core machines (so availableProcessors * 2 = 36)
> - Transport: gRPC (Netty-based, using EpollEventLoopGroup)
> *Steps to Reproduce*
> 1. Start a 5-node Raft cluster. Wait for it to stabilize.
> 2. Observe grpc-default-worker-ELG-* threads on each follower via jstack —
> typically 2–6 threads.
> 3. Kill 2 follower nodes. Wait a few minutes.
> 4. Restart them. They will rejoin the cluster and go through the catch-up
> phase.
> 5. After catch-up is done and the cluster is stable again, observe
> grpc-default-worker-ELG-* threads on the restarted followers.
> *Observed Behavior*
> Thread counts from a production cluster (collected via jstack):
>
> ||Node||Role||ELG Threads||Actively Used||
> |node-3|Follower (never restarted)|6|2|
> |node-4|Follower (never restarted)|6|2|
> |node-0|Follower (restarted)|36|2|
> |node-1|Follower (restarted)|35|2|
> |node-2|Leader|36|majority|
>
> Key evidence — thread creation timestamps (from jstack elapsed times on
> node-0):
> grpc-default-worker-ELG-3-1 elapsed=7192.55s
> grpc-default-worker-ELG-3-2 elapsed=7192.55s
> ...
> grpc-default-worker-ELG-3-36 elapsed=7189.04s ← all 36 created within 3.5
> seconds at startup
> CPU usage confirms most threads are idle (over 7192 seconds of life):
> ELG-3-1 cpu=1487.26ms ← actually processing I/O
> ELG-3-2 cpu=1012.09ms ← actually processing I/O
> ELG-3-3 cpu=0.18ms ← idle (epollWait only)
> ELG-3-4 cpu=0.18ms
> ...
> ELG-3-36 cpu=35.02ms
> Normal followers grew their threads gradually over their entire lifetime (2 →
> 4 → 6), corresponding to individual peer connection events.
> *Expected Behavior*
> After catch-up completes, a restarted follower should have a similar number
> of active ELG threads as a follower that was never restarted (~2–6), rather
> than permanently retaining the full availableProcessors * 2 threads.
> *Root Cause Analysis*
> The issue stems from how Netty EventLoopGroup works combined with Ratis's
> gRPC usage:
> 1. EventLoopGroup is created with a fixed capacity of availableProcessors * 2
> EventLoop objects, but threads are lazily started — an EventLoop's thread
> only begins running when a channel is first registered to it.
> 2. During normal operation, new connections trickle in one at a time, and
> gRPC's round-robin registration only touches a few EventLoops. Most
> EventLoops never get a channel registered, so their threads are never started.
> 3. During restart catch-up, the leader sends a burst of log entries /
> snapshot chunks via GrpcLogAppender, and the restarted node simultaneously
> opens client connections to multiple peers. This burst distributes channel
> registrations across all
> EventLoops, starting all 36 threads at once.
> 4. EventLoop threads never exit. Once started, a Netty EventLoop runs an
> infinite loop (EpollEventLoop.run()), and only terminates when
> EventLoopGroup.shutdownGracefully() is called. Ratis does not rebuild or
> resize the EventLoopGroup after catch-up.
> *Suggestion*
> Expose a configuration key in GrpcConfigKeys to allow users to control the
> EventLoopGroup thread count, rather than relying on the Netty default
> (availableProcessors * 2). For example:
> {code:java}
> // In GrpcConfigKeys.Server
> String WORKER_EVENT_LOOP_THREADS_KEY = PREFIX + ".worker.event-loop.threads";
> int WORKER_EVENT_LOOP_THREADS_DEFAULT = 0; // 0 = use Netty default{code}
> Then in GrpcServicesImpl.Builder.newNettyServerBuilder():
>
> {code:java}
> NettyServerBuilder nettyServerBuilder = NettyServerBuilder.forAddress(address)
> .withChildOption(ChannelOption.SO_REUSEADDR, true)
> .maxInboundMessageSize(messageSizeMax.getSizeInt())
> .flowControlWindow(flowControlWindow.getSizeInt());
> int workerThreads = GrpcConfigKeys.Server.workerEventLoopThreads(properties);
> if (workerThreads > 0) {
> nettyServerBuilder.workerEventLoopGroup(new
> EpollEventLoopGroup(workerThreads));
> }
> {code}
>
> Similarly, the client-side NettyChannelBuilder in
> GrpcServerProtocolClient.buildChannel() and
> GrpcClientProtocolClient.buildChannel() could accept a configurable
> EventLoopGroup.
> This would allow downstream projects (e.g., Apache IoTDB) to cap the thread
> count at a reasonable value (e.g., 4–8) for follower nodes, avoiding the
> permanent thread inflation after restart.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)