[
https://issues.apache.org/jira/browse/RATIS-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yongzao Dan updated RATIS-2529:
-------------------------------
Description:
When a Raft follower node restarts and rejoins the cluster, the catch-up phase
creates a burst of concurrent gRPC streams (log replication, snapshot
transfer, voting, etc.). This burst activates nearly all Netty EventLoop
threads in the EventLoopGroup (default size: availableProcessors * 2). After
catch-up completes, these threads never exit: they remain in
EpollEventLoop.run(), blocked in empty epollWait calls indefinitely, even
though only 2–3 of them carry any real I/O.
In contrast, a follower that has been running since initial cluster startup
only has 2–6 EventLoop threads active, because connections were established
gradually over time and only activated a few EventLoops.
*Environment*
- Ratis version: 3.2.2
- Cluster: 5 Raft nodes, 18-core machines (so availableProcessors * 2 = 36)
- Transport: gRPC (Netty-based, using EpollEventLoopGroup)
*Steps to Reproduce*
1. Start a 5-node Raft cluster. Wait for it to stabilize.
2. Observe the grpc-default-worker-ELG-* threads on each follower via jstack;
typically 2–6 threads (a programmatic alternative is sketched after this list).
3. Kill 2 follower nodes. Wait a few minutes.
4. Restart them. They will rejoin the cluster and go through the catch-up phase.
5. After catch-up is done and the cluster is stable again, observe
grpc-default-worker-ELG-* threads on the restarted followers.
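A programmatic alternative to jstack for steps 2 and 5 is to count the worker
threads from inside the JVM. A minimal sketch (the thread-name prefix is
gRPC's default Netty worker naming):
{code:java}
// Sketch: count live gRPC Netty worker threads by their default name prefix.
public class ElgThreadCounter {
  public static long countElgThreads() {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().startsWith("grpc-default-worker-ELG"))
        .count();
  }

  public static void main(String[] args) {
    System.out.println("ELG threads: " + countElgThreads());
  }
}
{code}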
*Observed Behavior*
Thread counts from a production cluster (collected via jstack):
||Node||Role||ELG Threads||Actively Used||
|node-3|Follower (never restarted)|6|2|
|node-4|Follower (never restarted)|6|2|
|node-0|Follower (restarted)|36|2|
|node-1|Follower (restarted)|35|2|
|node-2|Leader|36|majority|
Key evidence — thread creation timestamps (from jstack elapsed times on node-0):
{noformat}
grpc-default-worker-ELG-3-1   elapsed=7192.55s
grpc-default-worker-ELG-3-2   elapsed=7192.55s
...
grpc-default-worker-ELG-3-36  elapsed=7189.04s  ← all 36 created within 3.5 seconds at startup
{noformat}
CPU usage confirms most threads are idle (over 7192 seconds of life):
{noformat}
ELG-3-1   cpu=1487.26ms  ← actually processing I/O
ELG-3-2   cpu=1012.09ms  ← actually processing I/O
ELG-3-3   cpu=0.18ms     ← idle (epollWait only)
ELG-3-4   cpu=0.18ms
...
ELG-3-36  cpu=35.02ms
{noformat}
Normal followers grew their threads gradually over their entire lifetime (2 → 4
→ 6), corresponding to individual peer connection events.
*Expected Behavior*
After catch-up completes, a restarted follower should have a similar number of
active ELG threads as a follower that was never restarted (~2–6), rather than
permanently retaining the full availableProcessors * 2 threads.
*Root Cause Analysis*
The issue stems from how Netty's EventLoopGroup threading model interacts
with Ratis's gRPC usage:
1. EventLoopGroup is created with a fixed capacity of availableProcessors * 2
EventLoop objects, but threads are lazily started — an EventLoop's thread only
begins running when a channel is first registered to it.
2. During normal operation, new connections trickle in one at a time, and
gRPC's round-robin registration only touches a few EventLoops. Most EventLoops
never get a channel registered, so their threads are never started.
3. During restart catch-up, the leader sends a burst of log entries / snapshot
chunks via GrpcLogAppender, and the restarted node simultaneously opens client
connections to multiple peers. This burst distributes channel registrations
across all
EventLoops, starting all 36 threads at once.
4. EventLoop threads never exit. Once started, a Netty EventLoop runs an
infinite loop (EpollEventLoop.run()) and only terminates when
EventLoopGroup.shutdownGracefully() is called. Ratis does not rebuild or
resize the EventLoopGroup after catch-up. Points 1 and 4 are demonstrated by
the sketch after this list.
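Both the lazy start and the never-exit behavior can be reproduced outside
Ratis with a few lines of Netty (a sketch using NioEventLoopGroup for
portability; the epoll transport behaves the same way):
{code:java}
import io.netty.channel.EventLoop;
import io.netty.channel.nio.NioEventLoopGroup;

public class LazyEventLoopDemo {
  public static void main(String[] args) throws Exception {
    // Capacity is fixed at construction, but no threads are started yet.
    NioEventLoopGroup group = new NioEventLoopGroup(8);
    System.out.println("threads after construction: " + Thread.activeCount());

    // The first task submitted to an EventLoop (e.g. a channel registration)
    // starts its thread; that thread then loops in select/epollWait forever.
    EventLoop loop = group.next();
    loop.submit(() -> { }).sync();
    System.out.println("threads after first use: " + Thread.activeCount());

    // Only an explicit graceful shutdown terminates the started threads.
    group.shutdownGracefully().sync();
  }
}
{code}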
*Suggestion*
Expose a configuration key in GrpcConfigKeys to allow users to control the
EventLoopGroup thread count, rather than relying on the Netty default
(availableProcessors * 2). For example:
{code:java}
// In GrpcConfigKeys.Server
String WORKER_EVENT_LOOP_THREADS_KEY = PREFIX + ".worker.event-loop.threads";
int WORKER_EVENT_LOOP_THREADS_DEFAULT = 0; // 0 = use the Netty default
{code}
Then in GrpcServicesImpl.Builder.newNettyServerBuilder():
{code:java}
NettyServerBuilder nettyServerBuilder = NettyServerBuilder.forAddress(address)
    .withChildOption(ChannelOption.SO_REUSEADDR, true)
    .maxInboundMessageSize(messageSizeMax.getSizeInt())
    .flowControlWindow(flowControlWindow.getSizeInt());

int workerThreads = GrpcConfigKeys.Server.workerEventLoopThreads(properties);
if (workerThreads > 0) {
  // gRPC's NettyServerBuilder expects the boss group, worker group, and
  // channel type to be provided together, so set all three when overriding.
  nettyServerBuilder.bossEventLoopGroup(new EpollEventLoopGroup(1))
      .workerEventLoopGroup(new EpollEventLoopGroup(workerThreads))
      .channelType(EpollServerSocketChannel.class);
}
{code}
Similarly, the client-side NettyChannelBuilder in
GrpcServerProtocolClient.buildChannel() and
GrpcClientProtocolClient.buildChannel() could accept a configurable
EventLoopGroup.
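A sketch of the client-side counterpart (hypothetical: the Client-side
accessor below mirrors the proposed server key, and the channel type must
match the custom group's transport):
{code:java}
// Hypothetical change in GrpcServerProtocolClient.buildChannel() /
// GrpcClientProtocolClient.buildChannel().
NettyChannelBuilder channelBuilder = NettyChannelBuilder.forAddress(address);
int clientThreads = GrpcConfigKeys.Client.workerEventLoopThreads(properties); // hypothetical key
if (clientThreads > 0) {
  channelBuilder.eventLoopGroup(new EpollEventLoopGroup(clientThreads))
      .channelType(EpollSocketChannel.class); // must match the group's transport
}
{code}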
This would allow downstream projects (e.g., Apache IoTDB) to cap the thread
count at a reasonable value (e.g., 4–8) for follower nodes, avoiding the
permanent thread inflation after restart.
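For illustration, a downstream project could then cap the pool via
RaftProperties (assuming the proposed key above is adopted and the standard
RaftProperties setter):
{code:java}
RaftProperties properties = new RaftProperties();
// Cap the gRPC server worker EventLoopGroup at 4 threads.
properties.setInt(GrpcConfigKeys.Server.WORKER_EVENT_LOOP_THREADS_KEY, 4);
{code}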
was:
When a Raft follower node restarts and rejoins the cluster, the catch-up phase
creates a burst of concurrent gRPC streams (log replication, snapshot transfer,
voting, etc.). This burst activates
nearly all Netty EventLoop threads in the EventLoopGroup (default size:
availableProcessors * 2). After catch-up completes, these threads never exit —
they remain in EpollEventLoop.run() doing empty
epollWait indefinitely, even though only 2–3 of them carry any real I/O.
In contrast, a follower that has been running since initial cluster startup
only has 2–6 EventLoop threads active, because connections were established
gradually over time and only activated a few
EventLoops.
Environment
- Ratis version: 3.2.2
- Cluster: 5 Raft nodes, 18-core machines (so availableProcessors * 2 = 36)
- Transport: gRPC (Netty-based, using EpollEventLoopGroup)
Steps to Reproduce
1. Start a 5-node Raft cluster. Wait for it to stabilize.
2. Observe grpc-default-worker-ELG-* threads on each follower via jstack —
typically 2–6 threads.
3. Kill 2 follower nodes. Wait a few minutes.
4. Restart them. They will rejoin the cluster and go through the catch-up
phase.
5. After catch-up is done and the cluster is stable again, observe
grpc-default-worker-ELG-* threads on the restarted followers.
Observed Behavior
Thread counts from a production cluster (collected via jstack):
┌────────┬────────────────────────────┬─────────────┬───────────────┐
│ Node │ Role │ ELG Threads │ Actively Used │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-3 │ Follower (never restarted) │ 6 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-4 │ Follower (never restarted) │ 6 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-0 │ Follower (restarted) │ 36 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-1 │ Follower (restarted) │ 35 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-2 │ Leader │ 36 │ majority │
└────────┴────────────────────────────┴─────────────┴───────────────┘
Key evidence — thread creation timestamps (from jstack elapsed times on
node-0):
grpc-default-worker-ELG-3-1 elapsed=7192.55s
grpc-default-worker-ELG-3-2 elapsed=7192.55s
...
grpc-default-worker-ELG-3-36 elapsed=7189.04s ← all 36 created within 3.5
seconds at startup
CPU usage confirms most threads are idle (over 7192 seconds of life):
ELG-3-1 cpu=1487.26ms ← actually processing I/O
ELG-3-2 cpu=1012.09ms ← actually processing I/O
ELG-3-3 cpu=0.18ms ← idle (epollWait only)
ELG-3-4 cpu=0.18ms
...
ELG-3-36 cpu=35.02ms
Normal followers grew their threads gradually over their entire lifetime (2 →
4 → 6), corresponding to individual peer connection events.
Expected Behavior
After catch-up completes, a restarted follower should have a similar number
of active ELG threads as a follower that was never restarted (~2–6), rather
than permanently retaining the full availableProcessors * 2 threads.
Root Cause Analysis
The issue stems from how Netty EventLoopGroup works combined with Ratis's
gRPC usage:
1. EventLoopGroup is created with a fixed capacity of availableProcessors * 2
EventLoop objects, but threads are lazily started — an EventLoop's thread only
begins running when a channel is first registered to it.
2. During normal operation, new connections trickle in one at a time, and
gRPC's round-robin registration only touches a few EventLoops. Most EventLoops
never get a channel registered, so their threads are never started.
3. During restart catch-up, the leader sends a burst of log entries /
snapshot chunks via GrpcLogAppender, and the restarted node simultaneously
opens client connections to multiple peers. This burst distributes channel
registrations across all
EventLoops, starting all 36 threads at once.
4. EventLoop threads never exit. Once started, a Netty EventLoop runs an
infinite loop (EpollEventLoop.run()), and only terminates when
EventLoopGroup.shutdownGracefully() is called. Ratis does not rebuild or resize
the EventLoopGroup after catch-up.
Suggestion
Expose a configuration key in GrpcConfigKeys to allow users to control the
EventLoopGroup thread count, rather than relying on the Netty default
(availableProcessors * 2). For example:
// In GrpcConfigKeys.Server
String WORKER_EVENT_LOOP_THREADS_KEY = PREFIX + ".worker.event-loop.threads";
int WORKER_EVENT_LOOP_THREADS_DEFAULT = 0; // 0 = use Netty default
Then in GrpcServicesImpl.Builder.newNettyServerBuilder():
NettyServerBuilder nettyServerBuilder = NettyServerBuilder.forAddress(address)
.withChildOption(ChannelOption.SO_REUSEADDR, true)
.maxInboundMessageSize(messageSizeMax.getSizeInt())
.flowControlWindow(flowControlWindow.getSizeInt());
int workerThreads = GrpcConfigKeys.Server.workerEventLoopThreads(properties);
if (workerThreads > 0) {
nettyServerBuilder.workerEventLoopGroup(new
EpollEventLoopGroup(workerThreads));
}
Similarly, the client-side NettyChannelBuilder in
GrpcServerProtocolClient.buildChannel() and
GrpcClientProtocolClient.buildChannel() could accept a configurable
EventLoopGroup.
This would allow downstream projects (e.g., Apache IoTDB) to cap the thread
count at a reasonable value (e.g., 4–8) for follower nodes, avoiding the
permanent thread inflation after restart.
> gRPC worker threads permanently inflate to availableProcessors×2 after
> follower restart catch-up
> ------------------------------------------------------------------------------------------------
>
> Key: RATIS-2529
> URL: https://issues.apache.org/jira/browse/RATIS-2529
> Project: Ratis
> Issue Type: Improvement
> Components: gRPC
> Affects Versions: 3.2.2
> Reporter: Yongzao Dan
> Priority: Critical
>
> When a Raft follower node restarts and rejoins the cluster, the catch-up
> phase creates a burst of concurrent gRPC streams (log replication, snapshot
> transfer, voting, etc.). This burst activates nearly all Netty EventLoop
> threads in the EventLoopGroup (default size: availableProcessors * 2). After
> catch-up completes, these threads never exit — they remain in
> EpollEventLoop.run() doing empty epollWait indefinitely, even though only 2–3
> of them carry any real I/O.
>
> In contrast, a follower that has been running since initial cluster startup
> only has 2–6 EventLoop threads active, because connections were established
> gradually over time and only activated a few EventLoops.
> *Environment*
> - Ratis version: 3.2.2
> - Cluster: 5 Raft nodes, 18-core machines (so availableProcessors * 2 = 36)
> - Transport: gRPC (Netty-based, using EpollEventLoopGroup)
> *Steps to Reproduce*
> 1. Start a 5-node Raft cluster. Wait for it to stabilize.
> 2. Observe grpc-default-worker-ELG-* threads on each follower via jstack —
> typically 2–6 threads.
> 3. Kill 2 follower nodes. Wait a few minutes.
> 4. Restart them. They will rejoin the cluster and go through the catch-up
> phase.
> 5. After catch-up is done and the cluster is stable again, observe
> grpc-default-worker-ELG-* threads on the restarted followers.
> *Observed Behavior*
> Thread counts from a production cluster (collected via jstack):
>
> ||Node||Role||ELG Threads||Actively Used||
> |node-3|Follower (never restarted)|6|2|
> |node-4|Follower (never restarted)|6|2|
> |node-0|Follower (restarted)|36|2|
> |node-1|Follower (restarted)|35|2|
> |node-2|Leader|36|majority|
>
> Key evidence — thread creation timestamps (from jstack elapsed times on
> node-0):
> grpc-default-worker-ELG-3-1 elapsed=7192.55s
> grpc-default-worker-ELG-3-2 elapsed=7192.55s
> ...
> grpc-default-worker-ELG-3-36 elapsed=7189.04s ← all 36 created within 3.5
> seconds at startup
> CPU usage confirms most threads are idle (over 7192 seconds of life):
> ELG-3-1 cpu=1487.26ms ← actually processing I/O
> ELG-3-2 cpu=1012.09ms ← actually processing I/O
> ELG-3-3 cpu=0.18ms ← idle (epollWait only)
> ELG-3-4 cpu=0.18ms
> ...
> ELG-3-36 cpu=35.02ms
> Normal followers grew their threads gradually over their entire lifetime (2 →
> 4 → 6), corresponding to individual peer connection events.
> *Expected Behavior*
> After catch-up completes, a restarted follower should have a similar number
> of active ELG threads as a follower that was never restarted (~2–6), rather
> than permanently retaining the full availableProcessors * 2 threads.
> *Root Cause Analysis*
> The issue stems from how Netty EventLoopGroup works combined with Ratis's
> gRPC usage:
> 1. EventLoopGroup is created with a fixed capacity of availableProcessors * 2
> EventLoop objects, but threads are lazily started — an EventLoop's thread
> only begins running when a channel is first registered to it.
> 2. During normal operation, new connections trickle in one at a time, and
> gRPC's round-robin registration only touches a few EventLoops. Most
> EventLoops never get a channel registered, so their threads are never started.
> 3. During restart catch-up, the leader sends a burst of log entries /
> snapshot chunks via GrpcLogAppender, and the restarted node simultaneously
> opens client connections to multiple peers. This burst distributes channel
> registrations across all
> EventLoops, starting all 36 threads at once.
> 4. EventLoop threads never exit. Once started, a Netty EventLoop runs an
> infinite loop (EpollEventLoop.run()), and only terminates when
> EventLoopGroup.shutdownGracefully() is called. Ratis does not rebuild or
> resize the EventLoopGroup after catch-up.
> *Suggestion*
> Expose a configuration key in GrpcConfigKeys to allow users to control the
> EventLoopGroup thread count, rather than relying on the Netty default
> (availableProcessors * 2). For example:
> {code:java}
> // In GrpcConfigKeys.Server
> String WORKER_EVENT_LOOP_THREADS_KEY = PREFIX + ".worker.event-loop.threads";
> int WORKER_EVENT_LOOP_THREADS_DEFAULT = 0; // 0 = use Netty default{code}
> Then in GrpcServicesImpl.Builder.newNettyServerBuilder():
>
> {code:java}
> NettyServerBuilder nettyServerBuilder = NettyServerBuilder.forAddress(address)
> .withChildOption(ChannelOption.SO_REUSEADDR, true)
> .maxInboundMessageSize(messageSizeMax.getSizeInt())
> .flowControlWindow(flowControlWindow.getSizeInt());
> int workerThreads = GrpcConfigKeys.Server.workerEventLoopThreads(properties);
> if (workerThreads > 0) {
> nettyServerBuilder.workerEventLoopGroup(new
> EpollEventLoopGroup(workerThreads));
> }
> {code}
>
> Similarly, the client-side NettyChannelBuilder in
> GrpcServerProtocolClient.buildChannel() and
> GrpcClientProtocolClient.buildChannel() could accept a configurable
> EventLoopGroup.
> This would allow downstream projects (e.g., Apache IoTDB) to cap the thread
> count at a reasonable value (e.g., 4–8) for follower nodes, avoiding the
> permanent thread inflation after restart.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)