[
https://issues.apache.org/jira/browse/RATIS-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yongzao Dan updated RATIS-2529:
-------------------------------
Description:
When a Raft follower node restarts and rejoins the cluster, the catch-up phase
creates a burst of concurrent gRPC streams (log replication, snapshot
transfer, voting, etc.). This burst activates nearly all Netty EventLoop
threads in the EventLoopGroup (default size: availableProcessors * 2). After
catch-up completes, these threads never exit: they remain in
EpollEventLoop.run(), blocked in empty epollWait calls indefinitely, even
though only 2–3 of them carry any real I/O.
In contrast, a follower that has been running since initial cluster startup
only has 2–6 EventLoop threads active, because connections were established
gradually over time and only activated a few EventLoops.
*Environment*
- Ratis version: 3.2.2
- Cluster: 5 Raft nodes, 18-core machines (so availableProcessors * 2 = 36)
- Transport: gRPC (Netty-based, using EpollEventLoopGroup)
*Steps to Reproduce*
1. Start a 5-node Raft cluster. Wait for it to stabilize.
2. Observe the grpc-default-worker-ELG-* threads on each follower via jstack;
typically 2–6 threads (a programmatic alternative is sketched after this list).
3. Kill 2 follower nodes. Wait a few minutes.
4. Restart them. They will rejoin the cluster and go through the catch-up phase.
5. After catch-up is done and the cluster is stable again, observe
grpc-default-worker-ELG-* threads on the restarted followers.
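A programmatic alternative to jstack for steps 2 and 5 is to count the worker
threads from inside the JVM. A minimal sketch (the thread-name prefix is
gRPC's default Netty worker naming):
{code:java}
// Sketch: count live gRPC Netty worker threads by their default name prefix.
public class ElgThreadCounter {
  public static long countElgThreads() {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().startsWith("grpc-default-worker-ELG"))
        .count();
  }

  public static void main(String[] args) {
    System.out.println("ELG threads: " + countElgThreads());
  }
}
{code}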
*Observed Behavior*
Thread counts from a production cluster (collected via jstack):
||Node||Role||ELG Threads||Actively Used||
|node-3|Follower (never restarted)|6|2|
|node-4|Follower (never restarted)|6|2|
|node-0|Follower (restarted)|36|2|
|node-1|Follower (restarted)|35|2|
|node-2|Leader|36|majority|
Key evidence — thread creation timestamps (from jstack elapsed times on node-0):
{noformat}
grpc-default-worker-ELG-3-1   elapsed=7192.55s
grpc-default-worker-ELG-3-2   elapsed=7192.55s
...
grpc-default-worker-ELG-3-36  elapsed=7189.04s  ← all 36 created within 3.5 seconds at startup
{noformat}
CPU usage confirms most threads are idle (over 7192 seconds of life):
{noformat}
ELG-3-1   cpu=1487.26ms  ← actually processing I/O
ELG-3-2   cpu=1012.09ms  ← actually processing I/O
ELG-3-3   cpu=0.18ms     ← idle (epollWait only)
ELG-3-4   cpu=0.18ms
...
ELG-3-36  cpu=35.02ms
{noformat}
Normal followers grew their threads gradually over their entire lifetime (2 → 4
→ 6), corresponding to individual peer connection events.
*Expected Behavior*
After catch-up completes, a restarted follower should have a similar number of
active ELG threads as a follower that was never restarted (~2–6), rather than
permanently retaining the full availableProcessors * 2 threads.
*Root Cause Analysis*
The issue stems from how Netty's EventLoopGroup threading model interacts
with Ratis's gRPC usage:
1. EventLoopGroup is created with a fixed capacity of availableProcessors * 2
EventLoop objects, but threads are lazily started — an EventLoop's thread only
begins running when a channel is first registered to it.
2. During normal operation, new connections trickle in one at a time, and
gRPC's round-robin registration only touches a few EventLoops. Most EventLoops
never get a channel registered, so their threads are never started.
3. During restart catch-up, the leader sends a burst of log entries / snapshot
chunks via GrpcLogAppender, and the restarted node simultaneously opens client
connections to multiple peers. This burst distributes channel registrations
across all
EventLoops, starting all 36 threads at once.
4. EventLoop threads never exit. Once started, a Netty EventLoop runs an
infinite loop (EpollEventLoop.run()) and only terminates when
EventLoopGroup.shutdownGracefully() is called. Ratis does not rebuild or
resize the EventLoopGroup after catch-up. Points 1 and 4 are demonstrated by
the sketch after this list.
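Both the lazy start and the never-exit behavior can be reproduced outside
Ratis with a few lines of Netty (a sketch using NioEventLoopGroup for
portability; the epoll transport behaves the same way):
{code:java}
import io.netty.channel.EventLoop;
import io.netty.channel.nio.NioEventLoopGroup;

public class LazyEventLoopDemo {
  public static void main(String[] args) throws Exception {
    // Capacity is fixed at construction, but no threads are started yet.
    NioEventLoopGroup group = new NioEventLoopGroup(8);
    System.out.println("threads after construction: " + Thread.activeCount());

    // The first task submitted to an EventLoop (e.g. a channel registration)
    // starts its thread; that thread then loops in select/epollWait forever.
    EventLoop loop = group.next();
    loop.submit(() -> { }).sync();
    System.out.println("threads after first use: " + Thread.activeCount());

    // Only an explicit graceful shutdown terminates the started threads.
    group.shutdownGracefully().sync();
  }
}
{code}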
*Suggestion*
Expose a configuration key in GrpcConfigKeys to allow users to control the
EventLoopGroup thread count, rather than relying on the Netty default
(availableProcessors * 2). For example:
{code:java}
// In GrpcConfigKeys.Server
String WORKER_EVENT_LOOP_THREADS_KEY = PREFIX + ".worker.event-loop.threads";
int WORKER_EVENT_LOOP_THREADS_DEFAULT = 0; // 0 = use the Netty default
{code}
Then in GrpcServicesImpl.Builder.newNettyServerBuilder():
{code:java}
NettyServerBuilder nettyServerBuilder = NettyServerBuilder.forAddress(address)
    .withChildOption(ChannelOption.SO_REUSEADDR, true)
    .maxInboundMessageSize(messageSizeMax.getSizeInt())
    .flowControlWindow(flowControlWindow.getSizeInt());

int workerThreads = GrpcConfigKeys.Server.workerEventLoopThreads(properties);
if (workerThreads > 0) {
  // gRPC's NettyServerBuilder expects the boss group, worker group, and
  // channel type to be provided together, so set all three when overriding.
  nettyServerBuilder.bossEventLoopGroup(new EpollEventLoopGroup(1))
      .workerEventLoopGroup(new EpollEventLoopGroup(workerThreads))
      .channelType(EpollServerSocketChannel.class);
}
{code}
Similarly, the client-side NettyChannelBuilder in
GrpcServerProtocolClient.buildChannel() and
GrpcClientProtocolClient.buildChannel() could accept a configurable
EventLoopGroup.
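A sketch of the client-side counterpart (hypothetical: the Client-side
accessor below mirrors the proposed server key, and the channel type must
match the custom group's transport):
{code:java}
// Hypothetical change in GrpcServerProtocolClient.buildChannel() /
// GrpcClientProtocolClient.buildChannel().
NettyChannelBuilder channelBuilder = NettyChannelBuilder.forAddress(address);
int clientThreads = GrpcConfigKeys.Client.workerEventLoopThreads(properties); // hypothetical key
if (clientThreads > 0) {
  channelBuilder.eventLoopGroup(new EpollEventLoopGroup(clientThreads))
      .channelType(EpollSocketChannel.class); // must match the group's transport
}
{code}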
This would allow downstream projects (e.g., Apache IoTDB) to cap the thread
count at a reasonable value (e.g., 4–8) for follower nodes, avoiding the
permanent thread inflation after restart.
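For illustration, a downstream project could then cap the pool via
RaftProperties (assuming the proposed key above is adopted and the standard
RaftProperties setter):
{code:java}
RaftProperties properties = new RaftProperties();
// Cap the gRPC server worker EventLoopGroup at 4 threads.
properties.setInt(GrpcConfigKeys.Server.WORKER_EVENT_LOOP_THREADS_KEY, 4);
{code}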
was:
When a Raft follower node restarts and rejoins the cluster, the catch-up phase
creates a burst of concurrent gRPC streams (log replication, snapshot transfer,
voting, etc.). This burst activates
nearly all Netty EventLoop threads in the EventLoopGroup (default size:
availableProcessors * 2). After catch-up completes, these threads never exit —
they remain in EpollEventLoop.run() doing empty
epollWait indefinitely, even though only 2–3 of them carry any real I/O.
In contrast, a follower that has been running since initial cluster startup
only has 2–6 EventLoop threads active, because connections were established
gradually over time and only activated a few
EventLoops.
Environment
- Ratis version: 3.2.2
- Cluster: 5 Raft nodes, 18-core machines (so availableProcessors * 2 = 36)
- Transport: gRPC (Netty-based, using EpollEventLoopGroup)
Steps to Reproduce
1. Start a 5-node Raft cluster. Wait for it to stabilize.
2. Observe grpc-default-worker-ELG-* threads on each follower via jstack —
typically 2–6 threads.
3. Kill 2 follower nodes. Wait a few minutes.
4. Restart them. They will rejoin the cluster and go through the catch-up
phase.
5. After catch-up is done and the cluster is stable again, observe
grpc-default-worker-ELG-* threads on the restarted followers.
Observed Behavior
Thread counts from a production cluster (collected via jstack):
┌────────┬────────────────────────────┬─────────────┬───────────────┐
│ Node │ Role │ ELG Threads │ Actively Used │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-3 │ Follower (never restarted) │ 6 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-4 │ Follower (never restarted) │ 6 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-0 │ Follower (restarted) │ 36 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-1 │ Follower (restarted) │ 35 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-2 │ Leader │ 36 │ majority │
└────────┴────────────────────────────┴─────────────┴───────────────┘
Key evidence — thread creation timestamps (from jstack elapsed times on
node-0):
grpc-default-worker-ELG-3-1 elapsed=7192.55s
grpc-default-worker-ELG-3-2 elapsed=7192.55s
...
grpc-default-worker-ELG-3-36 elapsed=7189.04s ← all 36 created within 3.5
seconds at startup
CPU usage confirms most threads are idle (over 7192 seconds of life):
ELG-3-1 cpu=1487.26ms ← actually processing I/O
ELG-3-2 cpu=1012.09ms ← actually processing I/O
ELG-3-3 cpu=0.18ms ← idle (epollWait only)
ELG-3-4 cpu=0.18ms
...
ELG-3-36 cpu=35.02ms
Normal followers grew their threads gradually over their entire lifetime (2 →
4 → 6), corresponding to individual peer connection events.
Expected Behavior
After catch-up completes, a restarted follower should have a similar number
of active ELG threads as a follower that was never restarted (~2–6), rather
than permanently retaining the full availableProcessors * 2 threads.
Root Cause Analysis
The issue stems from how Netty EventLoopGroup works combined with Ratis's
gRPC usage:
1. EventLoopGroup is created with a fixed capacity of availableProcessors * 2
EventLoop objects, but threads are lazily started — an EventLoop's thread only
begins running when a channel is first registered to it.
2. During normal operation, new connections trickle in one at a time, and
gRPC's round-robin registration only touches a few EventLoops. Most EventLoops
never get a channel registered, so their threads are never started.
3. During restart catch-up, the leader sends a burst of log entries /
snapshot chunks via GrpcLogAppender, and the restarted node simultaneously
opens client connections to multiple peers. This burst distributes channel
registrations across all
EventLoops, starting all 36 threads at once.
4. EventLoop threads never exit. Once started, a Netty EventLoop runs an
infinite loop (EpollEventLoop.run()), and only terminates when
EventLoopGroup.shutdownGracefully() is called. Ratis does not rebuild or resize
the EventLoopGroup after catch-up.
Suggestion
Expose a configuration key in GrpcConfigKeys to allow users to control the
EventLoopGroup thread count, rather than relying on the Netty default
(availableProcessors * 2). For example:
// In GrpcConfigKeys.Server
String WORKER_EVENT_LOOP_THREADS_KEY = PREFIX + ".worker.event-loop.threads";
int WORKER_EVENT_LOOP_THREADS_DEFAULT = 0; // 0 = use Netty default
Then in GrpcServicesImpl.Builder.newNettyServerBuilder():
NettyServerBuilder nettyServerBuilder = NettyServerBuilder.forAddress(address)
.withChildOption(ChannelOption.SO_REUSEADDR, true)
.maxInboundMessageSize(messageSizeMax.getSizeInt())
.flowControlWindow(flowControlWindow.getSizeInt());
int workerThreads = GrpcConfigKeys.Server.workerEventLoopThreads(properties);
if (workerThreads > 0) {
nettyServerBuilder.workerEventLoopGroup(new
EpollEventLoopGroup(workerThreads));
}
Similarly, the client-side NettyChannelBuilder in
GrpcServerProtocolClient.buildChannel() and
GrpcClientProtocolClient.buildChannel() could accept a configurable
EventLoopGroup.
This would allow downstream projects (e.g., Apache IoTDB) to cap the thread
count at a reasonable value (e.g., 4–8) for follower nodes, avoiding the
permanent thread inflation after restart.
> gRPC worker threads permanently inflate to availableProcessors×2 after
> follower restart catch-up
> ------------------------------------------------------------------------------------------------
>
> Key: RATIS-2529
> URL: https://issues.apache.org/jira/browse/RATIS-2529
> Project: Ratis
> Issue Type: Improvement
> Components: gRPC
> Affects Versions: 3.2.2
> Reporter: Yongzao Dan
> Priority: Critical
>
> When a Raft follower node restarts and rejoins the cluster, the catch-up
> phase creates a burst of concurrent gRPC streams (log replication, snapshot
> transfer, voting, etc.). This burst activates nearly all Netty EventLoop
> threads in the EventLoopGroup (default size: availableProcessors * 2). After
> catch-up completes, these threads never exit — they remain in
> EpollEventLoop.run() doing empty epollWait indefinitely, even though only 2–3
> of them carry any real I/O.
>
> In contrast, a follower that has been running since initial cluster startup
> only has 2–6 EventLoop threads active, because connections were established
> gradually over time and only activated a few EventLoops.
> *Environment*
> - Ratis version: 3.2.2
> - Cluster: 5 Raft nodes, 18-core machines (so availableProcessors * 2 = 36)
> - Transport: gRPC (Netty-based, using EpollEventLoopGroup)
> *Steps to Reproduce*
> 1. Start a 5-node Raft cluster. Wait for it to stabilize.
> 2. Observe grpc-default-worker-ELG-* threads on each follower via jstack —
> typically 2–6 threads.
> 3. Kill 2 follower nodes. Wait a few minutes.
> 4. Restart them. They will rejoin the cluster and go through the catch-up
> phase.
> 5. After catch-up is done and the cluster is stable again, observe
> grpc-default-worker-ELG-* threads on the restarted followers.
> *Observed Behavior*
> Thread counts from a production cluster (collected via jstack):
>
> ||Node||Role||ELG Threads||Actively Used||
> |node-3|Follower (never restarted)|6|2|
> |node-4|Follower (never restarted)|6|2|
> |node-0|Follower (restarted)|36|2|
> |node-1|Follower (restarted)|35|2|
> |node-2|Leader|36|majority|
>
> Key evidence — thread creation timestamps (from jstack elapsed times on
> node-0):
> grpc-default-worker-ELG-3-1 elapsed=7192.55s
> grpc-default-worker-ELG-3-2 elapsed=7192.55s
> ...
> grpc-default-worker-ELG-3-36 elapsed=7189.04s ← all 36 created within 3.5
> seconds at startup
> CPU usage confirms most threads are idle (over 7192 seconds of life):
> ELG-3-1 cpu=1487.26ms ← actually processing I/O
> ELG-3-2 cpu=1012.09ms ← actually processing I/O
> ELG-3-3 cpu=0.18ms ← idle (epollWait only)
> ELG-3-4 cpu=0.18ms
> ...
> ELG-3-36 cpu=35.02ms
> Normal followers grew their threads gradually over their entire lifetime (2 →
> 4 → 6), corresponding to individual peer connection events.
> *Expected Behavior*
> After catch-up completes, a restarted follower should have a similar number
> of active ELG threads as a follower that was never restarted (~2–6), rather
> than permanently retaining the full availableProcessors * 2 threads.
> *Root Cause Analysis*
> The issue stems from how Netty EventLoopGroup works combined with Ratis's
> gRPC usage:
> 1. EventLoopGroup is created with a fixed capacity of availableProcessors * 2
> EventLoop objects, but threads are lazily started — an EventLoop's thread
> only begins running when a channel is first registered to it.
> 2. During normal operation, new connections trickle in one at a time, and
> gRPC's round-robin registration only touches a few EventLoops. Most
> EventLoops never get a channel registered, so their threads are never started.
> 3. During restart catch-up, the leader sends a burst of log entries /
> snapshot chunks via GrpcLogAppender, and the restarted node simultaneously
> opens client connections to multiple peers. This burst distributes channel
> registrations across all
> EventLoops, starting all 36 threads at once.
> 4. EventLoop threads never exit. Once started, a Netty EventLoop runs an
> infinite loop (EpollEventLoop.run()), and only terminates when
> EventLoopGroup.shutdownGracefully() is called. Ratis does not rebuild or
> resize the EventLoopGroup after catch-up.
> *Suggestion*
> Expose a configuration key in GrpcConfigKeys to allow users to control the
> EventLoopGroup thread count, rather than relying on the Netty default
> (availableProcessors * 2). For example:
> {code:java}
> // In GrpcConfigKeys.Server
> String WORKER_EVENT_LOOP_THREADS_KEY = PREFIX + ".worker.event-loop.threads";
> int WORKER_EVENT_LOOP_THREADS_DEFAULT = 0; // 0 = use Netty default{code}
> Then in GrpcServicesImpl.Builder.newNettyServerBuilder():
>
> {code:java}
> NettyServerBuilder nettyServerBuilder = NettyServerBuilder.forAddress(address)
> .withChildOption(ChannelOption.SO_REUSEADDR, true)
> .maxInboundMessageSize(messageSizeMax.getSizeInt())
> .flowControlWindow(flowControlWindow.getSizeInt());
> int workerThreads = GrpcConfigKeys.Server.workerEventLoopThreads(properties);
> if (workerThreads > 0) {
> nettyServerBuilder.workerEventLoopGroup(new
> EpollEventLoopGroup(workerThreads));
> }
> {code}
>
> Similarly, the client-side NettyChannelBuilder in
> GrpcServerProtocolClient.buildChannel() and
> GrpcClientProtocolClient.buildChannel() could accept a configurable
> EventLoopGroup.
> This would allow downstream projects (e.g., Apache IoTDB) to cap the thread
> count at a reasonable value (e.g., 4–8) for follower nodes, avoiding the
> permanent thread inflation after restart.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)