Tian Jiang created RATIS-2314:
---------------------------------

             Summary: Deadlock when appending large batches of entries
                 Key: RATIS-2314
                 URL: https://issues.apache.org/jira/browse/RATIS-2314
             Project: Ratis
          Issue Type: Bug
          Components: server
            Reporter: Tian Jiang
         Attachments: image-2025-06-30-13-25-54-574.png, 
image-2025-06-30-13-33-13-766.png, image-2025-06-30-13-35-09-832.png, 
image-2025-06-30-13-37-28-743.png

Greetings, I found a deadlock while using Ratis, as shown in the following 
stack trace:

!image-2025-06-30-13-25-54-574.png!

The problem is that Ratis uses CompletableFuture to serialize log appends. The 
timeline is as follows:
 # Thread "SegmentedRaftLogWorker" is working on some logs;
 # Thread "SomeClient" tries to append 5000 entries to the log;
 # Because "SegmentedRaftLogWorker" is working, "SomeClient" adds the entries 
to the CompletableFuture; !image-2025-06-30-13-33-13-766.png!
 # When "SegmentedRaftLogWorker" is done with its last task, it checks the 
CompletableFuture for the next task; !image-2025-06-30-13-35-09-832.png!
 # Of course, there is one from "SomeClient", ordering it to append the 5000 
entries; therefore "SegmentedRaftLogWorker" starts appending the entries 
itself, and it reaches:  !image-2025-06-30-13-37-28-743.png!
 # However, the queue of SegmentedRaftLogWorker has a maximum capacity of 4096, 
and there are 5000 entries to append, so Thread "SegmentedRaftLogWorker" blocks, 
waiting for the queue to have free space;
 # But who is supposed to consume from the queue? Thread 
"SegmentedRaftLogWorker" itself! As a result, the wait never ends.
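
The timeline above can be reproduced in miniature with plain JDK classes. The 
sketch below is not Ratis code: the class name SelfDeadlockSketch, the method 
appendBatchOnWorker and the timeout are hypothetical, and only the numbers 4096 
and 5000 come from this report. A single-threaded executor stands in for 
"SegmentedRaftLogWorker", and a CompletableFuture runs the append on that 
thread. Real code would presumably block forever; here offer() with a timeout 
is used instead so the demonstration terminates.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SelfDeadlockSketch {
    static final int QUEUE_CAPACITY = 4096; // queue capacity from the report
    static final int BATCH_SIZE = 5000;     // batch size from the report

    /**
     * Runs the append on the worker thread, which is also the only thread
     * that would ever drain the queue. Returns how many entries were
     * enqueued before the producer stalled on a full queue.
     */
    static int appendBatchOnWorker(int capacity, int batchSize, long timeoutMs)
            throws Exception {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        // Stand-in for the single "SegmentedRaftLogWorker" thread.
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Integer> appended = CompletableFuture.supplyAsync(() -> {
                int enqueued = 0;
                try {
                    for (int i = 0; i < batchSize; i++) {
                        // A blocking put() would wait forever once the queue is
                        // full, because the consumer is this very thread.
                        // The timeout only exists so this sketch can finish.
                        if (!queue.offer(i, timeoutMs, TimeUnit.MILLISECONDS)) {
                            break;
                        }
                        enqueued++;
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return enqueued;
            }, worker);
            return appended.get();
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        int enqueued = appendBatchOnWorker(QUEUE_CAPACITY, BATCH_SIZE, 200);
        System.out.println("enqueued " + enqueued + " of " + BATCH_SIZE
                + " entries before stalling");
    }
}
```

With the values from this report, the producer stalls after 4096 entries, which 
is exactly the queue capacity. A batch that fits into the queue completes 
normally, so the problem only shows up with large batches.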

I notice that Ratis uses many thread-stealing tricks to improve thread 
locality, which is appreciated. However, in this scenario, the technique seems 
to create a deadlock.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
