Tian Jiang created RATIS-2314:
---------------------------------
Summary: Deadlock when appending large batches of entries
Key: RATIS-2314
URL: https://issues.apache.org/jira/browse/RATIS-2314
Project: Ratis
Issue Type: Bug
Components: server
Reporter: Tian Jiang
Attachments: image-2025-06-30-13-25-54-574.png,
image-2025-06-30-13-33-13-766.png, image-2025-06-30-13-35-09-832.png,
image-2025-06-30-13-37-28-743.png
Greetings, I found a deadlock while using Ratis, shown in the following stack
trace:
!image-2025-06-30-13-25-54-574.png!
The problem is that Ratis uses CompletableFuture to serialize log appends.
The timeline is as follows:
# Thread "SegmentedRaftLogWorker" is working on some logs;
# Thread "SomeClient" tries to append 5000 entries to the log;
# Because "SegmentedRaftLogWorker" is busy, "SomeClient" adds its append task
to the CompletableFuture chain; !image-2025-06-30-13-33-13-766.png!
# When "SegmentedRaftLogWorker" is done with its last task, it checks the
CompletableFuture for the next task; !image-2025-06-30-13-35-09-832.png!
# There is one, from "SomeClient", asking it to append the 5000 entries;
"SegmentedRaftLogWorker" therefore goes on to append the entries itself and
reaches: !image-2025-06-30-13-37-28-743.png!
# However, the queue of SegmentedRaftLogWorker has a maximum capacity of 4096
and there are 5000 entries to append, so thread "SegmentedRaftLogWorker"
blocks, waiting for the queue to have free space;
# But who is supposed to consume from the queue? Thread
"SegmentedRaftLogWorker" itself! As a result, the wait never ends.
I notice that Ratis uses many thread-stealing tricks to improve thread
locality, which is appreciated. However, in this scenario the technique seems
to create a deadlock.
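
The timeline above can be reproduced in miniature with plain JDK classes. The
sketch below is not Ratis code: the class, method, and thread names
(SelfDeadlockSketch, appendViaChainedFuture, "log-worker") are hypothetical, and
a plain ArrayBlockingQueue stands in for the worker's queue. It relies on the
documented CompletableFuture behaviour that a dependent registered with a
non-async method such as thenApply may run on the thread that completes the
future. A timed offer() replaces the blocking wait so the example terminates
instead of hanging.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the scenario in this report; not Ratis code.
final class SelfDeadlockSketch {

  /** Which thread ran the chained append, and how many entries fit before it stalled. */
  static final class Outcome {
    final String thread;
    final int accepted;

    Outcome(String thread, int accepted) {
      this.thread = thread;
      this.accepted = accepted;
    }
  }

  static Outcome appendViaChainedFuture(int capacity, int batchSize, long timeoutMs) {
    // Bounded queue that only the worker thread is supposed to drain.
    BlockingQueue<Integer> workerQueue = new ArrayBlockingQueue<>(capacity);
    // The task the worker is still busy with (step 1 of the timeline).
    CompletableFuture<Void> previousTask = new CompletableFuture<>();

    // The client chains its append behind the unfinished task (steps 2-3).
    CompletableFuture<Outcome> append = previousTask.thenApply(ignored -> {
      int accepted = 0;
      try {
        for (int i = 0; i < batchSize; i++) {
          // A plain put() would block forever once the queue is full;
          // the timeout is only here so the sketch can report and exit.
          if (!workerQueue.offer(i, timeoutMs, TimeUnit.MILLISECONDS)) {
            break;
          }
          accepted++;
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      return new Outcome(Thread.currentThread().getName(), accepted);
    });

    // The worker finishes its task (step 4). Completing the future runs the
    // chained append on this same thread, so it cannot return to draining
    // the queue while the append is waiting for space (steps 5-7).
    Thread worker = new Thread(() -> previousTask.complete(null), "log-worker");
    worker.start();
    return append.join();
  }

  public static void main(String[] args) {
    Outcome o = appendViaChainedFuture(4096, 5000, 200);
    // prints "accepted 4096 of 5000 on thread log-worker"
    System.out.println("accepted " + o.accepted + " of 5000 on thread " + o.thread);
  }
}
```

With the capacity and batch size from this report, the append stalls after
4096 entries on the very thread that should be consuming them. Any batch
larger than the queue capacity would hit the same wall in this sketch.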
--
This message was sent by Atlassian Jira
(v8.20.10#820010)