[
https://issues.apache.org/jira/browse/RATIS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17987179#comment-17987179
]
Tian Jiang commented on RATIS-2314:
-----------------------------------
[~szetszwo] Great, it worked. Looking forward to seeing the fix in the next
release.
> Deadlock when appending large batches of entries
> ------------------------------------------------
>
> Key: RATIS-2314
> URL: https://issues.apache.org/jira/browse/RATIS-2314
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: Tian Jiang
> Priority: Major
> Attachments: image-2025-06-30-13-25-54-574.png,
> image-2025-06-30-13-33-13-766.png, image-2025-06-30-13-35-09-832.png,
> image-2025-06-30-13-37-28-743.png, image-2025-07-01-10-43-50-136.png,
> image-2025-07-01-10-44-53-302.png, image-2025-07-01-10-46-04-453.png,
> image-2025-07-01-10-47-16-439.png, image-2025-07-01-10-47-20-566.png,
> image-2025-07-01-10-47-54-166.png, image-2025-07-01-10-48-56-806.png,
> image-2025-07-01-10-52-05-896.png, image-2025-07-01-11-05-10-982.png
>
>
> Greetings, I found a deadlock while using Ratis, as shown in the following
> stack trace:
> !image-2025-06-30-13-25-54-574.png!
> The problem is that Ratis uses CompletableFuture to serialize the process of
> appending logs, and below is the timeline:
> # Thread "SegmentedRaftLogWorker" is working on some logs;
> # Thread "SomeClient" tries to append 5000 entries to the log;
> # Because "SegmentedRaftLogWorker" is busy, "SomeClient" chains its append
> task onto the CompletableFuture; !image-2025-06-30-13-33-13-766.png!
> # When "SegmentedRaftLogWorker" is done with its last task, it checks the
> CompletableFuture for the next task; !image-2025-06-30-13-35-09-832.png!
> # There is one, from "SomeClient", asking it to append the 5000 entries;
> therefore "SegmentedRaftLogWorker" itself starts appending the entries, and
> it reaches: !image-2025-06-30-13-37-28-743.png!
> # However, the queue of SegmentedRaftLogWorker has a max capacity of 4096,
> and there are 5000 entries to be appended, so Thread "SegmentedRaftLogWorker"
> blocks, waiting for the queue to have free space;
> # But who is supposed to consume from the queue? Thread
> "SegmentedRaftLogWorker" itself! As a result, the wait never ends.
> I notice that Ratis uses many thread-stealing tricks to improve thread
> locality, which is appreciated. However, in this scenario the technique
> seems to create a deadlock.
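The timeline in the quoted description can be reproduced in miniature with
plain JDK classes. The sketch below is not Ratis code: the class, the thread
name, and the queue are hypothetical stand-ins, and only the numbers 4096 and
5000 come from the report. It also assumes a timed offer() in place of the
blocking wait, so the program terminates instead of hanging.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class SelfDeadlockSketch {
    static final int QUEUE_CAPACITY = 4096; // capacity quoted in the report
    static final int BATCH_SIZE = 5000;     // batch size quoted in the report

    /** Which thread ran the chained callback, and how many entries fit. */
    static final class Result {
        final String callbackThread;
        final int accepted;

        Result(String callbackThread, int accepted) {
            this.callbackThread = callbackThread;
            this.accepted = accepted;
        }
    }

    static Result run() {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(QUEUE_CAPACITY);
        CompletableFuture<Void> previousTask = new CompletableFuture<>();
        String[] callbackThread = new String[1];
        int[] accepted = new int[1];

        // "SomeClient": the previous task has not finished yet, so the append
        // is chained onto its future instead of running right away.
        previousTask.thenRun(() -> {
            callbackThread[0] = Thread.currentThread().getName();
            for (int i = 0; i < BATCH_SIZE; i++) {
                try {
                    // A blocking put() would wait forever once the queue is full,
                    // because the only consumer is the thread running this loop.
                    // The timed offer() is an assumption that lets the demo end.
                    if (!queue.offer(i, 200, TimeUnit.MILLISECONDS)) {
                        break;
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
                accepted[0]++;
            }
        });

        // Hypothetical worker: the sole consumer of the queue. Completing the
        // future runs the chained callback on this same thread, before the
        // drain loop below is ever reached.
        Thread worker = new Thread(() -> {
            previousTask.complete(null);
            while (queue.poll() != null) {
                // drain whatever was enqueued
            }
        }, "worker-sketch");
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new Result(callbackThread[0], accepted[0]);
    }

    public static void main(String[] args) {
        Result r = run();
        System.out.println("callback ran on: " + r.callbackThread);
        System.out.println("entries accepted before stalling: " + r.accepted);
    }
}
```

In this sketch the callback runs on the worker thread and stalls after 4096
of the 5000 entries, since nothing drains the queue while its only consumer
is busy producing. With a blocking put() in place of the timed offer(), the
thread would wait indefinitely, which matches the hang described above.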