[
https://issues.apache.org/jira/browse/RATIS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17987040#comment-17987040
]
Tsz-wo Sze commented on RATIS-2314:
-----------------------------------
[~jt2594838], thanks for filing this JIRA!
Sorry, I don't see how the deadlock can occur: queue.offer(..) is run by an
appendEntries thread and queue.poll(..) is run by the SegmentedRaftLogWorker
thread, so there should be no deadlock.
Could you show the details of the deadlock, i.e. for each thread involved in the
deadlock, which locks it is holding and which locks it is waiting for?
bq. ... the queue of SegmentedRaftLogWorker has a max capacity of 4096, and
there are 5000 entries ...
SegmentedRaftLog.appendImpl(List<LogEntryProto> entries) calls
appendEntry(LogEntryProto entry, TransactionContext context) once for each
entry, as shown below:
{code:java}
// SegmentedRaftLog.appendImpl(List<LogEntryProto> entries)
try(AutoCloseableLock writeLock = writeLock()) {
  ...
  for (int i = index; i < entries.size(); i++) {
    final LogEntryProto entry = entries.get(i);
    futures.add(appendEntry(entry, server.getTransactionContext(entry, true)));
  }
  ...
}
{code}
Even if there are 5000 entries, they are submitted one at a time. I see that
appendImpl holds the writeLock of SegmentedRaftLog, but the
SegmentedRaftLogWorker does not need any locks from SegmentedRaftLog. So, it
should still be able to poll the queue.
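To make the argument above concrete, here is a minimal sketch, not Ratis code. It assumes a plain ArrayBlockingQueue as a stand-in for the worker queue, and the class name SeparateThreadsDemo and the method run are hypothetical. One thread offers entries one at a time while a different thread polls, so a batch larger than the queue capacity only slows the producer down.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical toy model: the producer and the consumer are different threads.
public class SeparateThreadsDemo {
  /** Offers the entries one at a time while another thread polls; returns the number consumed. */
  static int run(int capacity, int entries) {
    final BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
    final int[] consumed = {0};

    // Stand-in for the worker thread: it only polls and needs no lock of the producer.
    final Thread worker = new Thread(() -> {
      try {
        while (consumed[0] < entries) {
          if (queue.poll(1, TimeUnit.SECONDS) != null) {
            consumed[0]++;
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "worker");
    worker.start();

    try {
      // Stand-in for the appender thread: one offer per entry, waiting while the queue is full.
      for (int i = 0; i < entries; i++) {
        while (!queue.offer(i, 1, TimeUnit.SECONDS)) {
          // Queue is full; retry until the worker frees a slot.
        }
      }
      // join() also makes the worker's final count visible to this thread.
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return consumed[0];
  }

  public static void main(String[] args) {
    System.out.println(run(4096, 5000));
  }
}
```

With a capacity of 4096 and 5000 entries, the producer waits briefly whenever the queue is full and then continues, because the consumer keeps draining on its own thread.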
> Deadlock when appending large batches of entries
> ------------------------------------------------
>
> Key: RATIS-2314
> URL: https://issues.apache.org/jira/browse/RATIS-2314
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: Tian Jiang
> Priority: Major
> Attachments: image-2025-06-30-13-25-54-574.png,
> image-2025-06-30-13-33-13-766.png, image-2025-06-30-13-35-09-832.png,
> image-2025-06-30-13-37-28-743.png
>
>
> Greetings, I found a deadlock while using Ratis, which is shown in the
> following stack:
> !image-2025-06-30-13-25-54-574.png!
> The problem is that Ratis uses CompletableFuture to serialize the process of
> appending logs, and below is the timeline:
> # Thread "SegmentedRaftLogWorker" is working on some logs;
> # Thread "SomeClient" tries to append 5000 entries to the log;
> # Because "SegmentedRaftLogWorker" is working, "SomeClient" adds the entries
> to the CompletableFuture; !image-2025-06-30-13-33-13-766.png!
> # When "SegmentedRaftLogWorker" is done with its last task, it checks the
> CompletableFuture for the next task; !image-2025-06-30-13-35-09-832.png!
> # Of course, there is one from "SomeClient", ordering it to append another
> 5000 entries; therefore "SegmentedRaftLogWorker" continues to append the
> entries, and it comes to: !image-2025-06-30-13-37-28-743.png!
> # However, the queue of SegmentedRaftLogWorker has a max capacity of 4096,
> and there are 5000 entries to be appended, so Thread "SegmentedRaftLogWorker"
> waits for the queue to be not full;
> # But who is supposed to consume from the queue? Thread
> "SegmentedRaftLogWorker" itself! As a result, the wait shall never end.
> I notice that Ratis is using many thread-steal tricks to improve thread
> locality, which is appreciated. However, it seems such a technique, in this
> scenario, creates deadlocks.
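The situation described in the report can be modelled in isolation as follows. This is a rough sketch under stated assumptions, not Ratis code: it assumes the thread that fills the bounded queue is also the only thread that would ever drain it. The names SelfWaitDemo and run are hypothetical, and a timed offer is used instead of an unbounded wait so that the demo terminates.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical toy model: the only consumer thread is busy producing.
public class SelfWaitDemo {
  /** Returns how many entries were accepted before an offer timed out. */
  static int run(int capacity, int entries, long timeoutMillis) {
    final BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
    // The single thread that would normally drain the queue.
    final ExecutorService worker = Executors.newSingleThreadExecutor();
    try {
      return worker.submit(() -> {
        int accepted = 0;
        for (int i = 0; i < entries; i++) {
          // Nobody else polls, so once the queue is full this offer cannot succeed.
          if (!queue.offer(i, timeoutMillis, TimeUnit.MILLISECONDS)) {
            break;
          }
          accepted++;
        }
        return accepted;
      }).get();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      worker.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println(run(4096, 5000, 200));
  }
}
```

With a capacity of 4096 and 5000 entries, the sketch stops after 4096 accepted entries. With an untimed wait the thread would block forever, which is the behaviour the report describes. Whether Ratis can actually reach this state depends on which thread runs the offer.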