[ https://issues.apache.org/jira/browse/RATIS-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17869829#comment-17869829 ]
Xinyu Tan commented on RATIS-2129:
----------------------------------

Hi, the IoTDB team has also encountered similar RaftLog lock bottlenecks in online flame graphs. I believe there are several possible optimizations:
# Completely lock-free reads: in theory, this eliminates contention between reads and writes entirely and gives the best performance. However, I suspect going lock-free might introduce stability issues; we need to test and evaluate this together.
# Minimize the work done while holding the lock: move operations that don't depend on the lock outside of it, such as TermIndex.valueOf and getSerializedSize. This improves the lock's effective throughput.
# Acquire the lock in batches during read and write operations, to amortize the overhead of the locking itself.

Rough sketches of these three ideas follow after the quoted issue below.


> Low replication performance because GrpcLogAppender is often blocked by
> RaftLog's readLock
> ------------------------------------------------------------------------------------------
>
>                 Key: RATIS-2129
>                 URL: https://issues.apache.org/jira/browse/RATIS-2129
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.1.0
>            Reporter: Duong
>            Priority: Blocker
>              Labels: Performance, performance
>         Attachments: Screenshot 2024-07-22 at 4.40.07 PM-1.png, Screenshot 2024-07-22 at 4.40.07 PM.png, dn_echo_leader_profile.html, image-2024-07-22-15-25-46-155.png, ratis_ratfLog_lock_contention.png
>
>
> Today, the GrpcLogAppender thread makes many calls that need RaftLog's readLock. In an active environment, RaftLog is always busy appending transactions from clients, so the writeLock is frequently held. This makes replication performance slow.
> See [^dn_echo_leader_profile.html], or the picture below, where the purple is the time taken to acquire the readLock from RaftLog.
> !image-2024-07-22-15-25-46-155.png|width=854,height=425!
> So far, I'm not sure whether this is a regression from a recent change in 3.1.0/3.0.0 or whether it has always been the case.
> A few early considerations:
> # The rate of RaftLog calls per GrpcLogAppender seems too high. Instead of calling RaftLog multiple times, maybe the log appender could call once to obtain all the required information?
> # Can RaftLog expose this data without requiring a read lock?
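For reference, a minimal sketch of option 1 (lock-free reads), assuming an immutable-snapshot design rather than actual Ratis internals: the writer publishes a snapshot of the frequently-polled fields through an AtomicReference, so the appender's reads never touch the RaftLog lock. LogState and the field names here are hypothetical.

{code:java}
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch only, not Ratis code: publish the frequently-read log state as an
 * immutable snapshot so readers never need the RaftLog lock.
 */
public final class LockFreeLogViewSketch {
  /** Immutable view of the fields the appender polls on every cycle. */
  static final class LogState {
    final long nextIndex;
    final long commitIndex;
    final long lastEntryTerm;

    LogState(long nextIndex, long commitIndex, long lastEntryTerm) {
      this.nextIndex = nextIndex;
      this.commitIndex = commitIndex;
      this.lastEntryTerm = lastEntryTerm;
    }
  }

  private final AtomicReference<LogState> state =
      new AtomicReference<>(new LogState(0, 0, 0));

  /** Writer path: still serialized by the existing writeLock elsewhere;
   *  publishing the new snapshot is a single atomic reference store. */
  void publish(long nextIndex, long commitIndex, long lastEntryTerm) {
    state.set(new LogState(nextIndex, commitIndex, lastEntryTerm));
  }

  /** Reader path: no lock, just a volatile read of the latest snapshot. */
  LogState read() {
    return state.get();
  }
}
{code}

The trade-off is an extra allocation per publish on the write path, and readers may observe a slightly stale snapshot, which is exactly why the stability of this approach would need evaluation.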
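A sketch of option 2 (shrinking the critical section), again simplified rather than the real SegmentedRaftLog code: work that depends only on the immutable entry, like the getSerializedSize and TermIndex.valueOf calls mentioned above, is hoisted out before taking the writeLock, so the lock only guards the actual state mutation. Entry here is a stand-in for LogEntryProto.

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch only: compute everything entry-local before locking. */
class AppendSketch {
  /** Stand-in for LogEntryProto; serializedSize() mimics getSerializedSize(). */
  static final class Entry {
    final long term;
    final long index;
    final byte[] data;

    Entry(long term, long index, byte[] data) {
      this.term = term;
      this.index = index;
      this.data = data;
    }

    int serializedSize() { return data.length + 16; } // illustrative only
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private long totalBytes;
  private long lastIndex;

  void append(Entry entry) {
    // Outside the lock: sizing and term/index extraction depend only on the
    // immutable entry, not on the log's mutable state.
    final int size = entry.serializedSize();
    final long index = entry.index;

    lock.writeLock().lock();
    try {
      // Inside the lock: only the cheap state update remains.
      totalBytes += size;
      lastIndex = index;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}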
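Finally, a sketch combining option 3 with the issue's first consideration: a single method acquires the readLock once and returns everything GrpcLogAppender needs for one append cycle, replacing several separately-locked getters. AppenderView and the field names are hypothetical.

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch only: batch several reads under one lock acquisition. */
class BatchedReadSketch {
  /** Everything the appender needs for one cycle, read atomically. */
  static final class AppenderView {
    final long nextIndex;
    final long commitIndex;
    final long lastEntryTerm;

    AppenderView(long nextIndex, long commitIndex, long lastEntryTerm) {
      this.nextIndex = nextIndex;
      this.commitIndex = commitIndex;
      this.lastEntryTerm = lastEntryTerm;
    }
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private long nextIndex;
  private long commitIndex;
  private long lastEntryTerm;

  /** One lock acquisition replaces several separate locked getters. */
  AppenderView snapshotForAppender() {
    lock.readLock().lock();
    try {
      return new AppenderView(nextIndex, commitIndex, lastEntryTerm);
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}

This keeps the existing locking discipline intact while cutting the number of lock round-trips per appender cycle, so it may be the lowest-risk of the three to try first.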