[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783883#comment-17783883 ]
Hudson commented on HBASE-28184:
--------------------------------

Results for branch branch-2.4
	[build #648 on builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/]: (/) *{color:green}+1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/General_20Nightly_20Build_20Report/]

(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]

(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]

(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(/) {color:green}+1 client integration test{color}

> Tailing the WAL is very slow if there are multiple peers.
> ---------------------------------------------------------
>
>                 Key: HBASE-28184
>                 URL: https://issues.apache.org/jira/browse/HBASE-28184
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.0.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> Noticed in one of our production clusters, which has 4 peers.
> Due to a sudden ingestion of data, the size of the log queue increased to a peak of 506. We have configured the log roll size to 256 MB. Most of the edits in the WAL were from a table for which replication is disabled.
> So all the ReplicationSourceWALReader threads had to do was replay the WAL and NOT replicate the edits. Still, it took 12 hours to drain the queue.
> We took a few jstacks and found that ReplicationSourceWALReader was waiting to acquire rollWriterLock [here|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AbstractFSWAL.java#L1231]:
> {noformat}
> "regionserver/<rs>,1" #1036 daemon prio=5 os_prio=0 tid=0x00007f44b374e800 nid=0xbd7f waiting on condition [0x00007f37b4d19000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00007f3897a3e150> (a java.util.concurrent.locks.ReentrantLock$FairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:837)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:872)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1202)
>         at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:228)
>         at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.getLogFileSizeIfBeingWritten(AbstractFSWAL.java:1102)
>         at org.apache.hadoop.hbase.wal.WALProvider.lambda$null$0(WALProvider.java:128)
>         at org.apache.hadoop.hbase.wal.WALProvider$$Lambda$177/1119730685.apply(Unknown Source)
>         at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>         at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
>         at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
>         at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
>         at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
>         at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>         at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
>         at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>         at java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:536)
>         at org.apache.hadoop.hbase.wal.WALProvider.lambda$getWALFileLengthProvider$2(WALProvider.java:129)
>         at org.apache.hadoop.hbase.wal.WALProvider$$Lambda$140/1246380717.getLogFileSizeIfBeingWritten(Unknown Source)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:260)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:222)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:157)
> {noformat}
> All the peers contend for this lock on every batch read.
> Look at the code snippet below. Guarding this section with rollWriterLock makes sense when we are replicating the active WAL file. But in our case we are NOT replicating the active WAL file, yet we still acquire the lock only to return OptionalLong.empty():
> {noformat}
> /**
>  * if the given {@code path} is being written currently, then return its length.
>  * <p>
>  * This is used by replication to prevent replicating unacked log entries. See
>  * https://issues.apache.org/jira/browse/HBASE-14004 for more details.
>  */
> @Override
> public OptionalLong getLogFileSizeIfBeingWritten(Path path) {
>   rollWriterLock.lock();
>   try {
>     ...
>     ...
>   } finally {
>     rollWriterLock.unlock();
>   }
> {noformat}
> We can check the size of the log queue first: if it is greater than 1, the file being read cannot be the active WAL, so we can return early without acquiring the lock.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
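The proposed early return could look roughly like the sketch below. This is only an illustration of the idea, not the actual HBase patch: the class, the `logQueueSize` and `currentFileLength` fields, and the boolean parameter (standing in for the real `Path` comparison against the active WAL) are all hypothetical simplifications.

```java
import java.util.OptionalLong;
import java.util.concurrent.locks.ReentrantLock;

// Hedged sketch of the optimization: when more than one file sits in the
// replication log queue, the file currently being read is an older, closed
// WAL, so there is no need to touch rollWriterLock at all.
class WalLengthProviderSketch {
    private final ReentrantLock rollWriterLock = new ReentrantLock();
    private volatile int logQueueSize;       // hypothetical: files queued for this source
    private volatile long currentFileLength; // hypothetical: length of the active WAL

    WalLengthProviderSketch(int logQueueSize, long currentFileLength) {
        this.logQueueSize = logQueueSize;
        this.currentFileLength = currentFileLength;
    }

    OptionalLong getLogFileSizeIfBeingWritten(boolean pathIsActiveWal) {
        // Proposed early return: queue depth > 1 means we are reading a
        // closed WAL, so skip the lock that the writer contends on.
        if (logQueueSize > 1) {
            return OptionalLong.empty();
        }
        rollWriterLock.lock();
        try {
            return pathIsActiveWal ? OptionalLong.of(currentFileLength)
                                   : OptionalLong.empty();
        } finally {
            rollWriterLock.unlock();
        }
    }
}
```

With this shape, the four peer readers in the reported scenario would only serialize on rollWriterLock while tailing the single active file, not while draining a 506-file backlog.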