[ https://issues.apache.org/jira/browse/HBASE-23008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HBASE-23008 started by Zheng Wang.
------------------------------------------
> ReplicationSourceShipper has no chance to delete hlog znode when the wal entry batch always empty
> -------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-23008
>                 URL: https://issues.apache.org/jira/browse/HBASE-23008
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.0.0
>            Reporter: Zheng Wang
>            Assignee: Zheng Wang
>            Priority: Major
>
> My live clusters are configured for master-master replication, and only one of them, the active cluster, receives puts.
> Recently I found a great many znodes, at least 10,000, under /hbase/replication/rs/#server/#peer in the backup cluster.
>
> I think the reason is that the WAL entries in the backup cluster are all filtered out by ClusterMarkingEntryFilter, so ReplicationSourceWALReader never puts any data into entryBatchQueue. ReplicationSourceShipper therefore stays blocked at entryReader.take() and has no chance to delete the hlog znodes.
> The thread stacks of walReader and walShipper are below:
> {code:java}
> "main-EventThread.replicationSource,2.replicationSource.hostname%2C16020%2C1567586932902.hostname%2C16020%2C1567586932902.regiongroup-0,2.replicationSource.wal-reader.hostname%2C16020%2C1567586932902.hostname%2C16020%2C1567586932902.regiongroup-0,2" #157238 daemon prio=5 os_prio=0 tid=0x00007f7634be8800 nid=0x377ef waiting on condition [0x00007f6114c0e000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.handleEmptyWALEntryBatch(ReplicationSourceWALReader.java:192)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:142)
>
> "main-EventThread.replicationSource,2.replicationSource.hostname%2C16020%2C1567586932902.hostname%2C16020%2C1567586932902.regiongroup-0,2" #157237 daemon prio=5 os_prio=0 tid=0x00007f76350b0000 nid=0x377ee waiting on condition [0x00007f6108173000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00007f6f99bb6718> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.take(ReplicationSourceWALReader.java:248)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:108)
> {code}
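
The blocking behavior described in the report can be reproduced outside HBase with a plain LinkedBlockingQueue. The sketch below is illustrative only: ShipperBlockingSketch, Batch, observe, runBlockingShipper and runPollingShipper are hypothetical names, the counter merely stands in for deleting old WAL znodes, and the poll-with-timeout variant is one possible way to reach the cleanup step, not necessarily the fix adopted for this issue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ShipperBlockingSketch {

    /** Hypothetical stand-in for a WAL entry batch; not the HBase class. */
    static final class Batch {}

    /** Runs a shipper loop on its own thread for a while, then interrupts it. */
    static void observe(Runnable loop, long millis) {
        Thread shipper = new Thread(loop, "shipper-sketch");
        shipper.setDaemon(true);
        shipper.start();
        try {
            Thread.sleep(millis);
            shipper.interrupt();
            shipper.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("observer interrupted", e);
        }
    }

    /**
     * Loop in the style described in the report: take() parks the thread
     * until a batch arrives, so the cleanup step is never reached while the
     * reader filters out every entry and the queue stays empty.
     */
    static int runBlockingShipper(BlockingQueue<Batch> queue, long observeMillis) {
        AtomicInteger cleanups = new AtomicInteger();
        observe(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    queue.take();               // parks here on an empty queue
                    cleanups.incrementAndGet(); // stands in for znode cleanup
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, observeMillis);
        return cleanups.get();
    }

    /**
     * Assumed alternative: poll() with a timeout returns null when nothing
     * arrives, so the loop still gets a turn to run its housekeeping.
     */
    static int runPollingShipper(BlockingQueue<Batch> queue, long observeMillis,
                                 long pollMillis) {
        AtomicInteger cleanups = new AtomicInteger();
        observe(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // The result may be null: nothing to ship this round.
                    queue.poll(pollMillis, TimeUnit.MILLISECONDS);
                    cleanups.incrementAndGet(); // housekeeping runs either way
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, observeMillis);
        return cleanups.get();
    }

    public static void main(String[] args) {
        System.out.println("take(): cleanups = "
            + runBlockingShipper(new LinkedBlockingQueue<>(), 200));
        System.out.println("poll(): cleanups = "
            + runPollingShipper(new LinkedBlockingQueue<>(), 300, 20));
    }
}
```

With an always-empty queue, the take() variant never reaches its cleanup step, which matches the WAITING (parking) state in the shipper's stack above, while the poll() variant wakes up after each timeout and can do its housekeeping.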