Hello,

We're experimenting with meta table replication, but we've found that the region servers hosting the replicas see extremely high priority RPC handler usage, sometimes reaching 100%, at which point clients start to experience errors. The priority RPC handler usage is much higher than the region servers hosting the non-replicated meta ever experienced. Has anyone seen this behavior? Are we doing something wrong? Our settings changes are:
hbase.regionserver.meta.storefile.refresh.period: 30000
hbase.meta.replica.count: 3
hbase.master.hfilecleaner.ttl: 600000

We're on CDH 5.9.0 (HBase 1.2).

We see occasional errors like the one below (at the time I'm writing this, 30 minutes after the log entry, that file doesn't exist):

java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/REDACTED:11702, remote=REDACTED:50010, for file /hbase/data/hbase/meta/1588230740/info/85c3e4b648eb437a8ef62ff9b39f525d, for pool BP-812715994-REDACTED-1441218207856 block 1083723183_9985162
        at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)

Taking a stack dump of a region server process to see what the priority handlers are up to reveals the following (grouped by identical stack, with occurrence counts):

1 times (0% total cpu)
"RpcServer.Priority.general.handler=12,queue=0,port=60020" #138 daemon prio=5 os_prio=0 tid=0x00007fc5471a3000 nid=0x3f33 waiting on condition [0x00007fc4f5870000] cpu=0.0 start=2019-06-20 19:31:47
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000003d94b6f18> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
        at org.apache.hadoop.hbase.regionserver.HStore.replaceStoreFiles(HStore.java:1348)
        at org.apache.hadoop.hbase.regionserver.HStore.refreshStoreFilesInternal(HStore.java:652)
--
3 times (0% total cpu)
"RpcServer.Priority.general.handler=3,queue=1,port=60020" #129 daemon prio=5 os_prio=0 tid=0x00007fc547190800 nid=0x3f2a waiting on condition [0x00007fc4f6178000] cpu=0.0 start=2019-06-20 19:31:47
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x0000000723bbee00> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
        at org.apache.hadoop.hdfs.DFSInputStream.getFirstToComplete(DFSInputStream.java:1359)
        at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1327)
        at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1468)
--
5 times (0% total cpu)
"RpcServer.Priority.general.handler=2,queue=0,port=60020" #128 daemon prio=5 os_prio=0 tid=0x00007fc54718f000 nid=0x3f29 in Object.wait() [0x00007fc4f6279000] cpu=0.0 start=2019-06-20 19:31:47
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
        - locked <0x00000004c0878978> (a org.apache.hadoop.hbase.util.IdLock$Entry)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
--
7 times (0% total cpu)
"RpcServer.Priority.general.handler=1,queue=1,port=60020" #127 daemon prio=5 os_prio=0 tid=0x00007fc54718c800 nid=0x3f28 in Object.wait() [0x00007fc4f637a000] cpu=0.0 start=2019-06-20 19:31:47
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
        - locked <0x0000000723bbef88> (a org.apache.hadoop.hbase.util.IdLock$Entry)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
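In case anyone wants to reproduce this kind of tally on their own cluster, the per-state counts for priority handlers can be pulled out of plain jstack output with a short script. This is just a rough sketch (the function name and grouping heuristic are mine, not a standard tool):

```python
from collections import Counter

def priority_handler_states(jstack_text):
    """Count java.lang.Thread.State values for RpcServer.Priority handler
    threads in a jstack dump.  Thread headers start with a double quote
    containing the thread name; the state appears on the following
    'java.lang.Thread.State:' line."""
    states = Counter()
    in_priority_thread = False
    for line in jstack_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):
            # New thread header: note whether it is a priority handler.
            in_priority_thread = "RpcServer.Priority" in stripped
        elif in_priority_thread and stripped.startswith("java.lang.Thread.State:"):
            state = stripped.split(":", 1)[1].strip()
            states[state] += 1
            in_priority_thread = False
    return states
```

Run against the full dump, this shows every priority handler sitting in a WAITING state (parked on the store-file-refresh write lock, the HFile-block IdLock, or hedged-read completion) rather than burning CPU, which matches the 0% cpu figures above.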