Hello,
We're experimenting with meta table replication but have found that the
region servers hosting the replicas will get extremely high priority RPC
handler usage, sometimes all the way to 100% at which point clients start
to experience errors - the priority RPC handler usage is much higher than
the region servers hosting non-replicated meta ever experienced.
Has anyone seen this behavior? Are we doing something wrong?
Our settings changes are

hbase.regionserver.meta.storefile.refresh.period: 30000
hbase.meta.replica.count: 3
hbase.master.hfilecleaner.ttl: 600000

We're on cdh 5.9.0 (hbase 1.2)

We see some occasional errors like this one below (at the time I'm writing
this 30 minutes after the log entry, that file doesn't exist)
java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR,
self=/REDACTED:11702, remote=REDACTED:50010, for file
/hbase/data/hbase/meta/1588230740/info/85c3e4b648eb437a8ef62ff9b39f525d,
for pool BP-812715994-REDACTED-1441218207856 block 1083723183_9985162
        at
org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
        at
org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
        at
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
        at
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
        at
org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)

Doing a stack dump of a region server process to see what the priority
handlers are up to reveals the following:

1 times (0% total cpu)
"RpcServer.Priority.general.handler=12,queue=0,port=60020" #138 daemon
prio=5 os_prio=0 tid=0x00007fc5471a3000 nid=0x3f33 waiting on condition
[0x00007fc4f5870000] cpu=0.0 start=2019-06-20 19:31:47
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000003d94b6f18> (a
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
at
org.apache.hadoop.hbase.regionserver.HStore.replaceStoreFiles(HStore.java:1348)
at
org.apache.hadoop.hbase.regionserver.HStore.refreshStoreFilesInternal(HStore.java:652)
--

3 times (0% total cpu)
"RpcServer.Priority.general.handler=3,queue=1,port=60020" #129 daemon
prio=5 os_prio=0 tid=0x00007fc547190800 nid=0x3f2a waiting on condition
[0x00007fc4f6178000] cpu=0.0 start=2019-06-20 19:31:47
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0000000723bbee00> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at
java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
at
org.apache.hadoop.hdfs.DFSInputStream.getFirstToComplete(DFSInputStream.java:1359)
at
org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1327)
at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1468)
--

5 times (0% total cpu)
"RpcServer.Priority.general.handler=2,queue=0,port=60020" #128 daemon
prio=5 os_prio=0 tid=0x00007fc54718f000 nid=0x3f29 in Object.wait()
[0x00007fc4f6279000] cpu=0.0 start=2019-06-20 19:31:47
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
- locked <0x00000004c0878978> (a org.apache.hadoop.hbase.util.IdLock$Entry)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
at
org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
--

7 times (0% total cpu)
"RpcServer.Priority.general.handler=1,queue=1,port=60020" #127 daemon
prio=5 os_prio=0 tid=0x00007fc54718c800 nid=0x3f28 in Object.wait()
[0x00007fc4f637a000] cpu=0.0 start=2019-06-20 19:31:47
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
- locked <0x0000000723bbef88> (a org.apache.hadoop.hbase.util.IdLock$Entry)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
at
org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)

Reply via email to