Hey Stack, thanks for your response. We were thinking that meta replication would make a hot meta table on a busy cluster less hot, not hotter. Is the meta replication feature aimed at uptime rather than performance, given that, as you said, it will "up the temperature even more"?
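For context, here is my mental model of how replica reads work on the client side -- a minimal sketch using the standard region-replica read API (Consistency.TIMELINE) against a made-up user table ("my_table" and the row key below are just placeholders); I'm assuming meta replicas are the same machinery applied to hbase:meta, so please correct me if that's wrong:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReplicaReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("my_table"))) {
      Get get = new Get(Bytes.toBytes("some-row"));
      // TIMELINE lets a secondary replica answer the read (possibly stale);
      // the default, STRONG, always goes to the primary region.
      get.setConsistency(Consistency.TIMELINE);
      Result result = table.get(get);
      // isStale() reports whether a secondary actually served this read.
      System.out.println("served by a secondary replica: " + result.isStale());
    }
  }
}

If that model is right, replicas mainly buy the ability to keep serving (possibly stale) reads when the primary is slow or down, i.e. availability rather than a way to spread the primary's load -- which is what prompted the question above.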
I was surprised that all the region servers were reading the same hfile from disk, rather than there being some mechanism that would let each host of a meta region replica keep its own copy with high locality and less IdLock contention. In my brief experience experimenting with the feature, the region servers needed to be killed to recover, so they may be hitting that IdLock issue even though we're on 1.2 -- the IdLock code looks roughly the same in this older version.

Given your statement about loading the temperature up even more: if our goal is to spread out the performance impact of the meta region, will tweaking cache size and I/O performance be time well spent? (A small read-only sketch of the cache knobs I have in mind is at the bottom of this mail, below the quoted thread.)

Thank you,
James

Thanks James for your response too, I hope you're doing well :) We have a bunch of cold-meta-cache hadoop jobs that are indeed likely contributing to the load on meta.

On Fri, Jun 28, 2019 at 12:14 AM Stack <st...@duboce.net> wrote:

> On Wed, Jun 26, 2019 at 6:47 AM James Kebinger <jkebin...@hubspot.com>
> wrote:
>
> > Hello,
> > We're experimenting with meta table replication
>
> Replicating the meta table itself? What you thinking?
>
> > but have found that the region servers hosting the replicas will get
> > extremely high priority RPC handler usage, sometimes all the way to
> > 100%, at which point clients start to experience errors - the priority
> > RPC handler usage is much higher than the region servers hosting
> > non-replicated meta ever experienced.
>
> Meta Table will be hot on a busy cluster. Adding replication load will up
> the temperature even more.
>
> Check out James Moore's suggestions. An HDFS issue could occupy a handler
> while it holds the id lock on a popular block.
>
> Can you figure out who is trying to read the removed file?
>
> Can't tell for sure from the snippet below, but are handlers frequently
> going out to HDFS to pull in blocks rather than fetching from block
> cache? How is the server-side cache on the server hosting meta? Can you
> get more meta up into block cache?
>
> If you disable replication, all settles down?
>
> S
>
> > Has anyone seen this behavior? Are we doing something wrong?
> > Our settings changes are
> >
> > hbase.regionserver.meta.storefile.refresh.period: 30000
> > hbase.meta.replica.count: 3
> > hbase.master.hfilecleaner.ttl: 600000
> >
> > We're on cdh 5.9.0 (hbase 1.2)
> >
> > We see some occasional errors like this one below (at the time I'm
> > writing this, 30 minutes after the log entry, that file doesn't exist)
> >
> > java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR,
> > self=/REDACTED:11702, remote=REDACTED:50010, for file
> > /hbase/data/hbase/meta/1588230740/info/85c3e4b648eb437a8ef62ff9b39f525d,
> > for pool BP-812715994-REDACTED-1441218207856 block 1083723183_9985162
> > at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
> > at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
> > at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
> > at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
> > at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
> >
> > Doing a stack dump of a region server process to see what the priority
> > handlers are up to reveals the following:
> >
> > 1 times (0% total cpu)
> > "RpcServer.Priority.general.handler=12,queue=0,port=60020" #138 daemon
> > prio=5 os_prio=0 tid=0x00007fc5471a3000 nid=0x3f33 waiting on condition
> > [0x00007fc4f5870000] cpu=0.0 start=2019-06-20 19:31:47
> > java.lang.Thread.State: WAITING (parking)
> > at sun.misc.Unsafe.park(Native Method)
> > - parking to wait for <0x00000003d94b6f18> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> > at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> > at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
> > at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
> > at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
> > at org.apache.hadoop.hbase.regionserver.HStore.replaceStoreFiles(HStore.java:1348)
> > at org.apache.hadoop.hbase.regionserver.HStore.refreshStoreFilesInternal(HStore.java:652)
> > --
> >
> > 3 times (0% total cpu)
> > "RpcServer.Priority.general.handler=3,queue=1,port=60020" #129 daemon
> > prio=5 os_prio=0 tid=0x00007fc547190800 nid=0x3f2a waiting on condition
> > [0x00007fc4f6178000] cpu=0.0 start=2019-06-20 19:31:47
> > java.lang.Thread.State: WAITING (parking)
> > at sun.misc.Unsafe.park(Native Method)
> > - parking to wait for <0x0000000723bbee00> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> > at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> > at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> > at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> > at org.apache.hadoop.hdfs.DFSInputStream.getFirstToComplete(DFSInputStream.java:1359)
> > at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1327)
> > at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1468)
> > --
> >
> > 5 times (0% total cpu)
> > "RpcServer.Priority.general.handler=2,queue=0,port=60020" #128 daemon
> > prio=5 os_prio=0 tid=0x00007fc54718f000 nid=0x3f29 in Object.wait()
> > [0x00007fc4f6279000] cpu=0.0 start=2019-06-20 19:31:47
> > java.lang.Thread.State: WAITING (on object monitor)
> > at java.lang.Object.wait(Native Method)
> > at java.lang.Object.wait(Object.java:502)
> > at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
> > - locked <0x00000004c0878978> (a org.apache.hadoop.hbase.util.IdLock$Entry)
> > at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
> > at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
> > at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
> > at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
> > at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
> > --
> >
> > 7 times (0% total cpu)
> > "RpcServer.Priority.general.handler=1,queue=1,port=60020" #127 daemon
> > prio=5 os_prio=0 tid=0x00007fc54718c800 nid=0x3f28 in Object.wait()
> > [0x00007fc4f637a000] cpu=0.0 start=2019-06-20 19:31:47
> > java.lang.Thread.State: WAITING (on object monitor)
> > at java.lang.Object.wait(Native Method)
> > at java.lang.Object.wait(Object.java:502)
> > at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:81)
> > - locked <0x0000000723bbef88> (a org.apache.hadoop.hbase.util.IdLock$Entry)
> > at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:407)
> > at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
> > at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
> > at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
> > at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
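Below the quote, as promised above: a small read-only sketch of the cache knobs I mean when I ask about "tweaking cache size" -- it just prints the current hfile.block.cache.size from hbase-site.xml and the IN_MEMORY / PREFETCH_BLOCKS_ON_OPEN attributes on meta's column family. Nothing here changes any settings, and the printed default (0.4) is just my understanding of the stock 1.x value; treat it as a starting point for discussion rather than a recommendation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class MetaCacheCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Fraction of regionserver heap given to the LRU block cache
    // (hfile.block.cache.size). Raising this is one of the "tweak cache
    // size" options in question.
    System.out.println("hfile.block.cache.size = "
        + conf.getFloat("hfile.block.cache.size", 0.4f));
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor meta = admin.getTableDescriptor(TableName.META_TABLE_NAME);
      for (HColumnDescriptor family : meta.getColumnFamilies()) {
        // IN_MEMORY keeps meta blocks in the cache's in-memory tier once
        // read; PREFETCH_BLOCKS_ON_OPEN would warm the cache when a region
        // opens.
        System.out.println(family.getNameAsString()
            + ": inMemory=" + family.isInMemory()
            + ", prefetchBlocksOnOpen=" + family.isPrefetchBlocksOnOpen());
      }
    }
  }
}

If the handlers really are going out to HDFS for meta blocks, bumping the block cache fraction and/or prefetching meta blocks on open look like the obvious first levers to me, but I'd welcome any correction on that.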