[ https://issues.apache.org/jira/browse/HBASE-25677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Stack resolved HBASE-25677.
-----------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed

Merged to branch-2.3+.

> Server+table counters on each scan #nextRaw invocation become a bottleneck under heavy load
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-25677
>                 URL: https://issues.apache.org/jira/browse/HBASE-25677
>             Project: HBase
>          Issue Type: Sub-task
>          Components: metrics
>    Affects Versions: 2.3.2
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.5, 2.4.3
>
>
> On a heavily loaded server mostly doing reads/scans, I saw that 90+% of handlers were BLOCKED in this fashion in thread dumps:
> {code}
> "RpcServer.default.FPBQ.Fifo.handler=117,queue=17,port=16020" #161 daemon prio=5 os_prio=0 tid=0x00007f748757f000 nid=0x73e9 waiting for monitor entry [0x00007f74783e0000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1674)
>     - waiting to lock <0x00007f7647e3cc38> (a java.util.concurrent.ConcurrentHashMap$Node)
>     at org.apache.hadoop.hbase.regionserver.MetricsTableQueryMeterImpl.getOrCreateTableMeter(MetricsTableQueryMeterImpl.java:80)
>     at org.apache.hadoop.hbase.regionserver.MetricsTableQueryMeterImpl.updateTableReadQueryMeter(MetricsTableQueryMeterImpl.java:90)
>     at org.apache.hadoop.hbase.regionserver.RegionServerTableMetrics.updateTableReadQueryMeter(RegionServerTableMetrics.java:89)
>     at org.apache.hadoop.hbase.regionserver.MetricsRegionServer.updateReadQueryMeter(MetricsRegionServer.java:274)
>     at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6742)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3319)
>     - locked <0x00007f896c0165a0> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3566)
>     at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:44858)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:393)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> {code}
> This kept up for long stretches of time.
> I saw it to a lesser extent on other servers with less load.
> These RegionServers had 400+ Regions, a good few of which were serving scan reads; the server was doing ~1M hits a second. In this scenario, I saw the above bottleneck.
> Looking at it, the bottleneck came in when the parent issue feature was added. That feature introduced these read counts alongside the existing write counts, and the write counts are mostly batch-based. Let me do the same thing here for reads: update the central server+table count once after the scan is done rather than on each invocation of #nextRaw.
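For illustration, here is a minimal, self-contained sketch of the batching idea, assuming the per-table meters live in a ConcurrentHashMap as the MetricsTableQueryMeterImpl frames in the stack trace suggest. The class and method names (TableReadMeters, updatePerRow, updateBatched) are hypothetical; this is not the actual patch:

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical model of the contended table -> meter map; not HBase code.
public final class TableReadMeters {
  private final ConcurrentHashMap<String, LongAdder> meters = new ConcurrentHashMap<>();

  // Before: called once per RegionScannerImpl#nextRaw. Every scan handler
  // funnels through computeIfAbsent on the same hot key; on JDK 8 that may
  // lock the bin's first node even when the key is already present.
  public void updatePerRow(String table) {
    meters.computeIfAbsent(table, t -> new LongAdder()).increment();
  }

  // After: called once per scan RPC with the locally accumulated row count,
  // so the contended map lookup runs once per scan instead of once per row.
  public void updateBatched(String table, long rows) {
    meters.computeIfAbsent(table, t -> new LongAdder()).add(rows);
  }

  public long sum(String table) {
    LongAdder adder = meters.get(table);
    return adder == null ? 0L : adder.sum();
  }

  // Caller-side sketch of the scan loop after the change.
  public static void main(String[] args) {
    TableReadMeters meters = new TableReadMeters();
    long rowsScanned = 0;
    for (int i = 0; i < 1000; i++) { // stands in for the nextRaw loop
      rowsScanned++;                 // count locally; no meter update per row
    }
    meters.updateBatched("t1", rowsScanned); // single flush after the scan
    System.out.println(meters.sum("t1"));    // prints 1000
  }
}
{code}

The point is not that the increment itself gets cheaper (LongAdder increments already scale well under contention); it is that the computeIfAbsent lookup, which every handler reading the same table contends on, drops from once per row to once per scan call.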