Hi, HBase Users

I ran into a RegionServer dead case and want to identify which HBase
table actually caused the problem.

Based on the HBase JMX REST service (see the rough polling sketch after
this list), what I have on hand is:
1. Table-level Grafana monitoring for Put, Get, Scan, MemStoreSize
2. RegionServer-level Grafana monitoring for GC and RegionCount
3. RegionServer logs
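
For context, here is a minimal sketch of the kind of /jmx polling behind
those dashboards. The host, the info port 60030, and the memStoreSize
attribute names are assumptions and vary by HBase version; treat it as a
starting point rather than the exact collector:

import json
import urllib.request

# Hypothetical RegionServer info endpoint; 60030 is a common default info
# port on HBase 1.x. Adjust host/port to your deployment.
JMX_URL = "http://fds-hadoop-prod06-mp:60030/jmx"

def dump_memstore_metrics():
    """Fetch the RegionServer /jmx JSON and print anything that looks like
    a memstore-size metric. Exact bean/attribute names differ by version
    (table-level metrics may appear under a 'sub=Tables' bean), so this
    simply greps the attribute names."""
    with urllib.request.urlopen(JMX_URL, timeout=10) as resp:
        data = json.load(resp)
    for bean in data.get("beans", []):
        for key, value in bean.items():
            if "memStoreSize" in key:
                print("%s  %s = %s" % (bean.get("name", "?"), key, value))

if __name__ == "__main__":
    dump_memstore_metrics()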

*HRegionServer java process information*
/usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver
-XX:OnOutOfMemoryError=kill -9 %p -Djava.net.preferIPv4Stack=true
-Xms34359738368 -Xmx34359738368 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-Xloggc:/pang/log/hbase/gc-hbase.log -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M
-XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
-Dhbase.log.dir=/pang/log/hbase
-Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-fds-hadoop-prod06-mp.log.out
-Dhbase.home.dir=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase
-Dhbase.id.str= -Dhbase.root.logger=WARN,RFA
-Djava.library.path=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hadoop/lib/native:/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase/lib/native/Linux-amd64-64
-Dhbase.security.logger=INFO,RFAS
org.apache.hadoop.hbase.regionserver.HRegionServer start
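
Since the process already writes a GC log to /pang/log/hbase/gc-hbase.log
with -XX:+PrintGCDetails, the long pause can also be confirmed without JMX
by scanning that file. A minimal sketch, assuming the usual HotSpot
"[Times: user=... sys=..., real=X.XX secs]" trailer (adjust the regex if
your JVM prints a different format):

import re
import sys

# Report stop-the-world GC events whose wall-clock ("real") time exceeds a
# threshold, e.g. the ~152s CMS pause shown below.
PAUSE_RE = re.compile(r"real=(\d+\.\d+) secs")
THRESHOLD_SECS = 10.0

def long_pauses(path):
    with open(path) as gc_log:
        for line in gc_log:
            match = PAUSE_RE.search(line)
            if match and float(match.group(1)) >= THRESHOLD_SECS:
                print(line.rstrip())

if __name__ == "__main__":
    # e.g. python find_long_gc_pauses.py /pang/log/hbase/gc-hbase.log
    long_pauses(sys.argv[1])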

*RegionServer logs show a GC pause*
2016-09-21 11:46:19,263 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception  for block
BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.
vintPrefixed(PBHelper.java:2241)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(
PipelineAck.java:235)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$
ResponseProcessor.run(DFSOutputStream.java:971)
2016-09-21 11:46:19,274 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block
BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582
in pipeline 
DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-4226-be0b-1328075bbdd7,DISK],
DatanodeInfoWithStorage[10.10.9.19:50010,DS-c06678d6-3f0d-40a2-82e2-62387e26af7b,DISK],
DatanodeInfoWithStorage[10.10.9.21:50010,DS-7bf90b1e-7862-4cc1-a121-31768021a3aa,DISK]:
bad datanode DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-
4226-be0b-1328075bbdd7,DISK]
2016-09-21 11:46:19,266 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor:
*Detected pause in JVM or host machine (eg GC): pause of approximately 152322ms
GC pool 'ParNew' had collection(s): count=1 time=465ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=152238ms*
2016-09-21 11:46:19,263 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
153423ms instead of 3000ms, this is likely due to a long garbage collecting
pause and it's usually bad, see http://hbase.apache.org/book.
html#trouble.rs.runtime.zkexpired
2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 156121ms for sessionid
0x255c4a5f76332f6, closing socket connection and attempting reconnect
2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 153414ms for sessionid
0x355c4a5f3c2469f, closing socket connection and attempting reconnect
2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 156121ms for sessionid
0x255c4a5f763332c, closing socket connection and attempting reconnect

*RegionServer logs show the server being declared dead:*
2016-09-21 11:46:19,485 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer:
ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107:
*org.apache.hadoop.hbase.YouAreDeadException*: Server REPORT rejected;
currently processing fds-hadoop-prod06-mp,60020,1473551672107 as *dead
server*
    at org.apache.hadoop.hbase.master.ServerManager.
checkIsDead(ServerManager.java:382)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(
ServerManager.java:287)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(
MasterRpcServices.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$
RegionServerStatusService$2.callBlockingMethod(
RegionServerStatusProtos.java:7912)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(
RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)

org.apache.hadoop.hbase.YouAreDeadException:
org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing
fds-hadoop-prod06-mp,60020,1473551672107
as dead server
    at org.apache.hadoop.hbase.master.ServerManager.
checkIsDead(ServerManager.java:382)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(
ServerManager.java:287)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(
MasterRpcServices.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$
RegionServerStatusService$2.callBlockingMethod(
RegionServerStatusProtos.java:7912)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(
RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(
NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(
RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(
RemoteException.java:95)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(
ProtobufUtil.java:328)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.
tryRegionServerReport(HRegionServer.java:1092)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.
run(HRegionServer.java:900)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.
apache.hadoop.hbase.YouAreDeadException):
org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing
fds-hadoop-prod06-mp,60020,1473551672107
as dead server
    at org.apache.hadoop.hbase.master.ServerManager.
checkIsDead(ServerManager.java:382)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(
ServerManager.java:287)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(
MasterRpcServices.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$
RegionServerStatusService$2.callBlockingMethod(
RegionServerStatusProtos.java:7912)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(
RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)

    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(
RpcClientImpl.java:1219)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(
AbstractRpcClient.java:216)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$
BlockingRpcChannelImplementation.callBlockingMethod(
AbstractRpcClient.java:300)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$
RegionServerStatusService$BlockingStub.regionServerReport(
RegionServerStatusProtos.java:8289)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.
tryRegionServerReport(HRegionServer.java:1090)
    ... 2 more
2016-09-21 11:46:19,505 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer:
RegionServer abort: loaded coprocessors are: [org.apache.phoenix.
coprocessor.ServerCachingEndpointImpl,
org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, org.apache.phoenix.
coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.
coprocessor.GroupedAggregateRegionObserver,
org.apache.phoenix.coprocessor.ScanRegionObserver,
org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.
SequenceRegionObserver]
2016-09-21 11:46:19,498 FATAL org.apache.hadoop.hbase.regionserver.LogRoller:
Aborting
java.io.IOException: cannot get log writer


*Error level RegionServer logs when the problem happened*
2016-09-21 11:46:19,347 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog:
Error syncing, request close of WAL
2016-09-21 11:46:19,348 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog:
Error syncing, request close of WAL
2016-09-21 11:46:19,350 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog:
Error syncing, request close of WAL
2016-09-21 11:46:19,349 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog:
Error syncing, request close of WAL
2016-09-21 11:46:19,385 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog:
Error syncing, request close of WAL
2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog:
Error syncing, request close of WAL
2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog:
Error syncing, request close of WAL
2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog:
Error syncing, request close of WAL
2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog:
Error syncing, request close of WAL
2016-09-21 11:46:19,384 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog:
Error syncing, request close of WAL
2016-09-21 11:46:20,048 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 692752
2016-09-21 11:46:20,103 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 1831544
2016-09-21 11:46:20,111 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 44967984
2016-09-21 11:46:20,131 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 2334208
2016-09-21 11:46:20,308 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 4393560
2016-09-21 11:46:20,334 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 2905432
2016-09-21 11:46:20,373 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 2310808
2016-09-21 11:46:20,402 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 1112648
2016-09-21 11:46:20,422 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 6578048
2016-09-21 11:46:20,437 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 59242936
2016-09-21 11:46:20,608 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 4482224
2016-09-21 11:46:20,624 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 20144
2016-09-21 11:46:20,652 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 23568
2016-09-21 11:46:20,687 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 14395336
2016-09-21 11:46:20,706 ERROR org.apache.hadoop.hbase.regionserver.HRegion:
Memstore size is 14025824
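
For what it's worth, while /jmx was unreachable those "Memstore size is N"
abort lines can be totalled straight from the RegionServer log to get a
rough view of how much memstore was pending when the server died. A minimal
sketch; the log path in the comment is just an example:

import re
import sys

# Sum the "Memstore size is N" ERROR lines that HRegion prints while the
# server aborts. The lines do not name the region/table, so this only
# gives totals and the largest single memstore.
MEMSTORE_RE = re.compile(r"Memstore size is (\d+)")

def summarize(path):
    sizes = []
    with open(path) as log:
        for line in log:
            match = MEMSTORE_RE.search(line)
            if match:
                sizes.append(int(match.group(1)))
    if sizes:
        print("regions reported:        %d" % len(sizes))
        print("total memstore bytes:    %d" % sum(sizes))
        print("largest single memstore: %d" % max(sizes))

if __name__ == "__main__":
    # e.g. python memstore_at_abort.py hbase-cmf-hbase-REGIONSERVER-fds-hadoop-prod06-mp.log.out
    summarize(sys.argv[1])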

Attached are some metrics screenshots captured when the problem happened.
All the metrics have gaps because the RegionServer was not able to handle
/jmx requests at all.

1. One screenshot for MutateCount (Put count) at table level
http://img0.ph.126.net/coIWlwP2w44jnl1kU5CbcA==/6631704382933867248.png
2. One screenshot for RegionServer level GC Time
http://img0.ph.126.net/ouciVyCvpqEkMPV3phnx5A==/6631661501980397103.png
3. One screenshot for Max Region MemStoreSize at table level
http://img0.ph.126.net/rEQhCk-VWx3HGbcZCM51iA==/6631798940933860065.png
4. One screenshot for RegionCount for each RegionServer
http://img0.ph.126.net/JKPA4-7NJEkFc3_TxOv63A==/6631679094166431135.png

By the way, all the legends in the screenshots are ordered by max value, so
the metric series with the largest max value shows last.

With all this information, what I have so far is:
1. A lot of tables show a sudden Put increase after the RegionServer came
back into service. My understanding is that the streaming data source could
not deliver everything while the problem was happening, so roughly twice as
much data arrived a little later; those tables do not seem to be the root
cause.
2. From the GC metrics it is pretty easy to identify which RegionServer had
the problem, and all the logs above were captured from that server.
3. All the RegionServers except the dead one show a sudden region count
increase and then decrease.
4. From the max MemStoreSize metric screenshot, I am inclined to think the
table with the largest MemStoreSize is the root cause, but the MutateCount
metrics do not show a peak on that table before or after the problem, so I
am not confident about this judgement.

I need help on:
1. Identifying which table caused the RegionServer death
2. Why the region count shows a sudden increase and then decrease on all
the servers that did not die
3. Any other metrics I have missed that are important for debugging
hotspotting cases
4. How I can get metrics when the RegionServer is not able to serve /jmx
requests

Thanks a lot in advance, and please let me know if you need any other
information.
