Hi HBase users,

I encountered a RegionServer dead case and want to identify which HBase table actually caused the problem.
Based on the HBase JMX RESTful service, what I have on hand is:
1. Table-level Grafana monitoring for Put, Get, Scan, MemStoreSize
2. RegionServer-level Grafana monitoring for GC and RegionCount
3. RegionServer logs

*HRegionServer java process information*

/usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver
  -XX:OnOutOfMemoryError=kill -9 %p -Djava.net.preferIPv4Stack=true
  -Xms34359738368 -Xmx34359738368
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
  -Xloggc:/pang/log/hbase/gc-hbase.log
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M
  -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
  -Dhbase.log.dir=/pang/log/hbase
  -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-fds-hadoop-prod06-mp.log.out
  -Dhbase.home.dir=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase
  -Dhbase.id.str= -Dhbase.root.logger=WARN,RFA
  -Djava.library.path=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hadoop/lib/native:/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase/lib/native/Linux-amd64-64
  -Dhbase.security.logger=INFO,RFAS
  org.apache.hadoop.hbase.regionserver.HRegionServer start

*RegionServer logs show a GC pause*

2016-09-21 11:46:19,263 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:971)
2016-09-21 11:46:19,274 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582 in pipeline DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-4226-be0b-1328075bbdd7,DISK], DatanodeInfoWithStorage[10.10.9.19:50010,DS-c06678d6-3f0d-40a2-82e2-62387e26af7b,DISK], DatanodeInfoWithStorage[10.10.9.21:50010,DS-7bf90b1e-7862-4cc1-a121-31768021a3aa,DISK]: bad datanode DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-4226-be0b-1328075bbdd7,DISK]
2016-09-21 11:46:19,266 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor: *Detected pause in JVM or host machine (eg GC): pause of approximately 152322ms
GC pool 'ParNew' had collection(s): count=1 time=465ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=152238ms*
2016-09-21 11:46:19,263 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 153423ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
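Side note on the JvmPauseMonitor warning above: as I understand it, the detection technique is simply to sleep for a short, fixed interval and compare against the wall-clock time that actually elapsed; the excess is attributed to a JVM or host pause. A standalone sketch of the idea (my own illustration, not the actual HBase class):

    // Minimal pause detector in the spirit of JvmPauseMonitor:
    // sleep a fixed interval, then treat any extra elapsed time as a
    // JVM/host pause (GC, swapping, VM freeze, ...).
    public class PauseSketch {
        public static void main(String[] args) throws InterruptedException {
            final long intervalMs = 500;   // intended sleep time
            final long warnMs = 10000;     // report pauses longer than this
            while (true) {
                long start = System.nanoTime();
                Thread.sleep(intervalMs);
                long elapsedMs = (System.nanoTime() - start) / 1000000L;
                long pauseMs = elapsedMs - intervalMs;
                if (pauseMs > warnMs) {
                    System.err.println("Detected pause of approximately " + pauseMs + "ms");
                }
            }
        }
    }

A pause of 152322ms like the one logged is far beyond any ZooKeeper session timeout, which matches the session expiries that follow.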
2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 156121ms for sessionid 0x255c4a5f76332f6, closing socket connection and attempting reconnect
2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 153414ms for sessionid 0x355c4a5f3c2469f, closing socket connection and attempting reconnect
2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 156121ms for sessionid 0x255c4a5f763332c, closing socket connection and attempting reconnect

*RegionServer logs show server dead:*

2016-09-21 11:46:19,485 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107: *org.apache.hadoop.hbase.YouAreDeadException*: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as *dead server*
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:328)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1092)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:900)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1090)
    ... 2 more
2016-09-21 11:46:19,505 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.hadoop.hbase.regionserver.LocalIndexMerger, org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.SequenceRegionObserver]
2016-09-21 11:46:19,498 FATAL org.apache.hadoop.hbase.regionserver.LogRoller: Aborting
java.io.IOException: cannot get log writer

*Error level RegionServer logs when the problem happened*

2016-09-21 11:46:19,347 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
2016-09-21 11:46:19,348 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
2016-09-21 11:46:19,350 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
2016-09-21 11:46:19,349 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
2016-09-21 11:46:19,385 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
2016-09-21 11:46:19,384 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
2016-09-21 11:46:20,048 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 692752
2016-09-21 11:46:20,103 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1831544
2016-09-21 11:46:20,111 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 44967984
2016-09-21 11:46:20,131 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2334208
2016-09-21 11:46:20,308 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 4393560
2016-09-21 11:46:20,334 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2905432
2016-09-21 11:46:20,373 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2310808
2016-09-21 11:46:20,402 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1112648
2016-09-21 11:46:20,422 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 6578048
2016-09-21 11:46:20,437 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 59242936
2016-09-21 11:46:20,608 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 4482224
2016-09-21 11:46:20,624 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 20144
2016-09-21 11:46:20,652 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 23568
2016-09-21 11:46:20,687 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 14395336
2016-09-21 11:46:20,706 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 14025824

Attached are some metrics screenshots captured when the problem happened. All the metrics have gaps because the RegionServer was not able to handle /jmx requests at all.
1. MutateCount (Put count) at table level: http://img0.ph.126.net/coIWlwP2w44jnl1kU5CbcA==/6631704382933867248.png
2. GC time at RegionServer level: http://img0.ph.126.net/ouciVyCvpqEkMPV3phnx5A==/6631661501980397103.png
3. Max region MemStoreSize at table level: http://img0.ph.126.net/rEQhCk-VWx3HGbcZCM51iA==/6631798940933860065.png
4. RegionCount for each RegionServer: http://img0.ph.126.net/JKPA4-7NJEkFc3_TxOv63A==/6631679094166431135.png

By the way, all the legends in the screenshots are ordered by max value, so the series with the largest max value shows last.

With all this information, what I can see so far is:
1. A lot of tables show a sudden Put increase after the RegionServer came back into service. My understanding is that the data-source streaming could not process all the data while the problem was happening, so roughly twice the usual volume accumulated and arrived later; those tables do not seem to be the root cause.
2. From the GC metrics it is easy to identify which RegionServer had the problem, and all the logs above were captured from that server.
3. All the RegionServers show a sudden region count increase and then decrease, except the dead one.
4. From the max MemStoreSize screenshot, I am inclined to think the table with the largest MemStoreSize is the root cause, but the MutateCount metric does not show a peak on that table before or after the problem, so I am not confident about this judgement.

I need help on:
1. Identifying which table caused the RegionServer death (see the first sketch below for a check I plan to run).
2. Why RegionCount has a sudden increase and decrease on all the servers that did not die.
3. Any other metrics I have missed that are important for debugging hotspotting cases.
4. How I can get metrics when the RegionServer is not able to serve the JMX RESTful service (see the second sketch below).

Thanks a lot in advance, and please let me know if you need any other information.
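For item 1, once the server is back I plan to dump per-region memstore sizes through the client API and aggregate them by table, to see whether one table's regions dominate. A minimal sketch against the HBase 1.x client (assumes hbase-site.xml is on the classpath; not a polished tool):

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.ClusterStatus;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.RegionLoad;
    import org.apache.hadoop.hbase.ServerLoad;
    import org.apache.hadoop.hbase.ServerName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class MemStoreByRegion {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                ClusterStatus status = admin.getClusterStatus();
                for (ServerName sn : status.getServers()) {
                    ServerLoad sl = status.getLoad(sn);
                    for (Map.Entry<byte[], RegionLoad> e : sl.getRegionsLoad().entrySet()) {
                        RegionLoad rl = e.getValue();
                        // The region name is prefixed with the table name,
                        // so the output can be aggregated per table.
                        System.out.println(sn.getServerName() + "\t"
                            + rl.getNameAsString() + "\t"
                            + rl.getMemStoreSizeMB() + " MB");
                    }
                }
            }
        }
    }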
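For item 4, when the /jmx HTTP endpoint stops answering, one option I am considering is reading the same GC beans over the JVM's own remote JMX port instead, since that does not depend on the HBase web service being healthy (it will still stall during a stop-the-world pause, of course). This assumes the RegionServer is started with remote JMX enabled, e.g. -Dcom.sun.management.jmxremote.port=10102, which is not in my current command line; a minimal sketch:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RemoteGcStats {
        public static void main(String[] args) throws Exception {
            // Host and port are placeholders for my RegionServer.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://fds-hadoop-prod06-mp:10102/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
                for (String pool : new String[] {"ParNew", "ConcurrentMarkSweep"}) {
                    GarbageCollectorMXBean gc = ManagementFactory.newPlatformMXBeanProxy(
                        mbsc, "java.lang:type=GarbageCollector,name=" + pool,
                        GarbageCollectorMXBean.class);
                    System.out.println(pool + ": count=" + gc.getCollectionCount()
                        + " time=" + gc.getCollectionTime() + "ms");
                }
            } finally {
                jmxc.close();
            }
        }
    }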