bq. the MutateCount metrics didn't show there is a peak time before and
after the problem on that table

In the graph there is a purple line with spikes; I assume this is not the
table with the max MemStoreSize. The legend in the graph is hard to read.
If the sizes of the Puts vary, that may explain what you observed across the
two graphs: a relatively small number of large Puts can grow a table's
memstore quickly without producing a visible peak in its MutateCount.
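
To confirm whether Put sizes vary, you could log the heap size of each Put on
the client side before writing. Below is a rough, untested sketch against the
HBase 1.x client API; the table, row, family and qualifier names are
placeholders, not anything from your setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutSizeLogger {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("my_table"))) {  // placeholder table
      Put put = new Put(Bytes.toBytes("row-1"));                        // placeholder row key
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
      // heapSize() approximates the memstore footprint this Put will add on the
      // region server; logging it per write or per batch shows whether Put sizes
      // vary widely between tables.
      System.out.println("Put heap size (bytes): " + put.heapSize());
      table.put(put);
    }
  }
}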

For question #2, were there region splits corresponding to the increase?
Please check the master log. You can pastebin the master log if needed.
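
If you want to cross-check programmatically in addition to reading the master
log, a sketch like the one below (untested; the table name is a placeholder)
lists a table's regions through the Admin API, so you can compare the region
count and start keys before and after the incident:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLister {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      TableName table = TableName.valueOf("my_table");  // placeholder table
      // A split replaces one parent region with two daughters, so a growing
      // count here (compared to an earlier snapshot) means splits happened.
      for (HRegionInfo region : admin.getTableRegions(table)) {
        System.out.println(region.getRegionNameAsString()
            + " startKey=" + Bytes.toStringBinary(region.getStartKey()));
      }
      System.out.println("Region count: " + admin.getTableRegions(table).size());
    }
  }
}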

For #4, if JMX doesn't respond, one option is to check the region server log
as a workaround.
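
Another workaround (again just an untested sketch) is to pull per-region load
from the master's ClusterStatus through the Java client, which does not depend
on the region server's /jmx servlet being responsive. Note the numbers reflect
the last report each region server managed to send to the master:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionLoad;
import org.apache.hadoop.hbase.ServerLoad;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class MemstoreFromMaster {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      ClusterStatus status = admin.getClusterStatus();
      for (ServerName server : status.getServers()) {
        ServerLoad load = status.getLoad(server);
        // Per-region load as last reported to the master; useful when the
        // region server itself is too deep in a GC pause to answer /jmx.
        for (Map.Entry<byte[], RegionLoad> entry : load.getRegionsLoad().entrySet()) {
          RegionLoad rl = entry.getValue();
          System.out.println(server + " " + rl.getNameAsString()
              + " memstoreSizeMB=" + rl.getMemStoreSizeMB());
        }
      }
    }
  }
}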

On Thu, Sep 22, 2016 at 9:39 AM, Proust(Proust/Feng Guizhou) [Saipan] <pf...@coupang.com> wrote:

> Hi, HBase Users
>
> I encountered a RegionServer dead case, and I want to identify which HBase
> table actually caused the problem.
>
> Based on the HBase JMX Restful service, what I have on hand is:
> 1. Table-level Grafana monitoring for Put, Get, Scan, MemStoreSize
> 2. RegionServer-level Grafana monitoring for GC and RegionCount
> 3. RegionServer logs
>
> *HRegionServer java process information*
> /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver
> -XX:OnOutOfMemoryError=kill -9 %p -Djava.net.preferIPv4Stack=true
> -Xms34359738368 -Xmx34359738368 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> -Xloggc:/pang/log/hbase/gc-hbase.log -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M
> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
> -Dhbase.log.dir=/pang/log/hbase
> -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-fds-hadoop-prod06-mp.log.out
> -Dhbase.home.dir=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase
> -Dhbase.id.str= -Dhbase.root.logger=WARN,RFA
> -Djava.library.path=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hadoop/lib/native:/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase/lib/native/Linux-amd64-64
> -Dhbase.security.logger=INFO,RFAS
> org.apache.hadoop.hbase.regionserver.HRegionServer start
>
> *RegionServer logs show a GC pause*
> 2016-09-21 11:46:19,263 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582
> java.io.EOFException: Premature EOF: no length prefix available
>     at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:971)
> 2016-09-21 11:46:19,274 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
> BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582 in pipeline
> DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-4226-be0b-1328075bbdd7,DISK],
> DatanodeInfoWithStorage[10.10.9.19:50010,DS-c06678d6-3f0d-40a2-82e2-62387e26af7b,DISK],
> DatanodeInfoWithStorage[10.10.9.21:50010,DS-7bf90b1e-7862-4cc1-a121-31768021a3aa,DISK]:
> bad datanode DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-4226-be0b-1328075bbdd7,DISK]
> 2016-09-21 11:46:19,266 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor:
> *Detected pause in JVM or host machine (eg GC): pause of approximately 152322ms
> GC pool 'ParNew' had collection(s): count=1 time=465ms
> GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=152238ms*
> 2016-09-21 11:46:19,263 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 153423ms instead of 3000ms, this is likely due to a long garbage collecting
> pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 156121ms for sessionid
> 0x255c4a5f76332f6, closing socket connection and attempting reconnect
> 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 153414ms for sessionid
> 0x355c4a5f3c2469f, closing socket connection and attempting reconnect
> 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 156121ms for sessionid
> 0x255c4a5f763332c, closing socket connection and attempting reconnect
>
> *RegionServer logs show server dead:*
> 2016-09-21 11:46:19,485 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107:
> *org.apache.hadoop.hbase.YouAreDeadException*: Server REPORT rejected;
> currently processing fds-hadoop-prod06-mp,60020,1473551672107 as *dead server*
>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
>     at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException:
> Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
>     at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:328)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1092)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:900)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException):
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
>     at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1090)
>     ... 2 more
> 2016-09-21 11:46:19,505 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> RegionServer abort: loaded coprocessors are:
> [org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> org.apache.phoenix.coprocessor.ScanRegionObserver,
> org.apache.phoenix.hbase.index.Indexer,
> org.apache.phoenix.coprocessor.SequenceRegionObserver]
> 2016-09-21 11:46:19,498 FATAL org.apache.hadoop.hbase.regionserver.LogRoller: Aborting
> java.io.IOException: cannot get log writer
>
>
> *Error level RegionServer logs when the problem happened*
> 2016-09-21 11:46:19,347 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> 2016-09-21 11:46:19,348 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> 2016-09-21 11:46:19,350 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> 2016-09-21 11:46:19,349 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> 2016-09-21 11:46:19,385 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> 2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> 2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> 2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> 2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> 2016-09-21 11:46:19,384 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> 2016-09-21 11:46:20,048 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 692752
> 2016-09-21 11:46:20,103 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1831544
> 2016-09-21 11:46:20,111 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 44967984
> 2016-09-21 11:46:20,131 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2334208
> 2016-09-21 11:46:20,308 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 4393560
> 2016-09-21 11:46:20,334 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2905432
> 2016-09-21 11:46:20,373 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2310808
> 2016-09-21 11:46:20,402 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1112648
> 2016-09-21 11:46:20,422 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 6578048
> 2016-09-21 11:46:20,437 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 59242936
> 2016-09-21 11:46:20,608 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 4482224
> 2016-09-21 11:46:20,624 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 20144
> 2016-09-21 11:46:20,652 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 23568
> 2016-09-21 11:46:20,687 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 14395336
> 2016-09-21 11:46:20,706 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 14025824
>
> Attached are some metrics screenshots captured when the problem happened.
> All the metrics have gaps because the RegionServer was not able to handle
> /jmx requests at all.
>
> 1. One screenshot for MutateCount(Put Count) at table level
> http://img0.ph.126.net/coIWlwP2w44jnl1kU5CbcA==/6631704382933867248.png
> 2. One screenshot for RegionServer level GC Time
> http://img0.ph.126.net/ouciVyCvpqEkMPV3phnx5A==/6631661501980397103.png
> 3. One screenshot for Max Region MemStoreSize at table level
> http://img0.ph.126.net/rEQhCk-VWx3HGbcZCM51iA==/6631798940933860065.png
> 4. One screenshot for RegionCount for each RegionServer
> http://img0.ph.126.net/JKPA4-7NJEkFc3_TxOv63A==/6631679094166431135.png
>
> By the way, all the legends in the screenshots are ordered by max value, so
> the metric series with the max value shows up last.
>
> With all this information, what I can see so far is:
> 1. A lot of tables show a sudden Put increase after the RegionServer came
> back into service. My understanding is that the streaming data source could
> not process all the data while the problem was happening, so roughly twice
> the usual data accumulated and was written later; those tables do not seem
> to be the root cause
> 2. From the GC metrics, it is pretty easy to identify which region server
> had the problem, and all the logs above were collected from the problem server
> 3. All the region servers had a sudden region count increase and decrease
> except the dead one
> 4. From the Max MemStoreSize metric screenshot, I tend to think the table
> with the max MemStoreSize is the root cause, but the MutateCount metrics
> didn't show there is a peak time before and after the problem on that table,
> so I am not confident about this judgement
>
> I need help on:
> 1. Identifying which table caused the RegionServer death
> 2. Why the RegionCount had a sudden increase and decrease on all the servers
> that did not die
> 3. Any other metrics I missed that are important for debugging hotspotting
> cases
> 4. How I can get metrics when the RegionServer is not able to serve the JMX
> Restful service
>
> Thanks a lot in advance, and please let me know if you need any other
> information.
>
