For tuning G1GC, see the post at the head of https://blogs.apache.org/hbase/
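If you go the G1 route (after moving to Java 8, as Vlad notes below), a minimal sketch of what the switch looks like is below. The pause target and the use of HBASE_REGIONSERVER_OPTS are assumptions for illustration (on CDH you would normally set the same flags through the RegionServer Java options in Cloudera Manager), the heap size and GC log path are the ones from the command line further down in the thread, and the blog post above is the real guide for picking values:

    # Replace the ParNew/CMS flags with G1; keep GC logging so the change can be judged
    export HBASE_REGIONSERVER_OPTS="-Xms34359738368 -Xmx34359738368 \
      -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -Xloggc:/pang/log/hbase/gc-hbase.log"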
FYI

On Thu, Sep 22, 2016 at 10:30 AM, Vladimir Rodionov <vladrodio...@gmail.com> wrote:
> Your RS was declared dead because of a long GC.
>
> What you can do:
>
> 1. Tweak the CMS config: -XX:CMSInitiatingOccupancyFraction=50
> -XX:+UseCMSInitiatingOccupancyOnly (a sketch follows this message), and
> increase the heap size accordingly to make up for the smaller usable
> working set (with this setting CMS starts once 50% of the heap is
> occupied). If you see a lot of "promotion failure" messages in the GC
> log, this may help you.
>
> 2. Migrate to Java 1.8 and switch to G1GC. Search for "tuning G1GC for
> HBase" on the Internet.
>
> 3. Analyze what triggered the long GC in the RS. Usually it is a
> long-running M/R job or a data import. Consider less intrusive
> approaches (e.g. bulk loading the data).
>
> -Vlad
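To make Vlad's item 1 concrete, here is a sketch against the flags posted further down in the thread. The heap size and GC log path are copied from that command line, 50% is Vlad's threshold, and whether the 32 GB heap still leaves enough headroom afterwards is exactly the thing to verify in the GC log. Note the second flag is -XX:+UseCMSInitiatingOccupancyOnly; without it the JVM treats the fraction only as a hint for the first cycle:

    # CMS starts earlier (50% occupancy) and is forced to honour that threshold
    export HBASE_REGIONSERVER_OPTS="-Xms34359738368 -Xmx34359738368 \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly \
      -XX:+CMSParallelRemarkEnabled \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -Xloggc:/pang/log/hbase/gc-hbase.log"

On CDH these flags would go into the RegionServer Java options in Cloudera Manager rather than hbase-env.sh directly.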
> On Thu, Sep 22, 2016 at 10:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > bq. the MutateCount metrics didn't show there is a peak time before and
> > after the problem on that table
> >
> > In the graph there is a purple line with spikes - I assume this is not
> > the table with the max MemStoreSize. The legend in the graph is hard to
> > read. If the sizes of the Puts vary, that may explain what you observed
> > w.r.t. the two graphs.
> >
> > For question #2, were there region splits corresponding to the increase?
> > Please check the master log (a grep sketch follows this message). You
> > can pastebin the master log if needed.
> >
> > For #4, if JMX doesn't respond, one option is to check the region server
> > log - this is a workaround.
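A rough way to check Ted's question about splits from the master log is sketched below. The path assumes the master log follows the same CDH naming and directory layout as the RegionServer log quoted further down, and the grep is deliberately loose because the exact split messages differ between HBase versions:

    # Splits the master handled in the half hour around the 11:46 abort on 2016-09-21
    grep -i 'split' /pang/log/hbase/hbase-cmf-hbase-MASTER-*.log.out \
      | grep -E '2016-09-21 11:(3|4|5)'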
> > On Thu, Sep 22, 2016 at 9:39 AM, Proust(Proust/Feng Guizhou) [Saipan] <pf...@coupang.com> wrote:
> > >
> > > Hi, HBase Users
> > >
> > > I ran into a dead RegionServer and want to identify which HBase table
> > > actually caused the problem.
> > >
> > > Based on the HBase JMX REST service, what I have on hand is:
> > > 1. Table-level Grafana monitoring for Put, Get, Scan, MemStoreSize
> > > 2. RegionServer-level Grafana monitoring for GC and RegionCount
> > > 3. RegionServer logs
> > >
> > > *HRegionServer java process information*
> > > /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver
> > > -XX:OnOutOfMemoryError=kill -9 %p -Djava.net.preferIPv4Stack=true
> > > -Xms34359738368 -Xmx34359738368 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> > > -Xloggc:/pang/log/hbase/gc-hbase.log -XX:+UseGCLogFileRotation
> > > -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M
> > > -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
> > > -Dhbase.log.dir=/pang/log/hbase
> > > -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-fds-hadoop-prod06-mp.log.out
> > > -Dhbase.home.dir=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase
> > > -Dhbase.id.str= -Dhbase.root.logger=WARN,RFA
> > > -Djava.library.path=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hadoop/lib/native:/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase/lib/native/Linux-amd64-64
> > > -Dhbase.security.logger=INFO,RFAS
> > > org.apache.hadoop.hbase.regionserver.HRegionServer start
> > >
> > > *RegionServer logs show a GC pause*
> > > 2016-09-21 11:46:19,263 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582
> > > java.io.EOFException: Premature EOF: no length prefix available
> > >     at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
> > >     at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
> > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:971)
> > > 2016-09-21 11:46:19,274 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582 in pipeline DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-4226-be0b-1328075bbdd7,DISK], DatanodeInfoWithStorage[10.10.9.19:50010,DS-c06678d6-3f0d-40a2-82e2-62387e26af7b,DISK], DatanodeInfoWithStorage[10.10.9.21:50010,DS-7bf90b1e-7862-4cc1-a121-31768021a3aa,DISK]: bad datanode DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-4226-be0b-1328075bbdd7,DISK]
> > > 2016-09-21 11:46:19,266 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 152322ms
> > > GC pool 'ParNew' had collection(s): count=1 time=465ms
> > > GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=152238ms
> > > 2016-09-21 11:46:19,263 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 153423ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 156121ms for sessionid 0x255c4a5f76332f6, closing socket connection and attempting reconnect
> > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 153414ms for sessionid 0x355c4a5f3c2469f, closing socket connection and attempting reconnect
> > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 156121ms for sessionid 0x255c4a5f763332c, closing socket connection and attempting reconnect
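As an aside, pauses like the one above are easy to pull out of the logs after the fact. A sketch, using the log paths from the command line above; the first string is exactly what JvmPauseMonitor prints, and the second pair are the classic CMS failure modes Vlad alluded to (whether they appear depends on the JVM flags in use):

    # Every long pause the RegionServer itself noticed
    grep 'Detected pause in JVM or host machine' \
      /pang/log/hbase/hbase-cmf-hbase-REGIONSERVER-fds-hadoop-prod06-mp.log.out

    # CMS falling behind: promotion failures / concurrent mode failures in the GC log
    grep -E 'promotion failed|concurrent mode failure' /pang/log/hbase/gc-hbase.log*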
> > > *RegionServer logs show the server dead:*
> > > 2016-09-21 11:46:19,485 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107: *org.apache.hadoop.hbase.YouAreDeadException*: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as *dead server*
> > >     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
> > >     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
> > >     at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
> > >     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
> > >     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
> > >     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > >     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > >     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > >     at java.lang.Thread.run(Thread.java:745)
> > >
> > > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
> > >     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
> > >     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
> > >     at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
> > >     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
> > >     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
> > >     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > >     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > >     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > >     at java.lang.Thread.run(Thread.java:745)
> > >
> > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > >     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > >     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > >     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> > >     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
> > >     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:328)
> > >     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1092)
> > >     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:900)
> > >     at java.lang.Thread.run(Thread.java:745)
> > > Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
> > >     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
> > >     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
> > >     at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
> > >     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
> > >     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
> > >     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > >     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > >     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > >     at java.lang.Thread.run(Thread.java:745)
> > >
> > >     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
> > >     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
> > >     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
> > >     at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289)
> > >     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1090)
> > >     ... 2 more
> > > 2016-09-21 11:46:19,505 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.hadoop.hbase.regionserver.LocalIndexMerger, org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.SequenceRegionObserver]
> > > 2016-09-21 11:46:19,498 FATAL org.apache.hadoop.hbase.regionserver.LogRoller: Aborting
> > > java.io.IOException: cannot get log writer
> > >
> > > *Error level RegionServer logs when the problem happened*
> > > 2016-09-21 11:46:19,347 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > 2016-09-21 11:46:19,348 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > 2016-09-21 11:46:19,350 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > 2016-09-21 11:46:19,349 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > 2016-09-21 11:46:19,385 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > 2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > 2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > 2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > 2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > 2016-09-21 11:46:19,384 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > 2016-09-21 11:46:20,048 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 692752
> > > 2016-09-21 11:46:20,103 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1831544
> > > 2016-09-21 11:46:20,111 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 44967984
> > > 2016-09-21 11:46:20,131 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2334208
> > > 2016-09-21 11:46:20,308 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 4393560
> > > 2016-09-21 11:46:20,334 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2905432
> > > 2016-09-21 11:46:20,373 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2310808
> > > 2016-09-21 11:46:20,402 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1112648
> > > 2016-09-21 11:46:20,422 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 6578048
> > > 2016-09-21 11:46:20,437 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 59242936
> > > 2016-09-21 11:46:20,608 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 4482224
> > > 2016-09-21 11:46:20,624 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 20144
> > > 2016-09-21 11:46:20,652 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 23568
> > > 2016-09-21 11:46:20,687 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 14395336
> > > 2016-09-21 11:46:20,706 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 14025824
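One small thing the "Memstore size is" lines above can give you right away: they record how much un-flushed data each region still held when the server aborted, so summing and ranking them shows roughly where the write load sat at that moment. A sketch (same log path as above; tying the biggest value back to a table still means finding the surrounding log lines that name the region):

    grep 'Memstore size is' \
      /pang/log/hbase/hbase-cmf-hbase-REGIONSERVER-fds-hadoop-prod06-mp.log.out \
      | awk '{ sum += $NF; if ($NF+0 > max) max = $NF+0 }
             END { printf "regions=%d  total=%.1f MB  largest=%.1f MB\n",
                   NR, sum/1048576, max/1048576 }'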
> > > Attached are some metrics screenshots captured when the problem happened.
> > > All the metrics have gaps because the RegionServer was not able to handle /jmx requests at all.
> > >
> > > 1. One screenshot of MutateCount (Put count) at table level:
> > > http://img0.ph.126.net/coIWlwP2w44jnl1kU5CbcA==/6631704382933867248.png
> > > 2. One screenshot of RegionServer-level GC time:
> > > http://img0.ph.126.net/ouciVyCvpqEkMPV3phnx5A==/6631661501980397103.png
> > > 3. One screenshot of max region MemStoreSize at table level:
> > > http://img0.ph.126.net/rEQhCk-VWx3HGbcZCM51iA==/6631798940933860065.png
> > > 4. One screenshot of RegionCount for each RegionServer:
> > > http://img0.ph.126.net/JKPA4-7NJEkFc3_TxOv63A==/6631679094166431135.png
> > >
> > > By the way, all the legends in the screenshots are ordered by max value, so the series with the max value is shown last.
> > >
> > > With all this information, what I have so far is:
> > > 1. A lot of tables show a sudden Put increase after the RegionServer came back into service. My understanding is that the streaming data source could not push all its data while the problem was happening, so roughly twice the data arrived a little later; those tables do not seem to be the root cause.
> > > 2. From the GC metrics it is easy to identify which region server had the problem, and all the logs above were collected from that server.
> > > 3. All the region servers except the dead one show a sudden region count increase and then decrease.
> > > 4. From the max MemStoreSize screenshot I tend to think the table with the max MemStoreSize is the root cause, but the MutateCount metrics do not show a peak before or after the problem on that table, so I am not confident in this judgement.
> > >
> > > I need help with:
> > > 1. Identifying which table caused the RegionServer death
> > > 2. Why the RegionCount shows a sudden increase and then decrease on all the servers that did not die
> > > 3. Any other metrics I am missing that are important for debugging hotspotting cases
> > > 4. How I can get metrics when the RegionServer is not able to serve the JMX REST service
> > >
> > > Thanks a lot in advance, and please let me know if you need any other information.
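On question #4: once the JVM is stuck in a long full GC the /jmx servlet will not answer, so there is little to collect live; the most you can do is bound the scrape and keep the payload small so one stuck server does not stall the whole collection. A sketch follows (60030 is the default RegionServer info port; the qry parameter is supported by the Hadoop JMX servlet, but the exact bean name below is an assumption, so check it against a full /jmx dump from a healthy server). When even that times out, Ted's workaround is the real answer: the RegionServer log and the GC log are the only record of what happened during the pause.

    # Bounded scrape of the full JMX dump
    curl --max-time 5 "http://fds-hadoop-prod06-mp:60030/jmx"

    # Narrow the payload to one bean (bean name is an assumption; verify it first)
    curl --max-time 5 \
      "http://fds-hadoop-prod06-mp:60030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Regions"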