OK. I have posted the more detailed RS and GC logs and the ZK and HBase configs: https://github.com/eswidy/waterspider/tree/master/rscase
Thanks
------------------ Original ------------------
From: "Ted Yu";<yuzhih...@gmail.com>;
Date: Oct 20, 2016
To: "user@hbase.apache.org"<user@hbase.apache.org>;
Subject: Re: HBase resgionServer crashed with no gc detected

There was one 25 second pause before the abort.

Can you pastebin your hbase-site.xml (and zookeeper configs)?

Do you have more of the region server log (prior to 18:14:14,928)?

Thanks

On Wed, Oct 19, 2016 at 6:01 PM, who.cat <who....@qq.com> wrote:

> I've uploaded the file to GitHub; the URL is:
> https://github.com/eswidy/waterspider/blob/master/regionServer.log
>
> Thanks so much.
>
> ------------------ Original ------------------
> From: "Ted Yu";<yuzhih...@gmail.com>;
> Date: Oct 19, 2016
> To: "user@hbase.apache.org"<user@hbase.apache.org>;
> Subject: Re: HBase resgionServer crashed with no gc detected
>
> The log file was not delivered by the mailing list.
>
> Consider using pastebin or third party site.
>
> On Tue, Oct 18, 2016 at 10:38 PM, who.cat <who....@qq.com> wrote:
>
> > Thanks for the info. Yes, I had not turned on DEBUG logging; I am trying it now.
> > I also suspected heavy CPU load as the cause, but when I checked, the highest
> > CPU utilization was 60% (CPU user).
> > My region server GC parameters are:
> > export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:{{log_dir}}/gc.log-`date +'%Y%m%d%H%M'`"
> > The 10/12 log has been rolled. I got the same crash log yesterday (10/18).
> > Details are in the attachment 'regionServer.log'; the JVM pause is at
> > "2016-10-17 18:44:07,232", on line 82.
> > Thanks so much.
> >
> > ------------------ Original ------------------
> > From: "Ted Yu";<yuzhih...@gmail.com>;
> > Date: Oct 19, 2016, 11:17
> > To: "user@hbase.apache.org"<user@hbase.apache.org>;
> > Subject: Re: HBase resgionServer crashed with no gc detected
> >
> > Can you show more of the region server log prior to 23:48:13 (including the pause)?
> >
> > Was the region server under heavy load during the pause?
> >
> > Consider turning on DEBUG logging if you haven't.
> >
> > Please also share GC parameters.
> >
> > Thanks
> >
> > On Tue, Oct 18, 2016 at 7:58 PM, who.cat <who....@qq.com> wrote:
> >
> > > Hi all:
> > > I have an HDP big data cluster with 4 nodes, created by Ambari; the HBase
> > > version is 1.1.2.
> > > While running YCSB as a benchmark, the RegionServer instance or the HMaster
> > > instance crashes, and its log shows:
> > >
> > > ---------------------log start ---------------------
> > > 2016-10-12 23:48:13,591 INFO [main-SendThread(Node1:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x157b7f5f0bc0005, likely server has closed socket, closing socket connection and attempting reconnect
> > > 2016-10-12 23:48:13,595 INFO [HBase-Metrics2-1] impl.MetricsSinkAdapter: Sink timeline started
> > > 2016-10-12 23:48:13,606 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
> > > 2016-10-12 23:48:13,606 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
> > > 2016-10-12 23:48:14,496 INFO [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Opening socket connection to server Node4/1.1.6.104:2181. Will not attempt to authenticate using SASL (unknown error)
> > > 2016-10-12 23:48:14,506 INFO [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Socket connection established to Node4/1.17.6.104:2181, initiating session
> > > 2016-10-12 23:48:14,517 INFO [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x157b7f5f0bc0005 has expired, closing socket connection
> > > 2016-10-12 23:48:14,517 FATAL [main-EventThread] regionserver.HRegionServer: ABORTING region server node1,16020,1476260847716: regionserver:16020-0x157b7f5f0bc0005, quorum=node2:2181,node1:2181,node4:2181, baseZNode=/hbase-unsecure regionserver:16020-0x157b7f5f0bc0005 received expired from ZooKeeper, aborting
> > > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> > >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
> > >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
> > >     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
> > >     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> > > 2016-10-12 23:48:14,518 FATAL [main-EventThread] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint]
> > > ---------------------log end---------------------
> > >
> > > After checking the log, it appears that the region server JVM paused for a
> > > long time, so the ZK client could not send heartbeats and the session timed
> > > out, as the reference guide describes:
> > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > So I read the logs in detail looking for a Java GC event, but no full GC
> > > occurred.
> > > I also found the same symptom in the DataNode instance.
> > >
> > > The node OS is CentOS 7, so I suspected the kernel futex bug, but after
> > > checking, that bug is already fixed in my OS.
> > > Is there any other factor besides Java GC that could cause this problem?
> > > Has anyone seen the same problem? Any ideas?
> > > Thank you.
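
A rough way to look for pauses like the one discussed in this thread is to scan the region server log for long gaps between consecutive timestamps. The sketch below is in Python and assumes only the timestamp prefix visible in the log excerpt above (for example "2016-10-12 23:48:13,591"). The names TS_RE and find_gaps and the 10-second default threshold are hypothetical, chosen for illustration. A silent stretch in the log is only a hint: an idle server also logs nothing, so any gap found this way still has to be checked against the GC log and system metrics.

```python
import re
from datetime import datetime

# Timestamp prefix as seen in the log excerpt, e.g. "2016-10-12 23:48:13,591".
# Assumption: every regular log line starts with this prefix.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_gaps(lines, threshold_seconds=10.0):
    """Return (previous_ts, current_ts, gap_seconds) for every gap between
    consecutive timestamped lines that exceeds threshold_seconds."""
    gaps = []
    prev = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            # Stack-trace and continuation lines carry no timestamp; skip them.
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_seconds:
                gaps.append((prev, ts, gap))
        prev = ts
    return gaps
```

To use it, open the log file and pass the file object as the lines argument, for example find_gaps(open("regionServer.log")), then print each returned tuple. Lowering the threshold surfaces shorter stalls at the cost of more false positives from ordinary quiet periods.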