Thanks Ted. I uploaded another log:
https://github.com/eswidy/waterspider/tree/master/rscase/rs-more.log
Following your advice, I increased the tickTime and it works well at present.
Maybe the problem was caused by bad I/O: I found the CPU I/O idle was always more than 70% during the heavy load. But would that make the JVM pause?
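For reference, a minimal sketch of the bound Ted describes in the message quoted below. It assumes ZooKeeper's default limits (minSessionTimeout = 2 x tickTime, maxSessionTimeout = 20 x tickTime) and that neither is overridden in zoo.cfg. The function names are hypothetical and only illustrate the arithmetic; this is not ZooKeeper code.

```python
import math


def negotiated_session_timeout(requested_ms, tick_time_ms):
    """Clamp a requested session timeout to ZooKeeper's default bounds.

    Assumption: minSessionTimeout and maxSessionTimeout are not set in
    zoo.cfg, so the defaults of 2 * tickTime and 20 * tickTime apply.
    """
    lower = 2 * tick_time_ms
    upper = 20 * tick_time_ms
    return max(lower, min(requested_ms, upper))


def min_tick_time_for(requested_ms):
    """Smallest tickTime (ms) whose default upper bound admits the request."""
    return math.ceil(requested_ms / 20)


# Values from this thread: zookeeper.session.timeout=90000, tickTime=2000.
print(negotiated_session_timeout(90000, 2000))  # effective timeout in ms
print(min_tick_time_for(90000))                 # tickTime needed for 90000 ms
```

Under these assumptions the 90000 ms requested by HBase is capped at 40000 ms while tickTime=2000, and a tickTime of at least 4500 is needed for the full 90000 ms to take effect.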
------------------ Original ------------------
From: "Ted Yu";<yuzhih...@gmail.com>;
Send time: Thursday, Oct 20, 2016 10:27 AM
To: "user@hbase.apache.org"<user@hbase.apache.org>;
Subject: Re: HBase resgionServer crashed with no gc detected

Your zookeeper.session.timeout is set as 90000 but tickTime=2000.
The max timeout is bounded by 20 times tickTime.

Please increase the tickTime in zoo.cfg

I don't see region server log prior to 18:14:14,928

On Wed, Oct 19, 2016 at 7:13 PM, who.cat <who....@qq.com> wrote:

> ok.i have posted the more detail RS,Gc log and the ZK ,HBase config,
> https://github.com/eswidy/waterspider/tree/master/rscase
> Thanks
>
> ------------------ Original ------------------
> From: "Ted Yu";<yuzhih...@gmail.com>;
> Date: Oct 20, 2016
> To: "user@hbase.apache.org"<user@hbase.apache.org>;
>
> Subject: Re: HBase resgionServer crashed with no gc detected
>
> There was one 25 second pause before the abort.
>
> Can you pastebin your hbase-site.xml (and zookeeper configs) ?
>
> Do you have more of the region server log (prior to 18:14:14,928) ?
>
> Thanks
>
> On Wed, Oct 19, 2016 at 6:01 PM, who.cat <who....@qq.com> wrote:
>
> > i've upload the file to git hub ,and the url is :
> > https://github.com/eswidy/waterspider/blob/master/regionServer.log
> >
> > thanks so much.
> >
> > ------------------ Original ------------------
> > From: "Ted Yu";<yuzhih...@gmail.com>;
> > Date: Oct 19, 2016
> > To: "user@hbase.apache.org"<user@hbase.apache.org>;
> >
> > Subject: Re: HBase resgionServer crashed with no gc detected
> >
> > The log file was not delivered by the mailing list.
> >
> > Consider using pastebin or third party site.
> >
> > On Tue, Oct 18, 2016 at 10:38 PM, who.cat <who....@qq.com> wrote:
> >
> > > thanks fyi.Yes,i did not turn the debug and try it now .I also doubt the
> > > heavy cpu load caused ,then checked cpu highest Utilization is 60%(Cpu
> > > user )
> > > My region server gc parameter is :export SERVER_GC_OPTS="-verbose:gc
> > > -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:{{log_dir}}/gc.log-`date
> > > +'%Y%m%d%H%M'`"
> > > The 10/12 log was rolled .i got the same crash log yesterday(10/18).
> > > Details in the attachment 'regionServer.log', and the JVM pause at
> > > "2016-10-17 18:44:07,232" in line 82 .
> > > Thanks so much.
> > >
> > > ------------------ Original ------------------
> > > From: "Ted Yu";<yuzhih...@gmail.com>;
> > > Send time: Wednesday, Oct 19, 2016 11:17 AM
> > > To: "user@hbase.apache.org"<user@hbase.apache.org>;
> > > Subject: Re: HBase resgionServer crashed with no gc detected
> > >
> > > Can you show more of the region server log prior to 23:48:13 (including the
> > > pause) ?
> > >
> > > Was the region server under heavy load during the pause ?
> > >
> > > Consider turning on DEBUG logging if you haven't.
> > >
> > > Please also share GC parameters.
> > >
> > > Thanks
> > >
> > > On Tue, Oct 18, 2016 at 7:58 PM, who.cat <who....@qq.com> wrote:
> > >
> > > > Hi all:
> > > > I've a HDP big data cluster with 4 nodes and create by Ambari the HBase
> > > > is 1.1.2.
> > > > As running YCSB for benchmark the RegionServer instance or the Hmaster
> > > > instance crashes which it's logs shows:
> > > >
> > > > ---------------------log start ---------------------
> > > > 2016-10-12 23:48:13,591 INFO [main-SendThread(Node1:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x157b7f5f0bc0005, likely server has closed socket, closing socket connection and attempting reconnect
> > > > 2016-10-12 23:48:13,595 INFO [HBase-Metrics2-1] impl.MetricsSinkAdapter: Sink timeline started
> > > > 2016-10-12 23:48:13,606 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
> > > > 2016-10-12 23:48:13,606 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
> > > > 2016-10-12 23:48:14,496 INFO [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Opening socket connection to server Node4/1.1.6.104:2181. Will not attempt to authenticate using SASL (unknown error)
> > > > 2016-10-12 23:48:14,506 INFO [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Socket connection established to Node4/1.17.6.104:2181, initiating session
> > > > 2016-10-12 23:48:14,517 INFO [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x157b7f5f0bc0005 has expired, closing socket connection
> > > > 2016-10-12 23:48:14,517 FATAL [main-EventThread] regionserver.HRegionServer: ABORTING region server node1,16020,1476260847716: regionserver:16020-0x157b7f5f0bc0005, quorum=node2:2181,node1:2181,node4:2181, baseZNode=/hbase-unsecure regionserver:16020-0x157b7f5f0bc0005 received expired from ZooKeeper, aborting
> > > > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> > > > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
> > > > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
> > > > at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
> > > > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> > > > 2016-10-12 23:48:14,518 FATAL [main-EventThread] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint]
> > > > ---------------------log end---------------------
> > > >
> > > > After checked the log ,it shows that the region server jvm paused a long
> > > > time and the zkclient cannot send heartbeats, the session times out Which
> > > > the 'reference guide' had descripted
> > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired .So a read
> > > > the log detail and to find the java GC event but there's no full gc occurred.
> > > > And more a found the same symptom in the DataNode instance .
> > > >
> > > > The node os is Centos7 maybe the kernel futex bug ,after checking the
> > > > bug was fixed in my OS .
> > > > There's any other factor caused the problem except java GC?
> > > > Anyone who got the same problem ? Any ideas ?
> > > > Thank you .