I noticed 8 occurrences of 0x703e... following the region server name in the abort message. I wonder why it is repeated.
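One possible cause, offered as a guess and not checked against the HBase source: suppose the ZooKeeper watcher appends "-0x" plus the session ID to its identifier on every SyncConnected event and never checks whether that suffix is already there. Then each reconnect of the same session would add one more copy. The earlier timeouts Viral mentions could account for the extra copies. The minimal sketch below models that assumed behavior. The class and method names (IdentifierGrowth, onSyncConnected, suffixCount) are hypothetical and are not HBase APIs.

```java
// Hypothetical model, not HBase code: a watcher whose identifier gains a
// "-0x<sessionId>" suffix every time its ZooKeeper session (re)connects.
public class IdentifierGrowth {
    private String identifier;

    public IdentifierGrowth(String base) {
        this.identifier = base;
    }

    // Assumed behavior: append the hex session ID on each SyncConnected event,
    // without checking whether it is already present.
    public void onSyncConnected(long sessionId) {
        identifier = identifier + "-0x" + Long.toHexString(sessionId);
    }

    public String identifier() {
        return identifier;
    }

    // Counts how many "-0x" suffixes the identifier has accumulated.
    public static int suffixCount(String id) {
        return id.split("-0x", -1).length - 1;
    }

    public static void main(String[] args) {
        IdentifierGrowth watcher = new IdentifierGrowth("regionserver:60020");
        long sessionId = 0x703e48a8cfd81be6L;
        // One initial connect plus seven reconnects of the same session.
        for (int i = 0; i < 8; i++) {
            watcher.onSyncConnected(sessionId);
        }
        System.out.println(suffixCount(watcher.identifier())); // prints 8
        System.out.println(watcher.identifier());
    }
}
```

If the guess holds, the number of copies in the abort message roughly tracks how many times the session reconnected before it finally expired.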
Cheers

On Apr 29, 2013, at 2:17 AM, Viral Bajaria <viral.baja...@gmail.com> wrote:

> On Sun, Apr 28, 2013 at 7:37 PM, ramkrishna vasudevan <
> ramkrishna.s.vasude...@gmail.com> wrote:
>
>> So you mean that when the handler count is more than 5k this happens, and
>> when it is lower it does not. Have you repeated this behaviour?
>
>> What I doubt is, when you say bouncing around different states, I feel maybe
>> the ROOT assignment was a problem and something messed up there.
>> If the reason was due to handler count then that needs different analysis.
>>
>> I think that if you can repeat the experiment and get the same behaviour,
>> you can share the logs so that we can ascertain the exact problem.
>
> Yeah I have repeated the behavior. But it seems the issue is due to some
> weird pauses in the region server whenever I bump up the region handler
> count (logs are below). I doubt the issue is GC, since it should not take
> such a long time because this is happening on startup with 48GB heap size.
> There are no active clients either.
>
> I can safely say this is due to bumping up the region handler count, because
> I started 3 regionservers with 5000 handlers and 3 with 15000 handlers.
> The ones with 15000 spun up all IPC handlers and then errored out as shown
> in the logs below. These are just the logs around the error. Before the
> error there were a few more timeouts.
>
> I checked the zookeeper servers (I have a 3-node cluster) and they did not
> GC around the same time, nor were they under any kind of load.
>
> Thanks,
> Viral
>
> Region Server Logs
> 2013-04-29 08:00:55,512 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB, free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0, evicted=0, evictedPerRun=NaN
> 2013-04-29 08:02:35,674 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 40592ms for sessionid 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
> 2013-04-29 08:02:36,286 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.152.152.84:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
> 2013-04-29 08:02:36,287 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.152.152.84:2181, initiating session
> 2013-04-29 08:02:36,288 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x703e48a8cfd81be6 has expired, closing socket connection
> 2013-04-29 08:03:16,287 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server <hostname>,60020,1367221255417:
> regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
> regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:389)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:286)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> 2013-04-29 08:03:16,288 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server <hostname>,60020,1367221255417: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing <hostname>,60020,1367221255417 as dead server
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing <hostname>,60020,1367221255417 as dead server
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:880)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:748)
> at java.lang.Thread.run(Thread.java:662)