On Sun, Apr 28, 2013 at 7:37 PM, ramkrishna vasudevan < ramkrishna.s.vasude...@gmail.com> wrote:
> So you mean that when the handler count is more than 5k this happens when
> it is lesser this does not. Have you repeated this behaviour?
>
> What i doubt is when you say bouncing around different states i feel may be
> the ROOT assignment was a problem and something messed up there.
> If the reason was due to handler count then that needs different analysis.
>
> I think that if you can repeat the experiment and get the same behaviour,
> you can share the logs so that we can ascertain the exact problem.
>

Yeah, I have repeated the behavior. But the issue seems to be due to some weird pauses in the region server whenever I bump up the region handler count (logs are below). I doubt the issue is GC: this is happening on startup with a 48GB heap, so a collection should not take such a long time. There are no active clients either.

I can safely say this is due to bumping up the region handler count, because I started 3 regionservers with 5000 handlers and 3 with 15000 handlers. The ones with 15000 spun up all IPC handlers and then errored out as shown in the logs below. These are just the logs around the error; before the error there were a few more timeouts.

I checked the ZooKeeper servers (I have a 3-node cluster). None of them GC'd around the same time, nor were they under any kind of load.

Thanks,
Viral

Region Server Logs

2013-04-29 08:00:55,512 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=98.34 MB, free=11.61 GB, max=11.71 GB, blocks=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=0, evicted=0, evictedPerRun=NaN
2013-04-29 08:02:35,674 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 40592ms for sessionid 0x703e48a8cfd81be6, closing socket connection and attempting reconnect
2013-04-29 08:02:36,286 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.152.152.84:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2013-04-29 08:02:36,287 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.152.152.84:2181, initiating session
2013-04-29 08:02:36,288 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x703e48a8cfd81be6 has expired, closing socket connection
2013-04-29 08:03:16,287 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server <hostname>,60020,1367221255417: regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6 regionserver:60020-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6-0x703e48a8cfd81be6 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:389)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:286)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2013-04-29 08:03:16,288 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server <hostname>,60020,1367221255417: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing <hostname>,60020,1367221255417 as dead server
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing <hostname>,60020,1367221255417 as dead server
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:880)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:748)
    at java.lang.Thread.run(Thread.java:662)
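
For anyone who wants to compare the 5000-handler and 15000-handler servers, here is a rough sketch of a script that scans a region server log for the two symptoms visible in the excerpt: long gaps between consecutive timestamped lines, and the ZooKeeper client's "have not heard from server in Nms" message. The function name `find_pauses` and the 30-second default threshold are my own assumptions, and the timestamp layout is taken only from the lines quoted in this mail. A gap between log lines is a hint rather than proof of a pause, since an idle server also logs very little.

```python
import re
import sys
from datetime import datetime

# Leading timestamp of a log line as it appears in the excerpt,
# e.g. "2013-04-29 08:02:35,674".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")
# The ZooKeeper client message quoted in the excerpt.
ZK_RE = re.compile(r"have not heard from server in (\d+)ms")


def find_pauses(lines, gap_threshold_s=30.0):
    """Return (gaps, zk_timeouts) found in an iterable of log lines.

    gaps:        list of (previous_ts, current_ts, seconds) for every pair of
                 consecutive timestamped lines further apart than the threshold.
    zk_timeouts: list of (ts, milliseconds) for each ZooKeeper timeout message.
    The threshold is a hypothetical default, not a value from HBase.
    """
    gaps, zk_timeouts = [], []
    prev = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # stack frames and exception lines carry no timestamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None:
            delta = (ts - prev).total_seconds()
            if delta > gap_threshold_s:
                gaps.append((prev, ts, delta))
        zk = ZK_RE.search(line)
        if zk:
            zk_timeouts.append((ts, int(zk.group(1))))
        prev = ts
    return gaps, zk_timeouts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_pauses.py <path to region server log>
    with open(sys.argv[1], errors="replace") as fh:
        found_gaps, found_timeouts = find_pauses(fh)
    for start, end, seconds in found_gaps:
        print(f"gap of {seconds:.1f}s between {start} and {end}")
    for ts, ms in found_timeouts:
        print(f"{ts}: ZooKeeper client silent for {ms}ms")
```

Running it against the logs of both groups of servers would show whether the long silent stretches appear only on the 15000-handler machines, and how they line up with the session timeouts.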