As a side note, I never changed the logger level from the default Cloudera installation. Is there a performance hit for running at INFO or DEBUG? What level do most people suggest?

Thanks

On 8/24/11 5:19 PM, Mark wrote:
After running some hefty jobs on our cluster, I noticed that 3 of our 5 HBase region servers had been killed. First, when this happens and only 2 servers remain, is there a possibility of data corruption or loss? Second, and more importantly, why does this happen and how can I resolve it?

Thanks!

Here is the relevant part of my log:

2011-08-24 15:08:34,989 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.66 MB, free=790.02 MB, max=796.67 MB, blocks=22, accesses=84215, hits=84188, hitRatio=99.96%%, cachingAccesses=84189, cachingHits=84167, cachingHitsRatio=99.97%%, evictions=0, evicted=0, evictedPerRun=NaN
2011-08-24 15:12:03,348 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
2011-08-24 15:13:34,989 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.66 MB, free=790.02 MB, max=796.67 MB, blocks=22, accesses=84215, hits=84188, hitRatio=99.96%%, cachingAccesses=84189, cachingHits=84167, cachingHitsRatio=99.97%%, evictions=0, evicted=0, evictedPerRun=NaN
2011-08-24 15:18:34,990 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.66 MB, free=790.02 MB, max=796.67 MB, blocks=22, accesses=84215, hits=84188, hitRatio=99.96%%, cachingAccesses=84189, cachingHits=84167, cachingHitsRatio=99.97%%, evictions=0, evicted=0, evictedPerRun=NaN
2011-08-24 15:20:47,202 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 26666ms for sessionid 0x131ec6ce0b00004, closing socket connection and attempting reconnect
2011-08-24 15:20:48,929 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop-master.ioffer.com/10.101.101.0:2181
2011-08-24 15:20:57,463 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 26666ms for sessionid 0x131ec6ce0b00003, closing socket connection and attempting reconnect
2011-08-24 15:20:59,156 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop-master.ioffer.com/10.101.101.0:2181
2011-08-24 15:21:09,961 WARN org.apache.zookeeper.ClientCnxn: Session 0x131ec6ce0b00004 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2011-08-24 15:21:11,415 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop-master.ioffer.com/10.101.101.0:2181
2011-08-24 15:21:11,416 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop-master.ioffer.com/10.101.101.0:2181, initiating session
2011-08-24 15:21:11,445 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x131ec6ce0b00004 has expired, closing socket connection
2011-08-24 15:21:11,452 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hadoop1.ioffer.com,60020,1313931812841, load=(requests=246, regions=2, usedHeap=43, maxHeap=3983): regionserver:60020-0x131ec6ce0b00004 regionserver:60020-0x131ec6ce0b00004 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
2011-08-24 15:21:11,466 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requests=82, regions=2, stores=2, storefiles=1, storefileIndexSize=0, memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=42, maxHeap=3983, blockCacheSize=6980720, blockCacheFree=828393552, blockCacheCount=22, blockCacheHitCount=84188, blockCacheMissCount=27, blockCacheEvictedCount=0, blockCacheHitRatio=99, blockCacheHitCachingRatio=99
2011-08-24 15:21:11,467 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: regionserver:60020-0x131ec6ce0b00004 regionserver:60020-0x131ec6ce0b00004 received expired from ZooKeeper, aborting
2011-08-24 15:21:11,467 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2011-08-24 15:21:11,570 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-master.ioffer.com/10.101.101.0:9000. Already tried 0 time(s).
2011-08-24 15:21:13,516 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2011-08-24 15:21:17,193 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
2011-08-24 15:21:18,727 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: regionserver60020.cacheFlusher exiting
2011-08-24 15:21:20,157 WARN org.apache.zookeeper.ClientCnxn: Session 0x131ec6ce0b00003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2011-08-24 15:21:21,919 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop-master.ioffer.com/10.101.101.0:2181
2011-08-24 15:21:21,920 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop-master.ioffer.com/10.101.101.0:2181, initiating session
2011-08-24 15:21:21,921 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x131ec6ce0b00003 has expired, closing socket connection
2011-08-24 15:21:21,921 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, trying to reconnect.
2011-08-24 15:21:21,921 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Trying to reconnect to zookeeper
2011-08-24 15:21:21,923 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop-master.ioffer.com:2181 sessionTimeout=180000 watcher=hconnection
2011-08-24 15:21:21,923 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop-master.ioffer.com/10.101.101.0:2181
2011-08-24 15:21:21,926 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop-master.ioffer.com/10.101.101.0:2181, initiating session
2011-08-24 15:21:21,935 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop-master.ioffer.com/10.101.101.0:2181, sessionid = 0x131ec6ce0b000cd, negotiated timeout = 40000
2011-08-24 15:21:21,939 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Reconnected successfully. This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment.
2011-08-24 15:21:21,939 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2011-08-24 15:21:27,210 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hadoop1.ioffer.com,60020,1313931812841, load=(requests=246, regions=2, usedHeap=43, maxHeap=3983): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hadoop1.ioffer.com,60020,1313931812841 as dead server
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hadoop1.ioffer.com,60020,1313931812841 as dead server
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:733)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:594)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hadoop1.ioffer.com,60020,1313931812841 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:201)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:259)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:641)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)

    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
    at $Proxy5.regionServerReport(Unknown Source)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:727)
    ... 2 more
2011-08-24 15:21:27,211 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requests=82, regions=2, stores=2, storefiles=1, storefileIndexSize=0, memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=40, maxHeap=3983, blockCacheSize=6980720, blockCacheFree=828393552, blockCacheCount=22, blockCacheHitCount=84188, blockCacheMissCount=27, blockCacheEvictedCount=0, blockCacheHitRatio=99, blockCacheHitCachingRatio=99
2011-08-24 15:21:27,211 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hadoop1.ioffer.com,60020,1313931812841 as dead server
2011-08-24 15:21:27,211 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
2011-08-24 15:21:27,211 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020
2011-08-24 15:21:27,212 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020: exiting
2011-08-24 15:21:27,212 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020: exiting
2011-08-24 15:21:27,212 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: exiting
2011-08-24 15:21:27,212 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on 60020: exiting
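
In case it helps anyone triage a similar log, here is a rough Python sketch that pulls out the two kinds of events that matter in the excerpt above: the ZooKeeper "Client session timed out" entries (with the reported gap in ms) and the FATAL aborts. The function names (parse_entries, summarize) are my own, and the regular expression assumes the log4j layout shown in this log (timestamp, level, logger, then the message), so treat it as a starting point rather than a general parser.

```python
import re
from datetime import datetime

# Start of a log entry in the layout shown above, e.g.
# "2011-08-24 15:20:47,202 INFO org.apache.zookeeper.ClientCnxn: message"
# (assumption: other installations may use a different log4j pattern).
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+) ([\w.$]+): (.*)$"
)
# The ZooKeeper client reports how long it went without hearing from the server.
TIMEOUT = re.compile(r"have not heard from server in (\d+)ms")


def parse_entries(lines):
    """Yield (timestamp, level, logger, message) for each entry line.

    Stack-trace frames and other continuation lines do not match ENTRY
    and are skipped.
    """
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, m.group(2), m.group(3), m.group(4)


def summarize(lines):
    """Return (timeouts, fatals).

    timeouts: list of (timestamp, gap_in_ms) for session-timeout entries.
    fatals:   list of (timestamp, message) for FATAL entries.
    """
    timeouts, fatals = [], []
    for ts, level, _logger, msg in parse_entries(lines):
        gap = TIMEOUT.search(msg)
        if gap:
            timeouts.append((ts, int(gap.group(1))))
        if level == "FATAL":
            fatals.append((ts, msg))
    return timeouts, fatals


if __name__ == "__main__":
    import sys

    # Usage: python scan_log.py /path/to/regionserver.log
    with open(sys.argv[1]) as fh:
        timeouts, fatals = summarize(fh)
    for ts, gap_ms in timeouts:
        print(f"{ts}  ZooKeeper silent for {gap_ms} ms")
    for ts, msg in fatals:
        print(f"{ts}  FATAL {msg[:100]}")
```

Comparing the timestamps it prints against GC logs or network monitoring for the same window is one way to check the "network partition or a long-running GC pause" hint that the log itself gives.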
