As a side note, I obviously never changed the logger level from the
default Cloudera installation. Is there a performance hit for running at
INFO or DEBUG? What do most people suggest?
Thanks
On 8/24/11 5:19 PM, Mark wrote:
I noticed that after running some hefty jobs on our cluster, 3 out
of 5 of our HBase region servers were killed. First off, when this
happens and only 2 servers are left, is there a possibility of data
corruption and/or loss? Secondly, and more importantly, why does this
happen and how can I resolve it?
Thanks!
Here is the relevant part of my log:
2011-08-24 15:08:34,989 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.66
MB, free=790.02 MB, max=796.67 MB, blocks=22, accesses=84215,
hits=84188, hitRatio=99.96%%, cachingAccesses=84189,
cachingHits=84167, cachingHitsRatio=99.97%%, evictions=0, evicted=0,
evictedPerRun=NaN
2011-08-24 15:12:03,348 DEBUG
org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period
3600000ms elapsed
2011-08-24 15:13:34,989 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.66
MB, free=790.02 MB, max=796.67 MB, blocks=22, accesses=84215,
hits=84188, hitRatio=99.96%%, cachingAccesses=84189,
cachingHits=84167, cachingHitsRatio=99.97%%, evictions=0, evicted=0,
evictedPerRun=NaN
2011-08-24 15:18:34,990 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.66
MB, free=790.02 MB, max=796.67 MB, blocks=22, accesses=84215,
hits=84188, hitRatio=99.96%%, cachingAccesses=84189,
cachingHits=84167, cachingHitsRatio=99.97%%, evictions=0, evicted=0,
evictedPerRun=NaN
2011-08-24 15:20:47,202 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 26666ms for sessionid
0x131ec6ce0b00004, closing socket connection and attempting reconnect
2011-08-24 15:20:48,929 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hadoop-master.ioffer.com/10.101.101.0:2181
2011-08-24 15:20:57,463 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 26666ms for sessionid
0x131ec6ce0b00003, closing socket connection and attempting reconnect
2011-08-24 15:20:59,156 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hadoop-master.ioffer.com/10.101.101.0:2181
2011-08-24 15:21:09,961 WARN org.apache.zookeeper.ClientCnxn: Session
0x131ec6ce0b00004 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2011-08-24 15:21:11,415 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hadoop-master.ioffer.com/10.101.101.0:2181
2011-08-24 15:21:11,416 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hadoop-master.ioffer.com/10.101.101.0:2181,
initiating session
2011-08-24 15:21:11,445 INFO org.apache.zookeeper.ClientCnxn: Unable
to reconnect to ZooKeeper service, session 0x131ec6ce0b00004 has
expired, closing socket connection
2011-08-24 15:21:11,452 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
server serverName=hadoop1.ioffer.com,60020,1313931812841,
load=(requests=246, regions=2, usedHeap=43, maxHeap=3983):
regionserver:60020-0x131ec6ce0b00004
regionserver:60020-0x131ec6ce0b00004 received expired from ZooKeeper,
aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343)
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
2011-08-24 15:21:11,466 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
requests=82, regions=2, stores=2, storefiles=1, storefileIndexSize=0,
memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=42,
maxHeap=3983, blockCacheSize=6980720, blockCacheFree=828393552,
blockCacheCount=22, blockCacheHitCount=84188, blockCacheMissCount=27,
blockCacheEvictedCount=0, blockCacheHitRatio=99,
blockCacheHitCachingRatio=99
2011-08-24 15:21:11,467 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
regionserver:60020-0x131ec6ce0b00004
regionserver:60020-0x131ec6ce0b00004 received expired from ZooKeeper,
aborting
2011-08-24 15:21:11,467 INFO org.apache.zookeeper.ClientCnxn:
EventThread shut down
2011-08-24 15:21:11,570 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hadoop-master.ioffer.com/10.101.101.0:9000. Already
tried 0 time(s).
2011-08-24 15:21:13,516 INFO
org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2011-08-24 15:21:17,193 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread:
regionserver60020.compactor exiting
2011-08-24 15:21:18,727 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
regionserver60020.cacheFlusher exiting
2011-08-24 15:21:20,157 WARN org.apache.zookeeper.ClientCnxn: Session
0x131ec6ce0b00003 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2011-08-24 15:21:21,919 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hadoop-master.ioffer.com/10.101.101.0:2181
2011-08-24 15:21:21,920 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hadoop-master.ioffer.com/10.101.101.0:2181,
initiating session
2011-08-24 15:21:21,921 INFO org.apache.zookeeper.ClientCnxn: Unable
to reconnect to ZooKeeper service, session 0x131ec6ce0b00003 has
expired, closing socket connection
2011-08-24 15:21:21,921 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
This client just lost it's session with ZooKeeper, trying to reconnect.
2011-08-24 15:21:21,921 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Trying to reconnect to zookeeper
2011-08-24 15:21:21,923 INFO org.apache.zookeeper.ZooKeeper:
Initiating client connection,
connectString=hadoop-master.ioffer.com:2181 sessionTimeout=180000
watcher=hconnection
2011-08-24 15:21:21,923 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hadoop-master.ioffer.com/10.101.101.0:2181
2011-08-24 15:21:21,926 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hadoop-master.ioffer.com/10.101.101.0:2181,
initiating session
2011-08-24 15:21:21,935 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server
hadoop-master.ioffer.com/10.101.101.0:2181, sessionid =
0x131ec6ce0b000cd, negotiated timeout = 40000
2011-08-24 15:21:21,939 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Reconnected successfully. This disconnect could have been caused by a
network partition or a long-running GC pause, either way it's
recommended that you verify your environment.
2011-08-24 15:21:21,939 INFO org.apache.zookeeper.ClientCnxn:
EventThread shut down
2011-08-24 15:21:27,210 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
server serverName=hadoop1.ioffer.com,60020,1313931812841,
load=(requests=246, regions=2, usedHeap=43, maxHeap=3983): Unhandled
exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
rejected; currently processing hadoop1.ioffer.com,60020,1313931812841
as dead server
org.apache.hadoop.hbase.YouAreDeadException:
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
currently processing hadoop1.ioffer.com,60020,1313931812841 as dead
server
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
at
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:733)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:594)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
currently processing hadoop1.ioffer.com,60020,1313931812841 as dead
server
at
org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:201)
at
org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:259)
at
org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:641)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
at
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
at $Proxy5.regionServerReport(Unknown Source)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:727)
... 2 more
2011-08-24 15:21:27,211 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
requests=82, regions=2, stores=2, storefiles=1, storefileIndexSize=0,
memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=40,
maxHeap=3983, blockCacheSize=6980720, blockCacheFree=828393552,
blockCacheCount=22, blockCacheHitCount=84188, blockCacheMissCount=27,
blockCacheEvictedCount=0, blockCacheHitRatio=99,
blockCacheHitCachingRatio=99
2011-08-24 15:21:27,211 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Unhandled
exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
rejected; currently processing hadoop1.ioffer.com,60020,1313931812841
as dead server
2011-08-24 15:21:27,211 INFO org.apache.hadoop.ipc.HBaseServer:
Stopping server on 60020
2011-08-24 15:21:27,211 INFO org.apache.hadoop.ipc.HBaseServer:
Stopping IPC Server listener on 60020
2011-08-24 15:21:27,212 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 3 on 60020: exiting
2011-08-24 15:21:27,212 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 5 on 60020: exiting
2011-08-24 15:21:27,212 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 9 on 60020: exiting
2011-08-24 15:21:27,212 INFO org.apache.hadoop.ipc.HBaseServer: PRI
IPC Server handler 8 on 60020: exiting
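
Since the excerpt above is long and hard-wrapped, here is a minimal sketch of how the interesting events could be pulled out of a log like this one. It is only an illustration: the function names (join_records, summarize) are made up for this example and are not part of HBase or ZooKeeper, and the two patterns are simply the message texts that appear in the excerpt above.

```python
import re

# A record starts with a timestamp such as "2011-08-24 15:20:47,202".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")
# ZooKeeper client message reporting how long the server has been silent.
SILENCE = re.compile(r"have not heard from server in (\d+)ms")


def join_records(lines):
    """Merge hard-wrapped lines so that each log record is one string."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not records:
            records.append(line)          # a new record begins here
        else:
            records[-1] += " " + line     # continuation of the previous record
    return records


def summarize(lines):
    """Return (silences_ms, abort_timestamps) found in the log lines."""
    silences, aborts = [], []
    for record in join_records(lines):
        match = SILENCE.search(record)
        if match:
            silences.append(int(match.group(1)))
        if " FATAL " in record and "ABORTING region server" in record:
            aborts.append(record[:23])    # keep only the timestamp
    return silences, aborts
```

Calling summarize() on an open log file (any iterable of lines works) returns the reported silence durations in milliseconds and the timestamps of the region server aborts, which makes it easier to line the two up.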