We have a 6-node cluster, 5 of them running region servers. 2 of the region
servers have been stable for days, but 3 of them keep crashing. Below are the
logs from around the time the crashes occur. (By the way, we are shoving
approximately the Twitter firehose into HBase via Flume; there is a simplified
sketch of the write path below, after the server list.) I'm an HBase newbie,
but I have been reading. I'm not sure what else is needed to help debug. When
the problem occurs, it is concurrent with issues that show up all the way down
in the datanode. Also interesting to note: the three servers all seem to fail
somewhat differently:

Servers (6):
00.hadoop is the master.
01.hadoop and 03.hadoop are the stable region servers (there is no 02.hadoop;
that was a flaky machine that got taken out).
04.hadoop, 05.hadoop, and 06.hadoop are the region servers that keep crashing,
at different times and with very different logs.
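
Simplified sketch of the write path, in case it matters (the real Flume sink is
more involved; the qualifier and buffer size below are placeholders, but the
table and column family are the ones you'll see in the logs):

    // Rough shape of our writer against the HBase 0.89 client API (illustrative only).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ArticleWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new HBaseConfiguration();  // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "article");
        table.setAutoFlush(false);                      // buffer puts client-side
        table.setWriteBufferSize(2 * 1024 * 1024);      // placeholder buffer size (~2 MB)

        // One Put per incoming item: row key is a content hash, value is the raw JSON payload.
        Put put = new Put(Bytes.toBytes("a83858a08f2270d319f75a7b43c756c2453988e7"));
        put.add(Bytes.toBytes("data"), Bytes.toBytes("json"), Bytes.toBytes("{...}"));  // "json" qualifier is a placeholder
        table.put(put);

        table.flushCommits();                           // push any buffered edits
      }
    }

Every put goes through the WAL, so the region servers are appending to and
rolling their hlogs continuously (you can see the Roll messages in the logs
below, at roughly 50k entries / 64 MB per hlog).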


Master Attributes:
HBase Version: 0.89.20100924+28
Hadoop Version: 0.20.2+737, r98c55c28258aa6f42250569bd7fa431ac657bdbd
HBase Root Directory: hdfs://00.hadoop.****:54310/hbase
Load average: 103.0 (average number of regions per regionserver, naive computation)
Regions on FS: 577


(This is after recently restarting 04-06)


Address          Start Code     Load
01.hadoop:60030  1291268910638  requests=223, regions=118, usedHeap=519, maxHeap=987
03.hadoop:60030  1291269219610  requests=23, regions=111, usedHeap=862, maxHeap=987
04.hadoop:60030  1291910665912  requests=169, regions=82, usedHeap=194, maxHeap=888
05.hadoop:60030  1291909584060  requests=232, regions=110, usedHeap=477, maxHeap=888
06.hadoop:60030  1291909723787  requests=99, regions=94, usedHeap=394, maxHeap=888
Total: servers=5, requests=746, regions=515
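
(The "Load average: 103.0" in the master attributes above looks like it is just
this table's naive per-server region count: 515 online regions / 5 region
servers = 103.)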



Following are various logs from around the time of the failures on 04-06.hadoop.

************************************************************************************
04.hadoop  
************************************************************************************
REGION SERVER ->
2010-12-09 05:01:58,640 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
compaction completed on region 
article,a83858a08f2270d319f75a7b43c756c2453988e7,1291811929868.58f6d9fc80
c78f3ca490b0280b4f1226. in 0sec
2010-12-09 05:04:30,898 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: 
LRU Stats: total=1.46 MB, free=176.33 MB, max=177.79 MB, blocks=0, 
accesses=200583, hits=0, hitRatio=
0.00%%, evictions=0, evicted=0, evictedPerRun=NaN
2010-12-09 05:04:59,612 INFO 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- 
HDFS-200
2010-12-09 05:04:59,618 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.1291870585253, entries=53186, filesize=63776311. New hlog /hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.1291871099582
2010-12-09 05:08:02,033 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
Found 1 hlogs to remove  out of total 6; oldest outstanding sequenceid is 
32891801 from region article
,4b2039b791e894dd479b90661ca97087f61645d3,1291813201039.7dff47c63bb5648bb1ee6670c60553a5.
2010-12-09 05:08:02,034 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC 
Server handler 13 on 60020 took 182416ms appending an edit to hlog; editcount=0
2010-12-09 05:08:02,066 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
moving old hlog file 
/hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.12918679476
65 whose highest sequenceid is 32880227 to 
/hbase/.oldlogs/10.100.154.103%3A60020.1291867947665
2010-12-09 05:08:02,066 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 182936ms for sessionid 
0x12ca58c08ff0048, closing socket conn
ection and attempting reconnect
2010-12-09 05:08:02,089 FATAL 
org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server 
serverName=04.hadoop.****,60020,1291858770125, load=(requests=24, regions=
64, usedHeap=548, maxHeap=888): Unhandled exception
org.apache.hadoop.hbase.YouAreDeadException: 
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently 
processing 04.hadoop.****,60020,1291858770125 as dead s
erver
        at 
org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:217)
        at 
org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:271)
        at 
org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:744)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:561)
        at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at 
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
        at 
org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
        at 
org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.doRun(HRegionServer.java:635)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.access$000(HRegionServer.java:126)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer$1.run(HRegionServer.java:518)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer$1.run(HRegionServer.java:516)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:337)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1046)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:516)
        at java.lang.Thread.run(Thread.java:662)
2010-12-09 05:08:02,090 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: 
request=0.0, regions=64, stores=64, storefiles=136, storefileIndexSize=27, 
memsto
reSize=353, compactionQueueSize=0, usedHeap=549, maxHeap=888, 
blockCacheSize=1530552, blockCacheFree=184893160, blockCacheCount=0, 
blockCacheHitRatio=0
2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server 
on 60020
2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020


MASTER HBASE ->

2010-12-09 05:05:06,072 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 
.META. region(s) scanned
2010-12-09 05:05:40,004 INFO org.apache.hadoop.hbase.master.ServerManager: 
04.hadoop.jive,60020,1291858770125 znode expired
2010-12-09 05:05:40,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
Added=04.hadoop.jive,60020,1291858770125 to dead servers, added shutdown 
processing operation
2010-12-09 05:05:40,005 DEBUG org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: ProcessServerShutdown of 04.hadoop.jive,60020,1291858770125
2010-12-09 05:05:40,005 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of server 04.hadoop.jive,60020,1291858770125: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
2010-12-09 05:05:40,008 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting 7 hlog(s) in hdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125
2010-12-09 05:05:40,008 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 1 of 7: hdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665, length=63757709
2010-12-09 05:05:40,008 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665


NAMENODE ->

2010-12-09 05:08:02,471 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 10.100.159.13:50010 is added to 
blk_1531008743226086399_251615 size 63757709
2010-12-09 05:08:02,473 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 
on 54310, call 
complete(/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665,
 DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:44117: 
error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease 
on 
/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665
 File does not exist. [Lease.  Holder: 
DFSClient_hb_m_10.194.194.79:60000_1291788452343, pendingcreates: 1]
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on 
/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665
 File does not exist. [Lease.  Holder: 
DFSClient_hb_m_10.194.194.79:60000_1291788452343, pendingcreates: 1]
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1488)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1479)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1534)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1522)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:610)
        at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1313)
2010-12-09 05:08:04,206 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 10.99.97.106:50010 is added to 
blk_-734280257049179934_251614 size 7330

DATANODE ->

2010-12-09 05:08:02,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in BlockReceiver.lastNodeRun: java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
        at sun.nio.ch.IOUtil.write(IOUtil.java:71)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
        at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870)
        at java.lang.Thread.run(Thread.java:662)
2010-12-09 05:08:02,213 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
checkDiskError: exception: 
2010-12-09 05:08:02,213 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
checkDiskError: exception: java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
        at sun.nio.ch.IOUtil.write(IOUtil.java:71)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        at 
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
        at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870)
        at java.lang.Thread.run(Thread.java:662)
2010-12-09 05:08:02,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder blk_-8817504198034990390_251613 0 Exception 
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
        at sun.nio.ch.IOUtil.write(IOUtil.java:71)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        at 
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
        at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870)
        at java.lang.Thread.run(Thread.java:662)

2010-12-09 05:08:02,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder 0 for block blk_-8817504198034990390_251613 terminating
2010-12-09 05:08:02,224 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
writeBlock blk_-8817504198034990390_251613 received exception 
java.io.IOException: Connection reset by peer
2010-12-09 05:08:02,224 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Scheduling block blk_4505306716668305567_250752 file 
/dist1/data/hadoop-data/current/subdir19/blk_4505306716668305567 for deletion
2010-12-09 05:08:02,226 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(10.100.154.103:50010, 
storageID=DS-866555524-10.100.154.103-50010-1291262739506, infoPort=50075, 
ipcPort=50020):DataXceiver
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
        at sun.nio.ch.IOUtil.read(IOUtil.java:210)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:267)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2010-12-09 05:08:02,226 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Deleted block blk_1422641301942366074_250732 at file 
/dist1/data/hadoop-data/current/subdir19/blk_1422641301942366074
2010-12-09 05:08:02,227 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Deleted block blk_4505306716668305567_250752 at file 
/dist1/data/hadoop-data/current/subdir19/blk_4505306716668305567



************************************************************************************
05.hadoop
************************************************************************************
REGION SERVER ->
2010-12-09 04:29:25,047 DEBUG org.apache.hadoop.hbase.regionserver.Store: 
Compaction size of data: 250.5m; Skipped 1 file(s), size: 206896367
2010-12-09 04:29:25,047 INFO org.apache.hadoop.hbase.regionserver.Store: 
Started compaction of 2 file(s) in data of 
article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62.
  into 
hdfs://00.hadoop.****:54310/hbase/article/95e92e97823e11e57850720a63433a62/.tmp,
 sequenceid=33050221
2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.Store: 
Completed compaction of 2 file(s) in data of 
article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62.;
 new storefile is 
hdfs://00.hadoop.****:54310/hbase/article/95e92e97823e11e57850720a63433a62/data/2787768957794924068;
 store size is 250.5m
2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
compaction completed on region 
article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62.
 in 1sec
2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
Starting compaction on region 
article,ed0cdc7ae6db73432e2ca47c973d545bee11ddd8,1291813400028.85a8d31d099f409dd36f3bcc4be1eba6.
2010-12-09 04:29:26,330 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
compaction completed on region 
article,ed0cdc7ae6db73432e2ca47c973d545bee11ddd8,1291813400028.85a8d31d099f409dd36f3bcc4be1eba6.
 in 0sec
2010-12-09 04:32:33,081 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
2010-12-09 04:32:33,081 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server 
on 60020
2010-12-09 04:32:33,089 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream 
ResponseProcessor exception  for block 
blk_2012253533858803341_250248java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621)

2010-12-09 04:32:33,089 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
additional data from server sessionid 0x12ca58c08ff0049, likely server has 
closed socket, closing sock
et connection and attempting reconnect
2010-12-09 04:32:33,089 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC 
Server handler 5 on 60020 took 75446ms appending an edit to hlog; 
editcount=50505
2010-12-09 04:32:33,090 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 18 on 60020: exiting
2010-12-09 04:32:33,139 INFO org.apache.hadoop.hdfs.DFSClient: Error Recovery 
for block blk_2012253533858803341_250248 waiting for responder to exit. 
2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 8 on 60020: exiting
2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 1 on 60020: exiting
2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 10 on 60020: exiting
2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on 60020: exiting
2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on 60020: exiting
2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on 60020: exiting
2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on 60020: exiting
2010-12-09 04:32:33,142 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on 60020: exiting
2010-12-09 04:32:33,142 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: exiting
2010-12-09 04:32:33,143 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 2 on 60020: exiting
2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC 
Server Responder
2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 6 on 60020: exiting
2010-12-09 04:32:33,144 INFO org.mortbay.log: Stopped 
selectchannelconnec...@0.0.0.0:60030




NAMENODE ->
2010-12-09 04:32:08,408 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
NameSystem.startFile: failed to create file 
/hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555
 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, 
because this file is already being created by NN_Recovery on 10.100.159.13
2010-12-09 04:32:08,408 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 
on 54310, call 
append(/hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555,
 DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:59118: 
error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to 
create file /hba
se/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 
for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, 
because this file is already being created by NN_Recovery on 10.100.159.13
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create 
file 
/hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555
 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, 
because this file is already being created by NN_Recovery on 10.100.159.13
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1194)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1282)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:541)
        at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1313)
2010-12-09 04:32:09,411 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
2010-12-09 04:32:09,411 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310, call append(/hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555, DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:59118: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1194)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1282)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:541)
        at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)



DATANODE ->

2010-12-09 04:32:09,534 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder blk_2012253533858803341_250248 1 : Thread is interrupted.
2010-12-09 04:32:09,534 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(10.100.159.13:50010, 
storageID=DS-145997007-10.100.159.13-50010-1291262743182, infoPort=50075, 
ipcPort=50020):DataXceiver
java.io.IOException: Interrupted receiveBlock
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:579)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2010-12-09 04:32:09,534 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder 1 for block blk_2012253533858803341_250248 terminating


************************************************************************************
06.hadoop
************************************************************************************
REGION SERVER ->

2010-12-09 06:17:47,215 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:47,430 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:47,549 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:47,557 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:47,563 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:48,619 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:48,619 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:48,619 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:48,619 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:48,619 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:49,703 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:49,703 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:49,703 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:49,705 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:49,729 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.

A bunch of these ^

Then a little later:
.... 

2010-12-09 06:23:23,561 DEBUG 
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested 
for region article,993143fca3613014990cc6975cacdd62f4147e03,12918133247
53.19413f33fc660536d9d4f315a1283440. because: regionserver60020.cacheFlusher
2010-12-09 06:23:27,058 INFO 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- 
HDFS-200
2010-12-09 06:23:27,080 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
Roll 
/hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291875047947,
 entries=53315,
 filesize=63781925. New hlog 
/hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291875807006
2010-12-09 06:23:27,080 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
Found 1 hlogs to remove  out of total 8; oldest outstanding sequenceid is 
33526881 from region article
,1221b0013c61fd7446cb822f698d0bcafa487342,1291872610788.bb2aa6ae8c694a096a00e7bb91fc46c3.
2010-12-09 06:23:27,080 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
moving old hlog file 
/hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291872064424
 whose highest sequenceid is 33506639 to 
/hbase/.oldlogs/10.99.97.106%3A60020.1291872064424
2010-12-09 06:25:39,162 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC 
Server handler 12 on 60020 took 119034ms appending an edit to hlog; 
editcount=3683
2010-12-09 06:25:39,162 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer 
Exception: java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
        at sun.nio.ch.IOUtil.write(IOUtil.java:71)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        at 
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2509)
2010-12-09 06:25:39,163 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream 
ResponseProcessor exception  for block 
blk_1141147913164105086_258331java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621)

2010-12-09 06:25:39,163 INFO org.apache.hadoop.hdfs.DFSClient: Error Recovery 
for block blk_1141147913164105086_258331 waiting for responder to exit. 
2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream 
ResponseProcessor exception  for block 
blk_-7076574401277381592_258324java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621)

2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery 
for block blk_-7076574401277381592_258324 bad datanode[0] 10.99.97.106:50010
2010-12-09 06:25:39,164 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
additional data from server sessionid 0x12ca58c08ff004b, likely server has 
closed socket, closing socket connection and attempting reconnect
2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery 
for block blk_-7076574401277381592_258324 in pipeline 10.99.97.106:50010, 
10.100.154.103:50010: bad datanode 10.99.97.106:50010
2010-12-09 06:25:39,165 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery 
for block blk_1141147913164105086_258331 bad datanode[0] 10.99.97.106:50010
2010-12-09 06:25:39,167 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
additional data from server sessionid 0x12ca58c08ff004c, likely server has 
closed socket, closing socket connection and attempting reconnect
2010-12-09 06:25:39,169 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /10.100.154.103:50020. Already tried 0 time(s).
2010-12-09 06:25:39,169 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery 
for block blk_1141147913164105086_258331 in pipeline 10.99.97.106:50010, 
10.100.159.13:50010: bad datanode 10.99.97.106:50010
2010-12-09 06:25:39,209 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server 
on 60020
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 3 on 60020: exiting
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 9 on 60020: exiting
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 24 on 60020: exiting
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 1 on 60020: exiting
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 15 on 60020: exiting
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 0 on 60020: exiting
2010-12-09 06:25:39,214 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 3 on 60020: exiting
2010-12-09 06:25:39,214 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC 
Server listener on 60020
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 2 on 60020: exiting
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 7 on 60020: exiting
2010-12-09 06:25:39,213 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 21 on 60020: exiting
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 4 on 60020: exiting
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC 
Server Responder
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 5 on 60020: exiting
2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 8 on 60020: exiting
2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 9 on 60020: exiting
2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server 
handler 6 on 60020: exiting
2010-12-09 06:25:39,216 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2010-12-09 06:25:39,223 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery 
attempt #0 from primary datanode 10.100.159.13:50010


