We have a 6-node cluster, 5 of which run region servers. Two of the region servers have been stable for days, but the other three keep crashing. Below are the logs from around the time the crashes occur. (For context: we are pushing roughly the full Twitter firehose into HBase via Flume.) I'm an HBase newbie, but I have been reading; I'm not sure what else is needed to help debug. When the problem occurs, it is concurrent with issues that show up all the way down in the datanode. Also interesting to note: the three servers all seem to fail somewhat differently:
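In case more data would help: next time one of these servers starts to wobble I can capture GC and swap activity while the firehose runs. This is just stock JDK/Linux tooling, nothing HBase-specific, so treat the pid lookup and interval as illustrative:

    # sample GC utilization on the region server JVM every 5s
    jstat -gcutil $(jps | awk '/HRegionServer/ {print $1}') 5s
    # watch for swapping, which I gather can stall the JVM as badly as a GC pause
    vmstat 5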
Servers (6):
  00.hadoop is the master.
  01.hadoop, 03.hadoop are stable region servers (there is no 02.hadoop; that was a flaky machine that got taken out).
  04-06.hadoop are region servers that crash at different times, with very different logs.

Master attributes:
  HBase Version: 0.89.20100924+28, r
  Hadoop Version: 0.20.2+737, r98c55c28258aa6f42250569bd7fa431ac657bdbd
  HBase Root Directory: hdfs://00.hadoop.****:54310/hbase
  Load average: 103.0 (average number of regions per regionserver; naive computation)
  Regions On FS: 577 (this is after recently restarting 04-06)

Address          Start Code     Load
01.hadoop:60030  1291268910638  requests=223, regions=118, usedHeap=519, maxHeap=987
03.hadoop:60030  1291269219610  requests=23, regions=111, usedHeap=862, maxHeap=987
04.hadoop:60030  1291910665912  requests=169, regions=82, usedHeap=194, maxHeap=888
05.hadoop:60030  1291909584060  requests=232, regions=110, usedHeap=477, maxHeap=888
06.hadoop:60030  1291909723787  requests=99, regions=94, usedHeap=394, maxHeap=888
Total: servers: 5, requests=746, regions=515
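One thing that jumps out at me from that table: the region server heaps are smallish (maxHeap under 1 GB) for this write volume, and we have done no GC tuning at all. If I understand the docs, the knobs live in hbase-env.sh; this is a sketch of what I would try, not a verbatim copy of our current file:

    # hbase-env.sh (illustrative values, not our running config)
    export HBASE_HEAPSIZE=2000
    export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
        -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"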
Following are various logs from around the time of the failures for 04-06.

************************************************************************************
04.hadoop
************************************************************************************

REGION SERVER ->

2010-12-09 05:01:58,640 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region article,a83858a08f2270d319f75a7b43c756c2453988e7,1291811929868.58f6d9fc80c78f3ca490b0280b4f1226. in 0sec
2010-12-09 05:04:30,898 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1.46 MB, free=176.33 MB, max=177.79 MB, blocks=0, accesses=200583, hits=0, hitRatio=0.00%%, evictions=0, evicted=0, evictedPerRun=NaN
2010-12-09 05:04:59,612 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
2010-12-09 05:04:59,618 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.1291870585253, entries=53186, filesize=63776311. New hlog /hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.1291871099582
2010-12-09 05:08:02,033 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1 hlogs to remove out of total 6; oldest outstanding sequenceid is 32891801 from region article,4b2039b791e894dd479b90661ca97087f61645d3,1291813201039.7dff47c63bb5648bb1ee6670c60553a5.
2010-12-09 05:08:02,034 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 13 on 60020 took 182416ms appending an edit to hlog; editcount=0
2010-12-09 05:08:02,066 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: moving old hlog file /hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.1291867947665 whose highest sequenceid is 32880227 to /hbase/.oldlogs/10.100.154.103%3A60020.1291867947665
2010-12-09 05:08:02,066 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 182936ms for sessionid 0x12ca58c08ff0048, closing socket connection and attempting reconnect
2010-12-09 05:08:02,089 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server serverName=04.hadoop.****,60020,1291858770125, load=(requests=24, regions=64, usedHeap=548, maxHeap=888): Unhandled exception
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing 04.hadoop.****,60020,1291858770125 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:217)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:271)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:744)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:561)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.doRun(HRegionServer.java:635)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.access$000(HRegionServer.java:126)
    at org.apache.hadoop.hbase.regionserver.HRegionServer$1.run(HRegionServer.java:518)
    at org.apache.hadoop.hbase.regionserver.HRegionServer$1.run(HRegionServer.java:516)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:337)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1046)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:516)
    at java.lang.Thread.run(Thread.java:662)
2010-12-09 05:08:02,090 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=64, stores=64, storefiles=136, storefileIndexSize=27, memstoreSize=353, compactionQueueSize=0, usedHeap=549, maxHeap=888, blockCacheSize=1530552, blockCacheFree=184893160, blockCacheCount=0, blockCacheHitRatio=0
2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020

MASTER HBASE ->

2010-12-09 05:05:06,072 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2010-12-09 05:05:40,004 INFO org.apache.hadoop.hbase.master.ServerManager: 04.hadoop.jive,60020,1291858770125 znode expired
2010-12-09 05:05:40,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=04.hadoop.jive,60020,1291858770125 to dead servers, added shutdown processing operation
2010-12-09 05:05:40,005 DEBUG org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: ProcessServerShutdown of 04.hadoop.jive,60020,1291858770125
2010-12-09 05:05:40,005 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of server 04.hadoop.jive,60020,1291858770125: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
2010-12-09 05:05:40,008 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting 7 hlog(s) in hdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125
2010-12-09 05:05:40,008 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 1 of 7: hdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665, length=63757709
2010-12-09 05:05:40,008 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665

NAMENODE ->

2010-12-09 05:08:02,471 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.159.13:50010 is added to blk_1531008743226086399_251615 size 63757709
2010-12-09 05:08:02,473 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310, call complete(/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665, DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:44117: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665 File does not exist. [Lease. Holder: DFSClient_hb_m_10.194.194.79:60000_1291788452343, pendingcreates: 1]
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665 File does not exist. [Lease. Holder: DFSClient_hb_m_10.194.194.79:60000_1291788452343, pendingcreates: 1]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1488)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1479)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1534)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1522)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:610)
    at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1313)
2010-12-09 05:08:04,206 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.99.97.106:50010 is added to blk_-734280257049179934_251614 size 7330

DATANODE ->

2010-12-09 05:08:02,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.lastNodeRun:
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcher.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
    at sun.nio.ch.IOUtil.write(IOUtil.java:71)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
    at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870)
    at java.lang.Thread.run(Thread.java:662)
2010-12-09 05:08:02,213 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:
2010-12-09 05:08:02,213 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception: java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcher.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
    at sun.nio.ch.IOUtil.write(IOUtil.java:71)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
    at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870)
    at java.lang.Thread.run(Thread.java:662)
2010-12-09 05:08:02,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8817504198034990390_251613 0 Exception java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcher.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
    at sun.nio.ch.IOUtil.write(IOUtil.java:71)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
    at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870)
    at java.lang.Thread.run(Thread.java:662)
2010-12-09 05:08:02,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-8817504198034990390_251613 terminating
2010-12-09 05:08:02,224 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8817504198034990390_251613 received exception java.io.IOException: Connection reset by peer
2010-12-09 05:08:02,224 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_4505306716668305567_250752 file /dist1/data/hadoop-data/current/subdir19/blk_4505306716668305567 for deletion
2010-12-09 05:08:02,226 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.100.154.103:50010, storageID=DS-866555524-10.100.154.103-50010-1291262739506, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
    at sun.nio.ch.IOUtil.read(IOUtil.java:210)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:267)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2010-12-09 05:08:02,226 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block blk_1422641301942366074_250732 at file /dist1/data/hadoop-data/current/subdir19/blk_1422641301942366074
2010-12-09 05:08:02,227 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block blk_4505306716668305567_250752 at file /dist1/data/hadoop-data/current/subdir19/blk_4505306716668305567
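If I'm reading the 04.hadoop sequence right, the JVM stalled for roughly three minutes ("took 182416ms appending an edit to hlog"), the ZooKeeper session expired during the stall, the master expired the znode and began splitting the logs, and the YouAreDeadException on the next report is just the fallout; so the real question seems to be what stalled the process, not the exception itself. As I understand it, the grace period is zookeeper.session.timeout in hbase-site.xml, and judging by the "182936ms" message we already appear to be at the ~3 minute default, so the sketch below would only buy headroom, not a fix:

    <property>
      <name>zookeeper.session.timeout</name>
      <!-- milliseconds; illustrative value, roughly what we appear to have now -->
      <value>180000</value>
    </property>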
************************************************************************************
05.hadoop
************************************************************************************

REGION SERVER ->

2010-12-09 04:29:25,047 DEBUG org.apache.hadoop.hbase.regionserver.Store: Compaction size of data: 250.5m; Skipped 1 file(s), size: 206896367
2010-12-09 04:29:25,047 INFO org.apache.hadoop.hbase.regionserver.Store: Started compaction of 2 file(s) in data of article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62. into hdfs://00.hadoop.****:54310/hbase/article/95e92e97823e11e57850720a63433a62/.tmp, sequenceid=33050221
2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.Store: Completed compaction of 2 file(s) in data of article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62.; new storefile is hdfs://00.hadoop.****:54310/hbase/article/95e92e97823e11e57850720a63433a62/data/2787768957794924068; store size is 250.5m
2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62. in 1sec
2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region article,ed0cdc7ae6db73432e2ca47c973d545bee11ddd8,1291813400028.85a8d31d099f409dd36f3bcc4be1eba6.
2010-12-09 04:29:26,330 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region article,ed0cdc7ae6db73432e2ca47c973d545bee11ddd8,1291813400028.85a8d31d099f409dd36f3bcc4be1eba6. in 0sec
2010-12-09 04:32:33,081 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
2010-12-09 04:32:33,081 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
2010-12-09 04:32:33,089 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_2012253533858803341_250248
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621)
2010-12-09 04:32:33,089 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x12ca58c08ff0049, likely server has closed socket, closing socket connection and attempting reconnect
2010-12-09 04:32:33,089 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 5 on 60020 took 75446ms appending an edit to hlog; editcount=50505
2010-12-09 04:32:33,090 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 18 on 60020: exiting
2010-12-09 04:32:33,139 INFO org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2012253533858803341_250248 waiting for responder to exit.
2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020: exiting
2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 1 on 60020: exiting
2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 10 on 60020: exiting
2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on 60020: exiting
2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on 60020: exiting
2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on 60020: exiting
2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on 60020: exiting
2010-12-09 04:32:33,142 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on 60020: exiting
2010-12-09 04:32:33,142 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: exiting
2010-12-09 04:32:33,143 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 2 on 60020: exiting
2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 6 on 60020: exiting
2010-12-09 04:32:33,144 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60030

NAMENODE ->

2010-12-09 04:32:08,408 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
2010-12-09 04:32:08,408 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 54310, call append(/hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555, DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:59118: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1194)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1282)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:541)
    at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1313)
2010-12-09 04:32:09,411 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
2010-12-09 04:32:09,411 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310, call append(/hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555, DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:59118: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1194)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1282)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:541)
    at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)

DATANODE ->

2010-12-09 04:32:09,534 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_2012253533858803341_250248 1 : Thread is interrupted.
2010-12-09 04:32:09,534 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.100.159.13:50010, storageID=DS-145997007-10.100.159.13-50010-1291262743182, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Interrupted receiveBlock
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:579)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2010-12-09 04:32:09,534 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_2012253533858803341_250248 terminating
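Since all three failures bottom out in datanode socket errors, the other things I plan to verify (straight off the HBase requirements list, so the values below are illustrative rather than what we currently run) are the datanode transceiver limit and the file descriptor ulimit for the hadoop user:

    <!-- hdfs-site.xml; note the property really is spelled "xcievers" -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

    # /etc/security/limits.conf
    hadoop  -  nofile  32768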
************************************************************************************
06.hadoop
************************************************************************************

REGION SERVER ->

2010-12-09 06:17:47,215 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:47,430 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:47,549 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:47,557 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:47,563 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:49,703 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:49,703 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:49,703 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:49,705 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
2010-12-09 06:17:49,729 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
A bunch of these ^
Then a little later:
....
2010-12-09 06:23:23,561 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region article,993143fca3613014990cc6975cacdd62f4147e03,1291813324753.19413f33fc660536d9d4f315a1283440. because: regionserver60020.cacheFlusher
2010-12-09 06:23:27,058 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
2010-12-09 06:23:27,080 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291875047947, entries=53315, filesize=63781925. New hlog /hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291875807006
2010-12-09 06:23:27,080 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1 hlogs to remove out of total 8; oldest outstanding sequenceid is 33526881 from region article,1221b0013c61fd7446cb822f698d0bcafa487342,1291872610788.bb2aa6ae8c694a096a00e7bb91fc46c3.
2010-12-09 06:23:27,080 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: moving old hlog file /hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291872064424 whose highest sequenceid is 33506639 to /hbase/.oldlogs/10.99.97.106%3A60020.1291872064424
2010-12-09 06:25:39,162 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 12 on 60020 took 119034ms appending an edit to hlog; editcount=3683
2010-12-09 06:25:39,162 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcher.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
    at sun.nio.ch.IOUtil.write(IOUtil.java:71)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2509)
2010-12-09 06:25:39,163 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1141147913164105086_258331
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621)
2010-12-09 06:25:39,163 INFO org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1141147913164105086_258331 waiting for responder to exit.
2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-7076574401277381592_258324
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621)
2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7076574401277381592_258324 bad datanode[0] 10.99.97.106:50010
2010-12-09 06:25:39,164 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x12ca58c08ff004b, likely server has closed socket, closing socket connection and attempting reconnect
2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7076574401277381592_258324 in pipeline 10.99.97.106:50010, 10.100.154.103:50010: bad datanode 10.99.97.106:50010
2010-12-09 06:25:39,165 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1141147913164105086_258331 bad datanode[0] 10.99.97.106:50010
2010-12-09 06:25:39,167 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x12ca58c08ff004c, likely server has closed socket, closing socket connection and attempting reconnect
2010-12-09 06:25:39,169 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.100.154.103:50020. Already tried 0 time(s).
2010-12-09 06:25:39,169 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1141147913164105086_258331 in pipeline 10.99.97.106:50010, 10.100.159.13:50010: bad datanode 10.99.97.106:50010
2010-12-09 06:25:39,209 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020: exiting
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020: exiting
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 24 on 60020: exiting
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 1 on 60020: exiting
2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 15 on 60020: exiting
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
2010-12-09 06:25:39,214 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on 60020: exiting
2010-12-09 06:25:39,214 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 2 on 60020: exiting
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on 60020: exiting
2010-12-09 06:25:39,213 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 21 on 60020: exiting
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on 60020: exiting
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on 60020: exiting
2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on 60020: exiting
2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: exiting
2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 6 on 60020: exiting
2010-12-09 06:25:39,216 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2010-12-09 06:25:39,223 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #0 from primary datanode 10.100.159.13:50010
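One last data point: the WAL writer logs "Using syncFs -- HDFS-200", and my understanding is that this only actually protects the log if append/sync support is switched on on the HDFS side, i.e. something like the following in hdfs-site.xml (I believe we have this set, but I am listing it in case it matters):

    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>

Happy to pull more logs or configs if that would help narrow things down.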