Lance,

Both those lines indicate the problem:

IPC Server handler 13 on 60020 took 182416ms
Client session timed out, have not heard from server in 182936ms

It's very clear that your region servers are suffering from stop-the-world garbage collection pauses. Basically this one GC'ed for 3 minutes, which is well over the 1-minute timeout after which a region server is considered dead. Everything else in the logs is a side effect of that. This subject comes up often on the mailing list, for example: http://search-hadoop.com/m/t97q027tr7f2

Make sure you don't swap, give plenty of RAM to HBase, use LZO compression, don't underprovision your machines, etc. To confirm the pauses, turn on GC logging; see the sketch at the bottom of this mail.

J-D

On Thu, Dec 9, 2010 at 11:27 AM, Lance Riedel <lancerie...@gmail.com> wrote:
>
> We have a 6-node cluster, 5 with region servers. 2 of the region servers have
> been stable for days, but 3 of them keep crashing. Here are the logs from
> around when the crash occurs. (btw, we are shoving approximately the twitter
> firehose into hbase via flume) I'm an hbase newbie, but I have been reading.
> Not sure what else is needed to help debug. When the problem occurs, it is
> concurrent with issues that appear all the way down in the datanode. Also,
> interesting to note, all 3 servers seem to fail somewhat differently:
>
> Servers (6):
> 00.hadoop is the master
> 01.hadoop, 03.hadoop (there is no 02.hadoop - flaky machine that got taken out) - stable Region Servers
> 04-06.hadoop - Region Servers that crash at different times, with very different logs
>
>
> Master Attributes:
> HBase Version: 0.89.20100924+28, r
> Hadoop Version: 0.20.2+737, r98c55c28258aa6f42250569bd7fa431ac657bdbd
> HBase Root Directory: hdfs://00.hadoop.****:54310/hbase
> Load average: 103.0 (average number of regions per regionserver; naive computation)
> Regions On FS: 577
>
> (This is after recently restarting 04-06)
>
> Address          Start Code     Load
> 01.hadoop:60030  1291268910638  requests=223, regions=118, usedHeap=519, maxHeap=987
> 03.hadoop:60030  1291269219610  requests=23, regions=111, usedHeap=862, maxHeap=987
> 04.hadoop:60030  1291910665912  requests=169, regions=82, usedHeap=194, maxHeap=888
> 05.hadoop:60030  1291909584060  requests=232, regions=110, usedHeap=477, maxHeap=888
> 06.hadoop:60030  1291909723787  requests=99, regions=94, usedHeap=394, maxHeap=888
> Total: servers: 5, requests=746, regions=515
>
>
> Following are various logs around the time of the failures for 04-06
>
> ************************************************************************************
> 04.hadoop
> ************************************************************************************
> REGION SERVER ->
> 2010-12-09 05:01:58,640 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region article,a83858a08f2270d319f75a7b43c756c2453988e7,1291811929868.58f6d9fc80c78f3ca490b0280b4f1226. in 0sec
> 2010-12-09 05:04:30,898 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1.46 MB, free=176.33 MB, max=177.79 MB, blocks=0, accesses=200583, hits=0, hitRatio=0.00%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2010-12-09 05:04:59,612 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
> 2010-12-09 05:04:59,618 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.1291870585253, entries=53186, filesize=63776311.
> New hlog /hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.1291871099582
> 2010-12-09 05:08:02,033 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1 hlogs to remove out of total 6; oldest outstanding sequenceid is 32891801 from region article,4b2039b791e894dd479b90661ca97087f61645d3,1291813201039.7dff47c63bb5648bb1ee6670c60553a5.
> 2010-12-09 05:08:02,034 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 13 on 60020 took 182416ms appending an edit to hlog; editcount=0
> 2010-12-09 05:08:02,066 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: moving old hlog file /hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.1291867947665 whose highest sequenceid is 32880227 to /hbase/.oldlogs/10.100.154.103%3A60020.1291867947665
> 2010-12-09 05:08:02,066 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 182936ms for sessionid 0x12ca58c08ff0048, closing socket connection and attempting reconnect
> 2010-12-09 05:08:02,089 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server serverName=04.hadoop.****,60020,1291858770125, load=(requests=24, regions=64, usedHeap=548, maxHeap=888): Unhandled exception
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing 04.hadoop.****,60020,1291858770125 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:217)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:271)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:744)
>         at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:561)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>         at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>         at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.doRun(HRegionServer.java:635)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.access$000(HRegionServer.java:126)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$1.run(HRegionServer.java:518)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$1.run(HRegionServer.java:516)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:337)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1046)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:516)
>         at java.lang.Thread.run(Thread.java:662)
> 2010-12-09 05:08:02,090 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=64, stores=64, storefiles=136, storefileIndexSize=27, memstoreSize=353, compactionQueueSize=0, usedHeap=549, maxHeap=888, blockCacheSize=1530552, blockCacheFree=184893160, blockCacheCount=0, blockCacheHitRatio=0
> 2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
> 2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020
>
>
> MASTER HBASE ->
>
> 2010-12-09 05:05:06,072 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> 2010-12-09 05:05:40,004 INFO org.apache.hadoop.hbase.master.ServerManager: 04.hadoop.jive,60020,1291858770125 znode expired
> 2010-12-09 05:05:40,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=04.hadoop.jive,60020,1291858770125 to dead servers, added shutdown processing operation
> 2010-12-09 05:05:40,005 DEBUG org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: ProcessServerShutdown of 04.hadoop.jive,60020,1291858770125
> 2010-12-09 05:05:40,005 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of server 04.hadoop.jive,60020,1291858770125: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2010-12-09 05:05:40,008 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting 7 hlog(s) in hdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125
> 2010-12-09 05:05:40,008 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 1 of 7: hdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665, length=63757709
> 2010-12-09 05:05:40,008 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665
>
>
> NAMENODE ->
>
> 2010-12-09 05:08:02,471 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.159.13:50010 is added to blk_1531008743226086399_251615 size 63757709
> 2010-12-09 05:08:02,473 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310, call complete(/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665, DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:44117: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665 File does not exist. [Lease. Holder: DFSClient_hb_m_10.194.194.79:60000_1291788452343, pendingcreates: 1]
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665 File does not exist. [Lease.
Holder: > DFSClient_hb_m_10.194.194.79:60000_1291788452343, pendingcreates: 1] > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1488) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1479) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1534) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1522) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:610) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1313) > 2010-12-09 05:08:04,206 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.99.97.106:50010 is added to > blk_-734280257049179934_251614 size 7330 > > DATANODE -> > > 2010-12-09 05:08:02,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > IOException in BlockReceiver.lastNodeRun: java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcher.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100) at > sun.nio.ch.IOUtil.write(IOUtil.java:71) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334) > at > org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55) > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) > at > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146) > at > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107) > at java.io.DataOutputStream.writeLong(DataOutputStream.java:207) > at > org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870) > at java.lang.Thread.run(Thread.java:662) > 2010-12-09 05:08:02,213 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > checkDiskError: exception: > 2010-12-09 05:08:02,213 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > checkDiskError: exception: java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcher.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100) > at sun.nio.ch.IOUtil.write(IOUtil.java:71) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334) > at > org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55) > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) > at > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146) > at > 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107) > at java.io.DataOutputStream.writeLong(DataOutputStream.java:207) > at > org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870) > at java.lang.Thread.run(Thread.java:662) > 2010-12-09 05:08:02,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > PacketResponder blk_-8817504198034990390_251613 0 Exception > java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcher.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100) > at sun.nio.ch.IOUtil.write(IOUtil.java:71) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334) > at > org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55) > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) > at > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146) > at > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107) > at java.io.DataOutputStream.writeLong(DataOutputStream.java:207) > at > org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870) > at java.lang.Thread.run(Thread.java:662) > > 2010-12-09 05:08:02,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > PacketResponder 0 for block blk_-8817504198034990390_251613 terminating > 2010-12-09 05:08:02,224 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > writeBlock blk_-8817504198034990390_251613 received exception > java.io.IOException: Connection reset by peer > 2010-12-09 05:08:02,224 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Scheduling block blk_4505306716668305567_250752 file > /dist1/data/hadoop-data/current/subdir19/blk_4505306716668305567 for deletion > 2010-12-09 05:08:02,226 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: > DatanodeRegistration(10.100.154.103:50010, > storageID=DS-866555524-10.100.154.103-50010-1291262739506, infoPort=50075, > ipcPort=50020):DataXceiver > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcher.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237) > at sun.nio.ch.IOUtil.read(IOUtil.java:210) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) > at > org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:256) > at java.io.BufferedInputStream.read(BufferedInputStream.java:317) > at java.io.DataInputStream.read(DataInputStream.java:132) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:267) > at > 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122) > 2010-12-09 05:08:02,226 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Deleted block blk_1422641301942366074_250732 at file > /dist1/data/hadoop-data/current/subdir19/blk_1422641301942366074 > 2010-12-09 05:08:02,227 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Deleted block blk_4505306716668305567_250752 at file > /dist1/data/hadoop-data/current/subdir19/blk_4505306716668305567 > > > > ************************************************************************************ > 05.hadoop > ************************************************************************************ > REGION SERVER -> > 2010-12-09 04:29:25,047 DEBUG org.apache.hadoop.hbase.regionserver.Store: > Compaction size of data: 250.5m; Skipped 1 file(s), size: 206896367 > 2010-12-09 04:29:25,047 INFO org.apache.hadoop.hbase.regionserver.Store: > Started compaction of 2 file(s) in data of > article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62. > into > hdfs://00.hadoop.****:54310/hbase/article/95e92e97823e11e57850720a63433a62/.tmp, > sequenceid=33050221 > 2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed compaction of 2 file(s) in data of > article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62.; > new storefile is > hdfs://00.hadoop.****:54310/hbase/article/95e92e97823e11e57850720a63433a62/data/2787768957794924068; > store size is 250.5m > 2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.HRegion: > compaction completed on region > article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62. > in 1sec > 2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Starting compaction on region > article,ed0cdc7ae6db73432e2ca47c973d545bee11ddd8,1291813400028.85a8d31d099f409dd36f3bcc4be1eba6. > 2010-12-09 04:29:26,330 INFO org.apache.hadoop.hbase.regionserver.HRegion: > compaction completed on region > article,ed0cdc7ae6db73432e2ca47c973d545bee11ddd8,1291813400028.85a8d31d099f409dd36f3bcc4be1eba6. 
> in 0sec
> 2010-12-09 04:32:33,081 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
> 2010-12-09 04:32:33,081 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2010-12-09 04:32:33,089 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_2012253533858803341_250248java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>         at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621)
>
> 2010-12-09 04:32:33,089 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x12ca58c08ff0049, likely server has closed socket, closing socket connection and attempting reconnect
> 2010-12-09 04:32:33,089 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 5 on 60020 took 75446ms appending an edit to hlog; editcount=50505
> 2010-12-09 04:32:33,090 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 18 on 60020: exiting
> 2010-12-09 04:32:33,139 INFO org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2012253533858803341_250248 waiting for responder to exit.
> 2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020: exiting
> 2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 1 on 60020: exiting
> 2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 10 on 60020: exiting
> 2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
> 2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on 60020: exiting
> 2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on 60020: exiting
> 2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on 60020: exiting
> 2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on 60020: exiting
> 2010-12-09 04:32:33,142 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on 60020: exiting
> 2010-12-09 04:32:33,142 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: exiting
> 2010-12-09 04:32:33,143 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> 2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 2 on 60020: exiting
> 2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
> 2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 6 on 60020: exiting
> 2010-12-09 04:32:33,144 INFO org.mortbay.log: Stopped selectchannelconnec...@0.0.0.0:60030
>
>
> NAMENODE ->
> 2010-12-09 04:32:08,408 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
> 2010-12-09 04:32:08,408 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 54310, call append(/hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555, DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:59118: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1194)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1282)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:541)
>         at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1313)
> 2010-12-09 04:32:09,411 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
> 2010-12-09 04:32:09,411 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310, call append(/hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555, DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:59118: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1194)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1282)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:541)
>         at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
>
>
> DATANODE
>
> 2010-12-09 04:32:09,534 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_2012253533858803341_250248 1 : Thread is interrupted.
> 2010-12-09 04:32:09,534 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.100.159.13:50010, storageID=DS-145997007-10.100.159.13-50010-1291262743182, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Interrupted receiveBlock
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:579)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> 2010-12-09 04:32:09,534 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_2012253533858803341_250248 terminating
>
>
> ************************************************************************************
> 06.hadoop
> ************************************************************************************
> REGION SERVER ->
>
> 2010-12-09 06:17:47,215 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:47,430 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:47,549 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:47,557 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:47,563 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:49,703 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:49,703 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:49,703 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:49,705 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:49,729 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
>
> A bunch of these ^
>
> Then a little later:
> ....
>
> 2010-12-09 06:23:23,561 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region article,993143fca3613014990cc6975cacdd62f4147e03,1291813324753.19413f33fc660536d9d4f315a1283440. because: regionserver60020.cacheFlusher
> 2010-12-09 06:23:27,058 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
> 2010-12-09 06:23:27,080 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291875047947, entries=53315, filesize=63781925. New hlog /hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291875807006
> 2010-12-09 06:23:27,080 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1 hlogs to remove out of total 8; oldest outstanding sequenceid is 33526881 from region article,1221b0013c61fd7446cb822f698d0bcafa487342,1291872610788.bb2aa6ae8c694a096a00e7bb91fc46c3.
> 2010-12-09 06:23:27,080 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: > moving old hlog file > /hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291872064424 > whose highest sequenceid is 33506639 to > /hbase/.oldlogs/10.99.97.106%3A60020.1291872064424 > 2010-12-09 06:25:39,162 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: > IPC Server handler 12 on 60020 took 119034ms appending an edit to hlog; > editcount=3683 > 2010-12-09 06:25:39,162 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer > Exception: java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcher.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100) > at sun.nio.ch.IOUtil.write(IOUtil.java:71) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334) > at > org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55) > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) > at > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146) > at > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) > at java.io.DataOutputStream.write(DataOutputStream.java:90) > at > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2509) > 2010-12-09 06:25:39,163 WARN org.apache.hadoop.hdfs.DFSClient: > DFSOutputStream ResponseProcessor exception for block > blk_1141147913164105086_258331java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:180) > at java.io.DataInputStream.readLong(DataInputStream.java:399) > at > org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120) > at > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621) > > 2010-12-09 06:25:39,163 INFO org.apache.hadoop.hdfs.DFSClient: Error Recovery > for block blk_1141147913164105086_258331 waiting for responder to exit. 
> 2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: > DFSOutputStream ResponseProcessor exception for block > blk_-7076574401277381592_258324java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:180) > at java.io.DataInputStream.readLong(DataInputStream.java:399) > at > org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120) > at > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621) > > 2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery > for block blk_-7076574401277381592_258324 bad datanode[0] 10.99.97.106:50010 > 2010-12-09 06:25:39,164 INFO org.apache.zookeeper.ClientCnxn: Unable to read > additional data from server sessionid 0x12ca58c08ff004b, likely server has > closed socket, closing socket connection and attempting reconnect > 2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery > for block blk_-7076574401277381592_258324 in pipeline 10.99.97.106:50010, > 10.100.154.103:50010: bad datanode 10.99.97.106:50010 > 2010-12-09 06:25:39,165 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery > for block blk_1141147913164105086_258331 bad datanode[0] 10.99.97.106:50010 > 2010-12-09 06:25:39,167 INFO org.apache.zookeeper.ClientCnxn: Unable to read > additional data from server sessionid 0x12ca58c08ff004c, likely server has > closed socket, closing socket connection and attempting reconnect > 2010-12-09 06:25:39,169 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: /10.100.154.103:50020. Already tried 0 time(s). > 2010-12-09 06:25:39,169 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery > for block blk_1141147913164105086_258331 in pipeline 10.99.97.106:50010, > 10.100.159.13:50010: bad datanode 10.99.97.106:50010 > 2010-12-09 06:25:39,209 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP > 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: Stopping > server on 60020 > 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 3 on 60020: exiting > 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 9 on 60020: exiting > 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 24 on 60020: exiting > 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC > Server handler 1 on 60020: exiting > 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 15 on 60020: exiting > 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC > Server handler 0 on 60020: exiting > 2010-12-09 06:25:39,214 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC > Server handler 3 on 60020: exiting > 2010-12-09 06:25:39,214 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC > Server listener on 60020 > 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC > Server handler 2 on 60020: exiting > 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC > Server handler 7 on 60020: exiting > 2010-12-09 06:25:39,213 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 21 on 60020: exiting > 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC > Server handler 4 on 60020: exiting > 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC > Server Responder > 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC > Server handler 5 on 
60020: exiting > 2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC > Server handler 8 on 60020: exiting > 2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC > Server handler 9 on 60020: exiting > 2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC > Server handler 6 on 60020: exiting > 2010-12-09 06:25:39,216 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer > 2010-12-09 06:25:39,223 WARN org.apache.hadoop.hdfs.DFSClient: Failed > recovery attempt #0 from primary datanode 10.100.159.13:50010 > > > >
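
P.S. To confirm the GC theory, turn on GC logging on the region servers and line the pauses up against the slow hlog appends above. A minimal sketch for hbase-env.sh follows -- the flag set and the log path are my assumptions for a Sun JDK 1.6 setup, not something taken from your configs, so adjust for your boxes:

    # hbase-env.sh -- a sketch, not a drop-in; verify each flag against your JVM.
    # Log every collection with timestamps so a gap like the 182416ms append
    # can be matched to a specific GC pause.
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"
    # CMS starts collecting earlier instead of waiting for the heap to fill
    # and then falling back to a long stop-the-world full collection.
    export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70"

Any single pause in that log longer than the ZooKeeper session timeout (1 minute here) expires the region server's znode and the master declares it dead -- which is exactly the YouAreDeadException that 04.hadoop got.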