What seems common to all of the examples you have provided is that the region servers are not able to get new blocks from DFS for the write-ahead log. This is a difficult problem area for HBase, as it depends on the proper functioning of the underlying filesystem. We are working with core-dev on improving how the HDFS client library in general handles transient problems at the DFS layer.
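To make that dependency concrete, here is a minimal sketch (plain HDFS client code, not HBase internals; the output path and payload are made up) of the write path the region server's HLog sits on top of. Everything below the FSDataOutputStream, from block allocation at the namenode to streaming packets down the datanode pipeline, happens inside DFSClient, so pipeline failures like the ones in your log surface to HBase only as IOExceptions on an ordinary output stream.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalWriteSketch {
  public static void main(String[] args) throws Exception {
    // Picks up the Hadoop site configuration from the classpath; with an
    // hdfs:// default filesystem this gives a DistributedFileSystem.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Illustrative path only; the real HLog files live under the HBase root dir.
    FSDataOutputStream out = fs.create(new Path("/tmp/hlog-sketch"));
    try {
      // DFSClient's DataStreamer thread asks the namenode for blocks and
      // streams packets down the datanode pipeline. The "Error Recovery for
      // block ..." warnings in the log come from that machinery, and when
      // recovery gives up, the failure is reported to the caller here as an
      // IOException ("All datanodes ... are bad").
      out.write("one log edit".getBytes());
      out.flush();
    } finally {
      out.close();
    }
  }
}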
For your reference, the relevant issues:

https://issues.apache.org/jira/browse/HBASE-1084
https://issues.apache.org/jira/browse/HADOOP-4681
https://issues.apache.org/jira/browse/HADOOP-3185

and, on the related topic of avoiding data loss with the necessary fs layer primitives:

https://issues.apache.org/jira/browse/HADOOP-4379
https://issues.apache.org/jira/browse/HADOOP-5744

I have some operational experience with HBase and HDFS. Based on that experience, it looks like there are not adequate DFS resources deployed on your cluster for the filesystem to handle the load. You said you have one master and three regionservers? The regionservers are running on the same nodes as the DFS datanodes, I presume. Also MapReduce tasks? Also the program that is pounding the cluster with inserts? What is the hardware spec of those nodes? How many CPUs? How many cores? How much RAM? Can you consider adding additional nodes to spread the load on DFS? (A rough sketch of a throttled insert loop is appended after the quoted log at the end of this message.)

Best regards,

   - Andy

________________________________
From: llpind <sonny_h...@hotmail.com>
To: hbase-user@hadoop.apache.org
Sent: Tuesday, May 26, 2009 10:39:14 AM
Subject: Re: HBase looses regions.

Finally failed between 7M-8M records. Below is the last tail output. The other two region servers don't have much activity in their logs, but I can post those if necessary.

===================================================
2009-05-26 10:28:06,550 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_3233282543359573226_1303 bad datanode[0] 192.168.240.175:50010
2009-05-26 10:28:06,550 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_3233282543359573226_1303 in pipeline 192.168.240.175:50010, 192.168.240.180:50010: bad datanode 192.168.240.175:50010
2009-05-26 10:28:11,714 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.240.175:60733 remote=/192.168.240.180:50010]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2209)
2009-05-26 10:28:11,715 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_3233282543359573226_1311 bad datanode[0] 192.168.240.180:50010
2009-05-26 10:28:11,715 FATAL org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with ioe: java.io.IOException: All datanodes 192.168.240.180:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2444)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:1996)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
2009-05-26 10:28:11,716 FATAL org.apache.hadoop.hbase.regionserver.HLog: Could not append. Requesting close of log
java.io.IOException: All datanodes 192.168.240.180:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2444)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:1996)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
2009-05-26 10:28:11,717 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: java.io.IOException: All datanodes 192.168.240.180:50010 are bad. Aborting...
2009-05-26 10:28:11,726 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=2, stores=4, storefiles=6, storefileIndexSize=0, memcacheSize=40, usedHeap=94, maxHeap=2999
2009-05-26 10:28:11,726 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2009-05-26 10:28:11,726 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020, call batchUpdates([...@41fb8e4, [Lorg.apache.hadoop.hbase.io.BatchUpdate;@3ea382d9) from 192.168.240.152:17086: error: java.io.IOException: All datanodes 192.168.240.180:50010 are bad. Aborting...
java.io.IOException: All datanodes 192.168.240.180:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2444)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:1996)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
2009-05-26 10:28:12,894 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
2009-05-26 10:28:12,895 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60020: exiting
2009-05-26 10:28:12,895 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60020: exiting
2009-05-26 10:28:12,895 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020
2009-05-26 10:28:12,896 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60020: exiting
2009-05-26 10:28:12,896 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting
2009-05-26 10:28:12,896 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020: exiting
2009-05-26 10:28:12,897 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020: exiting
2009-05-26 10:28:12,897 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60020: exiting
2009-05-26 10:28:12,898 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60020: exiting
2009-05-26 10:28:12,898 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
2009-05-26 10:28:12,898 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020: exiting
2009-05-26 10:28:12,898 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020: exiting
2009-05-26 10:28:12,898 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2009-05-26 10:28:12,901 INFO org.mortbay.util.ThreadedServer: Stopping Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=60030]
2009-05-26 10:28:12,908 INFO org.mortbay.http.SocketListener: Stopped SocketListener on 0.0.0.0:60030
2009-05-26 10:28:13,345 INFO org.mortbay.util.Container: Stopped HttpContext[/logs,/logs]
2009-05-26 10:28:13,346 INFO org.mortbay.util.Container: Stopped org.mortbay.jetty.servlet.webapplicationhand...@6ad3c65d
2009-05-26 10:28:13,687 INFO org.mortbay.util.Container: Stopped WebApplicationContext[/static,/static]
2009-05-26 10:28:13,687 INFO org.mortbay.util.Container: Stopped org.mortbay.jetty.servlet.webapplicationhand...@3adec8b3
2009-05-26 10:28:14,039 INFO org.mortbay.util.Container: Stopped WebApplicationContext[/,/]
2009-05-26 10:28:14,040 INFO org.mortbay.util.Container: Stopped org.mortbay.jetty.ser...@6e79839
2009-05-26 10:28:14,040 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
2009-05-26 10:28:14,040 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
2009-05-26 10:28:14,040 INFO org.apache.hadoop.hbase.regionserver.MemcacheFlusher: regionserver/0.0.0.0:60020.cacheFlusher exiting
2009-05-26 10:28:14,040 INFO org.apache.hadoop.hbase.regionserver.LogFlusher: regionserver/0.0.0.0:60020.logFlusher exiting
2009-05-26 10:28:14,040 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver/0.0.0.0:60020.compactor exiting
2009-05-26 10:28:14,040 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: regionserver/0.0.0.0:60020.majorCompactionChecker exiting
2009-05-26 10:28:14,041 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TableA,ROW_KEY,1243357190459
2009-05-26 10:28:14,041 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TableA,,1243357190459
2009-05-26 10:28:14,041 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: 192.168.240.175:60020
2009-05-26 10:28:14,044 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/0.0.0.0:60020 exiting
2009-05-26 10:28:14,270 INFO org.apache.hadoop.hbase.Leases: regionserver/0.0.0.0:60020.leaseChecker closing leases
2009-05-26 10:28:14,271 INFO org.apache.hadoop.hbase.Leases: regionserver/0.0.0.0:60020.leaseChecker closed leases
2009-05-26 10:28:14,273 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
2009-05-26 10:28:14,273 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
===================================================

--
View this message in context: http://www.nabble.com/HBase-looses-regions.-tp23657983p23727987.html
Sent from the HBase User mailing list archive at Nabble.com.
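For the insert program discussed in the reply above, here is a rough, hypothetical sketch of a throttled loader written against the 0.19-era client API (the same BatchUpdate class visible in the batchUpdates call in the log). The table name TableA comes from the log; the column name, batch interval, and pause length are invented, and whether client-side throttling helps at all depends on where the cluster is actually bottlenecked.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class ThrottledLoader {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath.
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "TableA");

    for (long i = 0; i < 8000000L; i++) {
      // "data:value" is an invented family:qualifier; substitute the real schema.
      BatchUpdate update = new BatchUpdate("row-" + i);
      update.put("data:value", ("value-" + i).getBytes());
      table.commit(update);

      // Pause periodically so the regionservers and datanodes get a chance to
      // flush, roll logs, and compact instead of being hit continuously.
      if (i > 0 && i % 10000 == 0) {
        Thread.sleep(2000);
      }
    }
  }
}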