What is common to all of the examples you have provided is that the region 
servers are unable to get new blocks from DFS for the write-ahead log. This 
is a difficult problem area for HBase because it depends on the proper 
functioning of the underlying filesystem. We are working with core-dev on 
improving how the HDFS client library handles transient problems at the DFS 
layer in general. For your reference:

    https://issues.apache.org/jira/browse/HBASE-1084
    https://issues.apache.org/jira/browse/HADOOP-4681
    https://issues.apache.org/jira/browse/HADOOP-3185

and, on a related topic, avoiding data loss with the necessary filesystem-layer 
primitives:

    https://issues.apache.org/jira/browse/HADOOP-4379
    https://issues.apache.org/jira/browse/HADOOP-5744
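
In the meantime, on the operational side, there are a few datanode limits that 
HBase tends to exhaust long before a plain mapreduce workload does. The snippet 
below is only a sketch of the kind of hdfs-site.xml changes people commonly try 
for the timeout and "All datanodes ... are bad" symptoms in your log; the 
property names are real, but the values are illustrative rather than tuned 
recommendations, so check them against the defaults shipped with your Hadoop 
version (the xcievers spelling is intentional; it matches the property name):

    <!-- hdfs-site.xml on the datanodes; restart the datanodes after editing.
         Values are illustrative only. -->

    <!-- Each block being read or written ties up xceiver threads on the
         datanode; the default cap of 256 is easily exhausted by HBase. -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>2048</value>
    </property>

    <!-- More datanode server threads for block requests (default is 3). -->
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>10</value>
    </property>

    <!-- 0 disables the socket write timeout on the client-to-datanode
         pipeline; some HBase deployments run this way to ride over slow
         datanodes rather than aborting the write. -->
    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>0</value>
    </property>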

I have some operational experience with HBase and HDFS, and based on that 
experience it appears there are not adequate DFS resources deployed on your 
cluster for the filesystem to handle the load. You said you have one master and 
three regionservers? I presume the regionservers are running on the same nodes 
as the DFS datanodes. Are mapreduce tasks also running there? And the program 
that is pounding the cluster with inserts? What is the hardware spec of those 
nodes? How many CPUs? How many cores? How much RAM? Can you consider adding 
nodes to spread the load on DFS?
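
Also, to put numbers behind the capacity question, the standard dfsadmin report 
shows how many datanodes are live and how much DFS space and load each one is 
carrying. Something like the following, run from the Hadoop install directory 
on the namenode (paths assume a stock tarball layout and default log 
locations), would be a reasonable first look:

    # Configured/remaining DFS capacity and per-datanode status
    bin/hadoop dfsadmin -report

    # On the node the client reports as bad (192.168.240.180 in your log),
    # check the datanode log around the time of the failure
    tail -200 logs/hadoop-*-datanode-*.log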

Best regards,

   - Andy

________________________________
From: llpind <sonny_h...@hotmail.com>
To: hbase-user@hadoop.apache.org
Sent: Tuesday, May 26, 2009 10:39:14 AM
Subject: Re: HBase looses regions.


It finally failed between 7M and 8M records.  Below is the last tail output.  The
other two region servers don't have much activity in their logs, but I can post
those if necessary. 

===================================================

2009-05-26 10:28:06,550 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_3233282543359573226_1303 bad datanode[0]
192.168.240.175:50010
2009-05-26 10:28:06,550 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_3233282543359573226_1303 in pipeline
192.168.240.175:50010, 192.168.240.180:50010: bad datanode
192.168.240.175:50010
2009-05-26 10:28:11,714 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
Exception: java.net.SocketTimeoutException: 5000 millis timeout while
waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/192.168.240.175:60733
remote=/192.168.240.180:50010]
        at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162)
        at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2209)

2009-05-26 10:28:11,715 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_3233282543359573226_1311 bad datanode[0]
192.168.240.180:50010
2009-05-26 10:28:11,715 FATAL
org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with ioe:
java.io.IOException: All datanodes 192.168.240.180:50010 are bad.
Aborting...
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2444)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:1996)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
2009-05-26 10:28:11,716 FATAL org.apache.hadoop.hbase.regionserver.HLog:
Could not append. Requesting close of log
java.io.IOException: All datanodes 192.168.240.180:50010 are bad.
Aborting...
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2444)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:1996)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
2009-05-26 10:28:11,717 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: java.io.IOException: All
datanodes 192.168.240.180:50010 are bad. Aborting...
2009-05-26 10:28:11,726 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=0.0, regions=2, stores=4, storefiles=6, storefileIndexSize=0,
memcacheSize=40, usedHeap=94, maxHeap=2999
2009-05-26 10:28:11,726 INFO org.apache.hadoop.hbase.regionserver.LogRoller:
LogRoller exiting.
2009-05-26 10:28:11,726 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 5 on 60020, call batchUpdates([...@41fb8e4,
[Lorg.apache.hadoop.hbase.io.BatchUpdate;@3ea382d9) from
192.168.240.152:17086: error: java.io.IOException: All datanodes
192.168.240.180:50010 are bad. Aborting...
java.io.IOException: All datanodes 192.168.240.180:50010 are bad.
Aborting...
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2444)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:1996)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
2009-05-26 10:28:12,894 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60020
2009-05-26 10:28:12,895 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 2 on 60020: exiting
2009-05-26 10:28:12,895 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 6 on 60020: exiting
2009-05-26 10:28:12,895 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
Server listener on 60020
2009-05-26 10:28:12,896 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 0 on 60020: exiting
2009-05-26 10:28:12,896 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 1 on 60020: exiting
2009-05-26 10:28:12,896 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 3 on 60020: exiting
2009-05-26 10:28:12,897 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 5 on 60020: exiting
2009-05-26 10:28:12,897 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 4 on 60020: exiting
2009-05-26 10:28:12,898 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 7 on 60020: exiting
2009-05-26 10:28:12,898 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
Server Responder
2009-05-26 10:28:12,898 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 8 on 60020: exiting
2009-05-26 10:28:12,898 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 9 on 60020: exiting
2009-05-26 10:28:12,898 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2009-05-26 10:28:12,901 INFO org.mortbay.util.ThreadedServer: Stopping
Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=60030]
2009-05-26 10:28:12,908 INFO org.mortbay.http.SocketListener: Stopped
SocketListener on 0.0.0.0:60030
2009-05-26 10:28:13,345 INFO org.mortbay.util.Container: Stopped
HttpContext[/logs,/logs]
2009-05-26 10:28:13,346 INFO org.mortbay.util.Container: Stopped
org.mortbay.jetty.servlet.webapplicationhand...@6ad3c65d
2009-05-26 10:28:13,687 INFO org.mortbay.util.Container: Stopped
WebApplicationContext[/static,/static]
2009-05-26 10:28:13,687 INFO org.mortbay.util.Container: Stopped
org.mortbay.jetty.servlet.webapplicationhand...@3adec8b3
2009-05-26 10:28:14,039 INFO org.mortbay.util.Container: Stopped
WebApplicationContext[/,/]
2009-05-26 10:28:14,040 INFO org.mortbay.util.Container: Stopped
org.mortbay.jetty.ser...@6e79839
2009-05-26 10:28:14,040 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
2009-05-26 10:28:14,040 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
2009-05-26 10:28:14,040 INFO
org.apache.hadoop.hbase.regionserver.MemcacheFlusher:
regionserver/0.0.0.0:60020.cacheFlusher exiting
2009-05-26 10:28:14,040 INFO
org.apache.hadoop.hbase.regionserver.LogFlusher:
regionserver/0.0.0.0:60020.logFlusher exiting
2009-05-26 10:28:14,040 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread:
regionserver/0.0.0.0:60020.compactor exiting
2009-05-26 10:28:14,040 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker:
regionserver/0.0.0.0:60020.majorCompactionChecker exiting
2009-05-26 10:28:14,041 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed TableA,ROW_KEY,1243357190459
2009-05-26 10:28:14,041 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed TableA,,1243357190459
2009-05-26 10:28:14,041 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at:
192.168.240.175:60020
2009-05-26 10:28:14,044 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer:
regionserver/0.0.0.0:60020 exiting
2009-05-26 10:28:14,270 INFO org.apache.hadoop.hbase.Leases:
regionserver/0.0.0.0:60020.leaseChecker closing leases
2009-05-26 10:28:14,271 INFO org.apache.hadoop.hbase.Leases:
regionserver/0.0.0.0:60020.leaseChecker closed leases
2009-05-26 10:28:14,273 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown
thread.
2009-05-26 10:28:14,273 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete


===================================================
-- 
View this message in context: 
http://www.nabble.com/HBase-looses-regions.-tp23657983p23727987.html
Sent from the HBase User mailing list archive at Nabble.com.