Hi,

We are facing problems with really slow HBase region server recoveries, taking around 20 minutes. The version is HBase 0.94.3 compiled with hadoop.profile=2.0.
The Hadoop version is CDH 4.2 with HDFS-3703 and HDFS-3912 patched in and the stale node timeouts configured correctly. The time for dead node detection is still 10 minutes.

We see that our region server is trying to read an HLog and is stuck there for a long time. Logs here:

2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.156.194.251:50010 for file /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599 for block BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.156.192.173:52818 remote=/10.156.194.251:50010]

I would have expected HDFS-3703 to make the server fail fast and move on to the third datanode. Currently, the recovery seems way too slow for production usage...

Varun
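
P.S. For reference, the 10-minute dead node figure above is consistent with the usual HDFS default arithmetic: twice the heartbeat recheck interval plus ten heartbeat intervals. Below is a rough sketch of that calculation. It assumes the stock defaults for dfs.namenode.heartbeat.recheck-interval (5 minutes) and dfs.heartbeat.interval (3 seconds), which may not match our cluster, and the function name is purely illustrative.

```python
def dead_node_timeout_ms(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Approximate time before the NameNode marks a DataNode as dead.

    Assumed defaults (not taken from our cluster config):
      recheck_interval_ms  -> dfs.namenode.heartbeat.recheck-interval (ms)
      heartbeat_interval_s -> dfs.heartbeat.interval (seconds)
    """
    # 2 * recheck interval + 10 * heartbeat interval, all in milliseconds
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


if __name__ == "__main__":
    timeout_ms = dead_node_timeout_ms()
    print(f"dead node timeout: {timeout_ms} ms ({timeout_ms / 60_000} minutes)")
```

With those assumed defaults this comes to 630000 ms, i.e. about 10.5 minutes, so the dead node timeout on its own is not shortened by the stale node settings.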