I ran across a mailing list posting from January 4, 2009 that seemed to
indicate increasing dfs.datanode.handler.count could help improve DFS
stability
(http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%[email protected]%3E
). The posting seems to indicate the wiki was updated, but I don't see
anything in the wiki about increasing dfs.datanode.handler.count. I have seen
a few other notes with examples that raise dfs.datanode.handler.count,
including an Intel article
(http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/
) and the Pro Hadoop book. Beyond those, the only other mention I see is a
Cloudera blog post that seems lukewarm on increasing
dfs.datanode.handler.count
(http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/
).
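For reference, here is roughly what I assume the change would look like in
hdfs-site.xml on each datanode (the value of 10 below is just an example, not
a recommendation; I believe the default is 3):

<property>
  <name>dfs.datanode.handler.count</name>
  <value>10</value>
</property>

I assume the datanodes would need to be restarted to pick this up.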
Given the post is from 2009, I thought I'd ask whether anyone has had success
improving HBase/DFS stability by increasing dfs.datanode.handler.count.
The specific error we are seeing somewhat frequently (a few hundred times per
day) in the datanode logs is as follows:
2011-04-09 00:12:48,035 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.18.0.33:50010,
storageID=DS-1501576934-10.18.0.33-50010-1296248656454, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
The above seems to correspond to ClosedChannelExceptions in the HBase region
server logs, as well as some warnings about long writes to the HLog (some
taking 50+ seconds).
The biggest end-user-facing issue we are seeing is that TaskTrackers keep
getting blacklisted. It's quite possible our problem is unrelated to HBase,
but I thought it was worth asking given what we've been seeing.
We are currently running HBase 0.91 on an 18-node cluster with ~3k total
regions, and each region server is running with 2G of memory.
Any insight would be appreciated.
Thanks
Andy