Thanks for the response, Stack.  Yes, we tried increasing 
dfs.datanode.handler.count to 8.   At this point I would say it didn't seem to 
resolve the issue we are seeing, but it also doesn't seem to be hurting 
anything, so for right now we're going to leave it at 8 while we continue to 
debug.

  In regard to the original error I posted ( Block 'x' is not valid ) we have 
chased that down thanks to your suggestion of looking at the logs for the 
history of the block.  It _looks_ like our 'is not valid' block errors are 
unrelated and due to chmod or deleting mapreduce output directories directly 
after a run.  We are still isolating that but it looks like it's not HBase 
related so I'll move that to another list.  Thank you very much for your 
debugging suggestions.
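
  For anyone following along, the "history of the block" check amounts to pulling every log line that mentions the block id and reading them in order.  Below is a minimal sketch of that, assuming the logs are plain text files; the function names are made up for illustration, and it only does a substring match rather than parsing any particular log format.

```python
from typing import Iterable, List


def block_history(lines: Iterable[str], block_id: str) -> List[str]:
    """Return every line that mentions block_id, preserving file order.

    This is only a substring match (roughly what grep does); it makes no
    assumption about how the log lines themselves are laid out.
    """
    return [line.rstrip("\n") for line in lines if block_id in line]


def print_block_history(path: str, block_id: str) -> None:
    """Print the matching lines from one log file (the path is supplied by the caller)."""
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with open(path, errors="replace") as handle:
        for hit in block_history(handle, block_id):
            print(hit)
```

  Calling print_block_history() with the namenode log path and an id such as 163126943925471435_28809750 lists the matching lines oldest first, which makes it easy to see whether a delete shows up before a later open.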

   The one issue we are still seeing is that we will occasionally have a 
regionserver die with the following exception.  I need to chase that down a 
little more but it seems similar to a post from 2/13/2011 
(http://www.mail-archive.com/[email protected]/msg05550.html ) that I'm not 
sure was ever resolved.  If anyone has any insight on how to debug the 
following error a little more I would appreciate any thoughts you might have.

2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception 
closing file 
/user/hbase/.logs/hd10.dfs.returnpath.net,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921
 : java.io.IOException: Error Recovery for block 
blk_1315316969665710488_29842654 failed  because recovery from primary datanode 
10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010. Aborting...
java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 
failed  because recovery from primary datanode 10.18.0.16:50010 failed 6 times. 
 Pipeline was 10.18.0.16:50010. Aborting...
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)

Other than the above exception causing a region server to die occasionally 
everything seems to be working well.

Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop 
0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above exception.  We 
do have ulimit set ( memory unlimited and files 32k ) for the user running 
hbase.  

Thanks again for your help

 Andy

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: Sunday, April 10, 2011 1:16 PM
To: [email protected]
Cc: Andy Sautins
Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...

Did you try upping it, Andy?  Andrew Purtell's recommendation, though old, would 
have come from experience.  The Intel article reads like sales but there is 
probably merit to its suggestion.  The Cloudera article is more unsure about 
the effect of upping handlers though it allows "...could be set a bit higher."

I just looked at our prod frontend and it's set to 3 still.  I don't see your 
exceptions in our DN log.

What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?

ulimit and nproc are set sufficiently high for hadoop/hbase user?

If you grep 163126943925471435_28809750 in namenode log, do you see a delete 
occur before a later open?

St.Ack

On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <[email protected]> 
wrote:
>
>    I ran across a mailing list posting from 1/4/2009 that seemed to indicate 
> increasing dfs.datanode.handler.count could help improve DFS stability 
> (http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%[email protected]%3E
>  ).  The posting seems to indicate the wiki was updated, but I don't see 
> anything in the wiki about increasing dfs.datanode.handler.count.   I have 
> seen a few other notes that seem to show examples that have raised 
> dfs.datanode.handler.count including one from an Intel article 
> (http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/
>  ) and the Pro Hadoop book, but other than that the only other mention I see 
> is from Cloudera, which seems lukewarm on increasing dfs.datanode.handler.count 
> (http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/
>  ).
>
>    Given the post is from 2009 I thought I'd ask if anyone has had any 
> success improving stability of HBase/DFS when increasing 
> dfs.datanode.handler.count.  The specific error we are seeing somewhat 
> frequently ( a few hundred times per day ) in the datanode logs is as follows:
>
> 2011-04-09 00:12:48,035 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DatanodeRegistration(10.18.0.33:50010, 
> storageID=DS-1501576934-10.18.0.33-50010-1296248656454, 
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
>
>   The above seems to correspond to ClosedChannelExceptions in the hbase 
> regionserver logs as well as some warnings about long writes to hlog ( some of 
> 50+ seconds ).
>
>    The biggest end-user facing issue we are seeing is that Task Trackers keep 
> getting blacklisted.  It's quite possible our problem is unrelated to 
> anything HBase, but I thought it was worth asking given what we've been 
> seeing.
>
>   We are currently running 0.91 on an 18 node cluster with ~3k total regions 
> and each region server is running with 2G of memory.
>
>   Any insight would be appreciated.
>
>   Thanks
>
>    Andy
>
