Just to chime in here: the other thing we changed was our max_file_size,
which is now set to 2 GB instead of 512 MB. This could be causing long
compaction times. If a compaction takes too long, the region server won't
respond and can be marked as dead. I have had this happen on my dev cluster
a few times.
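One quick way to check is to grep the region server log for compaction messages and eyeball the reported durations. The snippet below is only a sketch: it writes a fabricated sample line (the real message format and log path depend on your HBase version) and then greps it the way you would grep the real log.

```shell
# Fabricated sample log line for illustration; in practice, grep the actual
# region server log and look at the duration each compaction reports.
printf '2011-04-14 05:59:01 INFO CompactSplitThread: completed compaction on region foo; duration=9mins, 12sec\n' > /tmp/rs-compactions.log
grep -i "compaction" /tmp/rs-compactions.log
```

Compactions running many minutes on a busy region server are a hint that the larger max file size is in play.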

-Ben

On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <[email protected]> wrote:

> This is probably a red herring; for example, if the region server had a
> big GC pause, then the master could have already split the log and the
> region server wouldn't be able to close it (that's our version of IO
> fencing). So from that exception, look back in the log and see if
> there's anything like:
>
> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
> not heard from server in some_big_number ms
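A quick way to run that check, sketched below with a fabricated log line (the real region server log path varies by install):

```shell
# Write a sample line like the one J-D describes, then grep for it the way
# you would grep the real region server log after the crash.
printf 'INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 42312ms\n' > /tmp/rs-zk.log
grep "Client session timed out" /tmp/rs-zk.log
```

If that line shows up shortly before the abort, a GC pause (or other stall) blowing the ZooKeeper session is the likely root cause.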
>
> J-D
>
> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
> <[email protected]> wrote:
> >
> >  Thanks for the response, Stack.  Yes, we tried increasing
> dfs.datanode.handler.count to 8.  At this point I would say it didn't seem
> to resolve the issue we are seeing, but it also doesn't seem to be
> hurting anything, so for right now we're going to leave it at 8 while we
> continue to debug.
> >
> >  In regard to the original error I posted ( Block 'x' is not valid ) we
> have chased that down thanks to your suggestion of looking at the logs for
> the history of the block.  It _looks_ like our 'is not valid' block errors
> are unrelated, and due to chmod-ing or deleting mapreduce output directories
> directly after a run.  We are still isolating that, but it looks like it's
> not HBase related, so I'll move that to another list.  Thank you very much
> for your debugging suggestions.
> >
> >   The one issue we are still seeing is that we will occasionally have a
> regionserver die with the following exception.  I need to chase that down a
> little more, but it seems similar to a post from 2/13/2011 (
> http://www.mail-archive.com/[email protected]/msg05550.html ) that I'm
> not sure was ever resolved.  If anyone has any insight on how to
> debug the following error a little more, I would appreciate any thoughts
> you might have.
> >
> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
> closing file /user/hbase/.logs/hd10.dfs.returnpath.net
> ,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 :
> java.io.IOException: Error Recovery for block
> blk_1315316969665710488_29842654 failed  because recovery from primary
> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010.
> Aborting...
> > java.io.IOException: Error Recovery for block
> blk_1315316969665710488_29842654 failed  because recovery from primary
> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010.
> Aborting...
> >        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
> >        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
> >
> > Other than the above exception occasionally causing a region server to
> die, everything seems to be working well.
> >
> > Note that we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop
> 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above exception.
>  We do have ulimit set ( memory unlimited and files 32k ) for the user
> running hbase.
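For anyone else checking the same thing, the limits in effect can be confirmed from a shell running as the HBase user; the values below are the ones this thread assumes (unlimited memory, 32k open files):

```shell
# Show the limits in effect for the current user. For HBase we want
# max open files to be high (e.g. 32768) and max memory size unlimited.
ulimit -n   # max open files
ulimit -m   # max memory size
```

Note these are per-user and per-shell; the values that matter are the ones in effect for the process that actually starts the region server.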
> >
> > Thanks again for your help
> >
> >  Andy
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of
> Stack
> > Sent: Sunday, April 10, 2011 1:16 PM
> > To: [email protected]
> > Cc: Andy Sautins
> > Subject: Re: DFS stability running HBase and
> dfs.datanode.handler.count...
> >
> > Did you try upping it, Andy?  Andrew Purtell's recommendation, though old,
> would have come from experience.  The Intel article reads like sales, but
> there is probably merit to its suggestion.  The Cloudera article is more
> unsure about the effect of upping handlers, though it allows that it
> "...could be set a bit higher."
> >
> > I just looked at our prod frontend and it's still set to 3.  I don't see
> your exceptions in our DN log.
> >
> > What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?
> >
> > ulimit and nproc are set sufficiently high for hadoop/hbase user?
> >
> > If you grep for 163126943925471435_28809750 in the namenode log, do you
> see a delete occur before a later open?
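That check can be sketched like this; the sample log lines and path below are fabricated for illustration (real FSNamesystem message formats differ by Hadoop version), but the idea is to list every line mentioning the block and read the events in time order:

```shell
# Write two fabricated namenode log lines for the block, then grep them the
# way you would grep the real namenode log: a delete appearing before a
# later open/allocate would explain the "is not valid" errors.
printf '%s\n' \
  '2011-04-09 00:10:01 INFO FSNamesystem: allocateBlock: blk_-163126943925471435_28809750' \
  '2011-04-09 00:11:30 INFO FSNamesystem: delete: blk_-163126943925471435_28809750' \
  > /tmp/namenode.log
grep "163126943925471435_28809750" /tmp/namenode.log
```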
> >
> > St.Ack
> >
> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <
> [email protected]> wrote:
> >>
> >>    I ran across a mailing list posting from 1/4/2009 that seemed to
> indicate increasing dfs.datanode.handler.count could help improve DFS
> stability (
> http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%[email protected]%3E).
>   The posting seems to indicate the wiki was updated, but I don't see
> anything in the wiki about increasing dfs.datanode.handler.count.   I have
> seen a few other notes that show examples raising
> dfs.datanode.handler.count, including one from an Intel article (
> http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/)
>  and the Pro Hadoop book, but other than that the only mention I see
> is from Cloudera, which seems lukewarm on increasing dfs.datanode.handler.count (
> http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/).
> >>
> >>    Given the post is from 2009, I thought I'd ask if anyone has had any
> success improving the stability of HBase/DFS by increasing
> dfs.datanode.handler.count.  The specific error we are seeing somewhat
>  frequently ( a few hundred times per day ) in the datanode logs is as
> follows:
> >>
> >> 2011-04-09 00:12:48,035 ERROR
> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(10.18.0.33:50010,
> >> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not
> valid.
> >>
> >>   The above seems to correspond to ClosedChannelExceptions in the hbase
> regionserver logs, as well as some warnings about long writes to the hlog
> ( some in the 50+ second range ).
> >>
> >>    The biggest end-user-facing issue we are seeing is that Task Trackers
> keep getting blacklisted.  It's quite possible our problem is unrelated to
> HBase, but I thought it was worth asking given what we've been
> seeing.
> >>
> >>   We are currently running 0.91 on an 18 node cluster with ~3k total
> regions and each region server is running with 2G of memory.
> >>
> >>   Any insight would be appreciated.
> >>
> >>   Thanks
> >>
> >>    Andy
> >>
> >
>
