Our heap size is set to 2GB, so I think my dev issue was because I was running things off of a few VMs. Even though the compaction runs in another thread, the region server would still fail to respond during a major compaction.
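
For anyone hitting the same "marked as dead during compaction" behavior, one knob that matters here is the ZooKeeper session timeout: if a region server is too busy or paused (GC, heavy compaction) to heartbeat before its session expires, the master declares it dead. A minimal hbase-site.xml sketch, assuming HBase 0.90.x where the property is zookeeper.session.timeout (milliseconds); the value below is only illustrative, not a tested recommendation:

<!-- conf/hbase-site.xml: how long ZooKeeper waits before expiring a region
     server's session. Value is in milliseconds; 180000 (3 minutes) is only
     an illustrative figure, not a tested recommendation. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>180000</value>
</property>

Whether a larger value actually takes effect also depends on the ZooKeeper server side, since its own maxSessionTimeout caps what a client can request.
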
-Ben

On Thu, Apr 14, 2011 at 11:26 AM, Jean-Daniel Cryans <[email protected]> wrote:

> Ben, the compaction is done in a background thread, it doesn't block
> anything. Now if you had a heap close to 2GB, you could easily run
> into issues.
>
> J-D
>
> On Thu, Apr 14, 2011 at 10:23 AM, Ben Aldrich <[email protected]> wrote:
> > Just to chime in here, the other thing we changed was our max_file_size
> > is now set to 2gb instead of 512mb. This could be causing long
> > compaction times. If a compaction takes too long it won't respond and
> > can be marked as dead. I have had this happen on my dev cluster a few
> > times.
> >
> > -Ben
> >
> > On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <[email protected]> wrote:
> >
> >> This is probably a red herring; for example, if the region server had
> >> a big GC pause then the master could have already split the log and
> >> the region server wouldn't be able to close it (that's our version of
> >> IO fencing). So from that exception, look back in the log and see if
> >> there's anything like:
> >>
> >> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
> >> not heard from server in some_big_number ms
> >>
> >> J-D
> >>
> >> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
> >> <[email protected]> wrote:
> >> >
> >> > Thanks for the response, Stack. Yes, we tried increasing
> >> > dfs.datanode.handler.count to 8. At this point I would say it didn't
> >> > seem to resolve the issue we are seeing, but it also doesn't seem to
> >> > be hurting anything, so for right now we're going to leave it at 8
> >> > while we continue to debug.
> >> >
> >> > In regard to the original error I posted ( Block 'x' is not valid ),
> >> > we have chased that down thanks to your suggestion of looking at the
> >> > logs for the history of the block. It _looks_ like our 'is not valid'
> >> > block errors are unrelated and due to chmod-ing or deleting mapreduce
> >> > output directories directly after a run. We are still isolating that,
> >> > but it looks like it's not HBase related, so I'll move that to
> >> > another list. Thank you very much for your debugging suggestions.
> >> >
> >> > The one issue we are still seeing is that we will occasionally have a
> >> > regionserver die with the following exception. I need to chase that
> >> > down a little more, but it seems similar to a post from 2/13/2011 (
> >> > http://www.mail-archive.com/[email protected]/msg05550.html ) that
> >> > I'm not sure was ever resolved. If anyone has any insight on how to
> >> > debug the following error a little more, I would appreciate any
> >> > thoughts you might have.
> >> >
> >> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /user/hbase/.logs/hd10.dfs.returnpath.net,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 :
> >> > java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed because recovery from primary datanode 10.18.0.16:50010 failed 6 times. Pipeline was 10.18.0.16:50010. Aborting...
> >> > java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed because recovery from primary datanode 10.18.0.16:50010 failed 6 times. Pipeline was 10.18.0.16:50010. Aborting...
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
> >> >
> >> > Other than the above exception causing a region server to die
> >> > occasionally, everything seems to be working well.
> >> >
> >> > Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop
> >> > 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above
> >> > exception. We do have ulimit set ( memory unlimited and files 32k )
> >> > for the user running hbase.
> >> >
> >> > Thanks again for your help
> >> >
> >> > Andy
> >> >
> >> > -----Original Message-----
> >> > From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> >> > Sent: Sunday, April 10, 2011 1:16 PM
> >> > To: [email protected]
> >> > Cc: Andy Sautins
> >> > Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...
> >> >
> >> > Did you try upping it, Andy? Andrew Purtell's recommendation, though
> >> > old, would have come of experience. The Intel article reads like
> >> > sales, but there is probably merit to its suggestion. The Cloudera
> >> > article is more unsure about the effect of upping handlers, though it
> >> > allows "...could be set a bit higher."
> >> >
> >> > I just looked at our prod frontend and it's set to 3 still. I don't
> >> > see your exceptions in our DN log.
> >> >
> >> > What version of hadoop? You say hbase 0.91. You mean 0.90.1?
> >> >
> >> > ulimit and nproc are set sufficiently high for the hadoop/hbase user?
> >> >
> >> > If you grep 163126943925471435_28809750 in the namenode log, do you
> >> > see a delete occur before a later open?
> >> >
> >> > St.Ack
> >> >
> >> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins
> >> > <[email protected]> wrote:
> >> >>
> >> >> I ran across a mailing list posting from 1/4/2009 that seemed to
> >> >> indicate increasing dfs.datanode.handler.count could help improve
> >> >> DFS stability (
> >> >> http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%[email protected]%3E ).
> >> >> The posting seems to indicate the wiki was updated, but I don't see
> >> >> anything in the wiki about increasing dfs.datanode.handler.count. I
> >> >> have seen a few other notes that show examples with a raised
> >> >> dfs.datanode.handler.count, including an Intel article (
> >> >> http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/ )
> >> >> and the Pro Hadoop book, but other than that the only other mention
> >> >> I see, from Cloudera, seems lukewarm on increasing
> >> >> dfs.datanode.handler.count (
> >> >> http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/ ).
> >> >>
> >> >> Given the post is from 2009, I thought I'd ask if anyone has had any
> >> >> success improving the stability of HBase/DFS when increasing
> >> >> dfs.datanode.handler.count. The specific error we are seeing
> >> >> somewhat frequently ( a few hundred times per day ) in the datanode
> >> >> logs is as follows:
> >> >>
> >> >> 2011-04-09 00:12:48,035 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.0.33:50010, storageID=DS-1501576934-10.18.0.33-50010-1296248656454, infoPort=50075, ipcPort=50020):DataXceiver
> >> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
> >> >>
> >> >> The above seems to correspond to ClosedChannelExceptions in the
> >> >> hbase regionserver logs, as well as some warnings about long writes
> >> >> to the hlog ( some in the 50+ second range ).
> >> >>
> >> >> The biggest end-user facing issue we are seeing is that Task
> >> >> Trackers keep getting blacklisted. It's quite possible our problem
> >> >> is unrelated to anything HBase, but I thought it was worth asking
> >> >> given what we've been seeing.
> >> >>
> >> >> We are currently running 0.91 on an 18 node cluster with ~3k total
> >> >> regions and each region server is running with 2G of memory.
> >> >>
> >> >> Any insight would be appreciated.
> >> >>
> >> >> Thanks
> >> >>
> >> >> Andy
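
For reference, the datanode setting discussed throughout this thread lives in hdfs-site.xml. A minimal sketch; the handler count of 8 simply mirrors what Andy reports trying above (the stock default is 3), and is illustrative rather than a tuned recommendation:

<!-- conf/hdfs-site.xml on each datanode: number of server threads handling
     datanode IPC requests. The value 8 mirrors what was tried in this thread
     (default is 3); it is illustrative, not a tuned recommendation. -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>8</value>
</property>

The change only takes effect after the datanodes are restarted.
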
