Thanks. Cloudera's release doesn't ship with the source code, but luckily we had the source when we wanted to test the 0.20.3 code.
Thanks again!

> Date: Wed, 3 Mar 2010 10:54:26 -0800
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> From: jdcry...@apache.org
> To: hbase-user@hadoop.apache.org
>
> So get the patch in your hbase root; on linux do:
> wget https://issues.apache.org/jira/secure/attachment/12436659/HBASE-2174_0.20.3.patch
>
> then run: patch -p0 < HBASE-2174_0.20.3.patch
>
> finally compile: ant tar
>
> The new tar will be in build/
>
> J-D
>
> On Wed, Mar 3, 2010 at 10:52 AM, Michael Segel
> <michael_se...@hotmail.com> wrote:
> >
> > Hey!
> >
> > Thanks for the responses.
> > It looks like the patch I was pointed to may solve the issue.
> >
> > We've had some network latency issues. Again, the 50ms was something I found
> > quickly in the logs, and if I had a failure after turning on all of the
> > debugging, I think I could have drilled down to the issue.
> >
> > I don't manage the DNS setup, so I can't say what's 'strange' or different.
> > The only thing that I know we did was set up a CNAME alias to the
> > Namenode and JobTracker to make it easier to 'hide' the cloud and give
> > developers an easy-to-remember name to point to. I don't think
> > that should cause it, although it could be something in how they set up
> > their reverse DNS. If the patch works, I'll be happy.
> >
> > Now for the $64,000 question.
> > Any pointers on how to apply the patch?
> > I'm just used to pulling the distro from the website...
> >
> > Thanks again!
> >
> > -Mike
> >
> >
> >> From: jl...@streamy.com
> >> To: hbase-user@hadoop.apache.org
> >> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >> Date: Wed, 3 Mar 2010 10:21:10 -0800
> >>
> >> What version of HBase are you running? There were some recent fixes
> >> related to DNS issues causing regionservers to check in to the master
> >> as a different name. Anything strange about the network or DNS setup
> >> of your cluster?
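For reference, J-D's three steps can be run as one short script from the top of the HBase source tree. This is just the same commands consolidated; the `~/hbase-0.20.3` directory name is an example, not something from the thread:

```shell
#!/bin/sh
# Apply HBASE-2174 to an HBase 0.20.3 source tree and rebuild the tarball.
# Run from the HBase source root; the path below is only an example.
set -e
cd ~/hbase-0.20.3

# Fetch the patch attached to the HBASE-2174 JIRA issue
wget https://issues.apache.org/jira/secure/attachment/12436659/HBASE-2174_0.20.3.patch

# -p0: the file paths inside the patch are relative to the source root
patch -p0 < HBASE-2174_0.20.3.patch

# Rebuild; the new tarball ends up under build/
ant tar
```

If `patch` reports rejected hunks (`.rej` files), the tree is probably not a clean 0.20.3 checkout and the patch should not be forced.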
> >>
> >> ZooKeeper is sensitive to pauses and network latency, as would be any
> >> fault-tolerant distributed system. ZK and HBase must determine when
> >> something has "failed", and the primary way is that it has not responded
> >> within some period of time. 50ms is negligible from a fault-detection
> >> standpoint, but 50 seconds is not.
> >>
> >> -----Original Message-----
> >> From: Michael Segel [mailto:michael_se...@hotmail.com]
> >> Sent: Wednesday, March 03, 2010 9:29 AM
> >> To: hbase-user@hadoop.apache.org
> >> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >>
> >> > Date: Wed, 3 Mar 2010 09:17:06 -0800
> >> > From: ph...@apache.org
> >> > To: hbase-user@hadoop.apache.org
> >> > Subject: Re: Trying to understand HBase/ZooKeeper Logs
> >> [SNIP]
> >> > There are a few issues involved with the ping time:
> >> >
> >> > 1) the network (obv :-) )
> >> > 2) the zk server - if the server is highly loaded, the pings may take
> >> > longer. The heartbeat is also a "health check" that the client is doing
> >> > against the server (as much as it is a "health check" for the server
> >> > that the client is still live). The HB is routed "all the way" through
> >> > the ZK server, i.e. through the processing pipeline. So if the server were
> >> > stalled, it would not respond immediately (vs., say, reading the HB at the
> >> > thread that reads data from the client). You can see the min/max/avg
> >> > request latencies on the zk server by using the "stat" 4letter word. See
> >> > the ZK admin docs on this: http://bit.ly/dglVld
> >> > 3) the zk client - clients can only process HB responses if they are
> >> > running. Say the JVM GC runs in blocking mode; this will block all
> >> > client threads (incl. the zk client thread) and the HB response will sit
> >> > until the GC is finished. This is why HBase RSs typically use very, very
> >> > large (from our, zk, perspective) session timeouts.
> >> >
> >> > 50ms is not long, btw.
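Patrick's "stat" 4-letter word can be sent over a plain TCP connection, e.g. with nc. The `localhost 2181` address below assumes a default standalone ZooKeeper; substitute your own quorum member:

```shell
# Query a ZooKeeper server's request-latency stats with the "stat"
# four-letter word. localhost:2181 is the default standalone address.
echo stat | nc localhost 2181

# The reply includes a latency line such as:
#   Latency min/avg/max: 0/1/23
# (values in ms). A max that approaches the session timeout suggests the
# server itself is stalling, per Patrick's point (2) above.
```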
> >> > I believe that the RSs are using 30sec timeouts.
> >> >
> >> > I can't shed direct light on this (i.e., what's the problem in hbase that
> >> > could cause your issue). I'll let jd/stack comment on that.
> >> >
> >> > Patrick
> >> >
> >>
> >> Thanks for the quick response.
> >>
> >> I'm trying to track down the issue of why we're getting a lot of 'partial'
> >> failures. Unfortunately this is currently a lot like watching a pot boil. :-(
> >>
> >> What I am calling a 'partial failure' is that the region servers are
> >> spawning second or even third instances, where only the last one appears
> >> to be live.
> >>
> >> From what I can tell, there's a spike of network activity that causes one
> >> of the processes to think that there is something wrong and spawn a new
> >> instance.
> >>
> >> Is this a good description?
> >>
> >> Because some of the failures occur late at night with no load on the
> >> system, I suspect that we have issues with the network, but I can't
> >> definitively say.
> >>
> >> Which process is the most sensitive to network latency issues?
> >>
> >> Sorry, still relatively new to HBase, and I'm trying to track down a nasty
> >> issue that causes HBase to fail on an almost regular basis. I think it's a
> >> networking issue, but I can't be sure.
> >>
> >> Thx
> >>
> >> -Mike
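Patrick's point about region servers requesting large session timeouts corresponds to a single HBase setting, `zookeeper.session.timeout`. A minimal sketch of the relevant `hbase-site.xml` entry; the 60-second value is illustrative only, not a recommendation from this thread:

```xml
<!-- hbase-site.xml: the ZK session timeout the region servers request.
     A large value gives the JVM room to finish a long stop-the-world GC
     before ZK declares the session dead and the RS is forced to restart.
     60000 ms is an example value, not a tuned recommendation. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>
```

Note the effective timeout is also bounded by what the ZK server allows (its tick-based min/max session limits), so raising this alone may not be sufficient.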