Thanks. Cloudera's release doesn't ship with the source code, but luckily we had the source when we wanted to test the 0.20.3 code.
Thanks again!

> Date: Wed, 3 Mar 2010 10:54:26 -0800
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> From: jdcry...@apache.org
> To: hbase-user@hadoop.apache.org
>
> So get the patch in your hbase root; on linux do:
> wget https://issues.apache.org/jira/secure/attachment/12436659/HBASE-2174_0.20.3.patch
>
> then run: patch -p0 < HBASE-2174_0.20.3.patch
>
> finally compile: ant tar
>
> The new tar will be in build/
>
> J-D
>
> On Wed, Mar 3, 2010 at 10:52 AM, Michael Segel
> <michael_se...@hotmail.com> wrote:
> >
> > Hey!
> >
> > Thanks for the responses.
> > It looks like the patch I was pointed to may solve the issue.
> >
> > We've had some network latency issues. Again, the 50ms was something I found
> > quickly in the logs, and if I had a failure after turning on all of the
> > debugging, I think I could have drilled down to the issue.
> >
> > I don't manage the DNS setup, so I can't say what's 'strange' or different.
> > The only thing that I know we did was set up a CNAME alias to the
> > Namenode and JobTracker to make it easier to 'hide' the cloud and give
> > developers an easy-to-remember name to point to. I don't think
> > that should cause it, although it could be something in how they set up
> > their reverse DNS. If the patch works, I'll be happy.
> >
> > Now for the $64,000 question.
> > Any pointers on how to apply the patch?
> > I'm just used to pulling the distro from the website...
> >
> > Thanks again!
> >
> > -Mike
> >
> >
> >> From: jl...@streamy.com
> >> To: hbase-user@hadoop.apache.org
> >> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >> Date: Wed, 3 Mar 2010 10:21:10 -0800
> >>
> >> What version of HBase are you running? There were some recent fixes
> >> related to DNS issues causing regionservers to check in to the master
> >> as a different name. Anything strange about the network or DNS setup
> >> of your cluster?
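For reference, J-D's three steps can be run as one short script from the top of the HBase source tree. This is just the same commands consolidated; the `~/hbase-0.20.3` directory name is an example, not something from the thread:

```shell
#!/bin/sh
# Apply HBASE-2174 to an HBase 0.20.3 source tree and rebuild the tarball.
# Run from the HBase source root; the path below is only an example.
set -e
cd ~/hbase-0.20.3

# Fetch the patch attached to the HBASE-2174 JIRA issue
wget https://issues.apache.org/jira/secure/attachment/12436659/HBASE-2174_0.20.3.patch

# -p0: the file paths inside the patch are relative to the source root
patch -p0 < HBASE-2174_0.20.3.patch

# Rebuild; the new tarball ends up under build/
ant tar
```

If `patch` reports rejected hunks (`.rej` files), the tree is probably not a clean 0.20.3 checkout and the patch should not be forced.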
> >>
> >> ZooKeeper is sensitive to pauses and network latency, as would be any
> >> fault-tolerant distributed system. ZK and HBase must determine when
> >> something has "failed", and the primary way is that it has not responded
> >> within some period of time. 50ms is negligible from a fault-detection
> >> standpoint, but 50 seconds is not.
> >>
> >> -----Original Message-----
> >> From: Michael Segel [mailto:michael_se...@hotmail.com]
> >> Sent: Wednesday, March 03, 2010 9:29 AM
> >> To: hbase-user@hadoop.apache.org
> >> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >>
> >> > Date: Wed, 3 Mar 2010 09:17:06 -0800
> >> > From: ph...@apache.org
> >> > To: hbase-user@hadoop.apache.org
> >> > Subject: Re: Trying to understand HBase/ZooKeeper Logs
> >> [SNIP]
> >> > There are a few issues involved with the ping time:
> >> >
> >> > 1) the network (obv :-) )
> >> > 2) the zk server - if the server is highly loaded, the pings may take
> >> > longer. The heartbeat is also a "health check" that the client is doing
> >> > against the server (as much as it is a "health check" for the server
> >> > that the client is still live). The HB is routed "all the way" through
> >> > the ZK server, i.e. through the processing pipeline. So if the server were
> >> > stalled, it would not respond immediately (vs., say, reading the HB at the
> >> > thread that reads data from the client). You can see the min/max/avg
> >> > request latencies on the zk server by using the "stat" 4letter word. See
> >> > the ZK admin docs on this: http://bit.ly/dglVld
> >> > 3) the zk client - clients can only process HB responses if they are
> >> > running. Say the JVM GC runs in blocking mode; this will block all
> >> > client threads (incl. the zk client thread) and the HB response will sit
> >> > until the GC is finished. This is why HBase RSs typically use very, very
> >> > large (from our, zk, perspective) session timeouts.
> >> >
> >> > 50ms is not long, btw.
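Patrick's "stat" 4-letter word can be sent over a plain TCP connection, e.g. with nc. The `localhost 2181` address below assumes a default standalone ZooKeeper; substitute your own quorum member:

```shell
# Query a ZooKeeper server's request-latency stats with the "stat"
# four-letter word. localhost:2181 is the default standalone address.
echo stat | nc localhost 2181

# The reply includes a latency line such as:
#   Latency min/avg/max: 0/1/23
# (values in ms). A max that approaches the session timeout suggests the
# server itself is stalling, per Patrick's point (2) above.
```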
> >> > I believe that the RSs are using 30sec timeouts.
> >> >
> >> > I can't shed direct light on this (i.e., what's the problem in hbase that
> >> > could cause your issue). I'll let jd/stack comment on that.
> >> >
> >> > Patrick
> >> >
> >>
> >> Thanks for the quick response.
> >>
> >> I'm trying to track down the issue of why we're getting a lot of 'partial'
> >> failures. Unfortunately this is currently a lot like watching a pot boil. :-(
> >>
> >> What I am calling a 'partial failure' is that the region servers are
> >> spawning second or even third instances, where only the last one appears
> >> to be live.
> >>
> >> From what I can tell, there's a spike of network activity that causes one
> >> of the processes to think that there is something wrong and spawn a new
> >> instance.
> >>
> >> Is this a good description?
> >>
> >> Because some of the failures occur late at night with no load on the
> >> system, I suspect that we have issues with the network, but I can't
> >> definitively say.
> >>
> >> Which process is the most sensitive to network latency issues?
> >>
> >> Sorry, still relatively new to HBase, and I'm trying to track down a nasty
> >> issue that causes HBase to fail on an almost regular basis. I think it's a
> >> networking issue, but I can't be sure.
> >>
> >> Thx
> >>
> >> -Mike
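Patrick's point about region servers requesting large session timeouts corresponds to a single HBase setting, `zookeeper.session.timeout`. A minimal sketch of the relevant `hbase-site.xml` entry; the 60-second value is illustrative only, not a recommendation from this thread:

```xml
<!-- hbase-site.xml: the ZK session timeout the region servers request.
     A large value gives the JVM room to finish a long stop-the-world GC
     before ZK declares the session dead and the RS is forced to restart.
     60000 ms is an example value, not a tuned recommendation. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>
```

Note the effective timeout is also bounded by what the ZK server allows (its tick-based min/max session limits), so raising this alone may not be sufficient.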