What version of HBase are you running? There were some recent fixes related to DNS issues causing regionservers to check in to the master under a different name. Is there anything strange about the network or DNS setup of your cluster?
ZooKeeper is sensitive to pauses and network latency, as is any fault-tolerant distributed system. ZK and HBase must determine when something has "failed", and the primary signal is that it has not responded within some period of time. 50ms is negligible from a fault-detection standpoint, but 50 seconds is not.

-----Original Message-----
From: Michael Segel [mailto:michael_se...@hotmail.com]
Sent: Wednesday, March 03, 2010 9:29 AM
To: hbase-user@hadoop.apache.org
Subject: RE: Trying to understand HBase/ZooKeeper Logs

> Date: Wed, 3 Mar 2010 09:17:06 -0800
> From: ph...@apache.org
> To: hbase-user@hadoop.apache.org
> Subject: Re: Trying to understand HBase/ZooKeeper Logs

[SNIP]

> There are a few issues involved with the ping time:
>
> 1) the network (obv :-) )
> 2) the zk server - if the server is highly loaded, the pings may take
> longer. The heartbeat is also a "health check" that the client is doing
> against the server (as much as it is a "health check" for the server
> that the client is still live). The HB is routed "all the way" through
> the ZK server, i.e. through the processing pipeline. So if the server were
> stalled it would not respond immediately (vs., say, reading the HB at the
> thread that reads data from the client). You can see the min/max/avg
> request latencies on the zk server by using the "stat" 4-letter word. See
> the ZK admin docs on this: http://bit.ly/dglVld
> 3) the zk client - clients can only process HB responses if they are
> running. Say the JVM GC runs in blocking mode; this will block all
> client threads (incl the zk client thread), and the HB response will sit
> until the GC is finished. This is why HBase RSs typically use very, very
> large (from our, zk, perspective) session timeouts.
>
> 50ms is not long, btw. I believe that RSs are using >> 30sec timeouts.
>
> I can't shed direct light on this (i.e. what the problem in hbase is that
> could cause your issue). I'll let jd/stack comment on that.
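[Editor's illustration] Patrick's point about timeouts and GC pauses can be sketched with a toy simulation. This is not HBase or ZooKeeper code; `SessionTracker` and all values here are invented for illustration. It models the server-side view: a client whose heartbeat does not arrive within the session timeout is declared dead, so the same 50ms stall is harmless against a 30-second timeout but fatal against a 40ms one.

```python
import time

class SessionTracker:
    """Toy model of session-based fault detection: the server declares a
    client dead if no heartbeat arrives within the session timeout."""

    def __init__(self, session_timeout_s):
        self.session_timeout_s = session_timeout_s
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        # Client checked in; reset the expiry clock.
        self.last_heartbeat = time.monotonic()

    def is_expired(self):
        # True once the client has been silent longer than the timeout.
        return (time.monotonic() - self.last_heartbeat) > self.session_timeout_s

# A 50ms stall is negligible against a 30s timeout...
tracker = SessionTracker(session_timeout_s=30.0)
tracker.heartbeat()
time.sleep(0.05)              # e.g. network jitter or a short GC pause
print(tracker.is_expired())   # False: well within the timeout

# ...but the same stall exceeds a 40ms timeout.
tight = SessionTracker(session_timeout_s=0.040)
tight.heartbeat()
time.sleep(0.05)              # the same 50ms pause
print(tight.is_expired())     # True: session expires; the client looks dead
```

This is why a long blocking GC on a regionserver can make a perfectly healthy process "disappear" from ZooKeeper's point of view, and why HBase uses large session timeouts.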
> Patrick

Thanks for the quick response.

I'm trying to track down the issue of why we're getting a lot of 'partial' failures. Unfortunately, this is currently a lot like watching a pot boil. :-(

What I am calling a 'partial failure' is that the region servers are spawning second or even third instances, where only the last one appears to be live. From what I can tell, there's a spike of network activity that causes one of the processes to think that something is wrong and spawn a new instance. Is this a good description?

Because some of the failures occur late at night with no load on the system, I suspect that we have issues with the network, but I can't definitively say. Which process is the most sensitive to network latency issues?

Sorry, still relatively new to HBase, and I'm trying to track down a nasty issue that causes HBase to fail on an almost regular basis. I think it's a networking issue, but I can't be sure.

Thx

-Mike
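[Editor's note] For anyone hitting similar session expirations, the timeout Patrick mentions is configurable on the HBase side. A hedged sketch of the relevant hbase-site.xml entry (the value shown is illustrative, not a recommendation; check your HBase version's docs for the supported range):

```xml
<!-- hbase-site.xml: how long ZooKeeper waits for a regionserver
     heartbeat before expiring its session. Value is illustrative. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value> <!-- milliseconds -->
</property>
```

Raising this masks GC pauses and network hiccups, at the cost of slower detection of genuinely dead regionservers.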