Michael,

Grep your master log for "Received report from unknown server". If you find it, it means you have DNS flapping. That may explain why you see a "new instance": in this case it would be the master registering the region server a second or third time. The patch in this JIRA fixes the issue: https://issues.apache.org/jira/browse/HBASE-2174
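Something like this should find it (the log path here is only an example, point it at wherever your master actually writes its log):

  grep "Received report from unknown server" /path/to/hbase/logs/hbase-*-master-*.log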
J-D

On Wed, Mar 3, 2010 at 9:28 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>
>> Date: Wed, 3 Mar 2010 09:17:06 -0800
>> From: ph...@apache.org
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> [SNIP]
>> There are a few issues involved with the ping time:
>>
>> 1) the network (obv :-) )
>> 2) the zk server - if the server is highly loaded the pings may take
>> longer. The heartbeat is also a "health check" that the client is doing
>> against the server (as much as it is a "health check" for the server
>> that the client is still live). The HB is routed "all the way" through
>> the ZK server, ie through the processing pipeline. So if the server were
>> stalled it would not respond immediately (vs say reading the HB at the
>> thread that reads data from the client). You can see the min/max/avg
>> request latencies on the zk server by using the "stat" 4-letter word. See
>> the ZK admin docs on this: http://bit.ly/dglVld
>> 3) the zk client - clients can only process HB responses if they are
>> running. Say the JVM GC runs in blocking mode, this will block all
>> client threads (incl the zk client thread) and the HB response will sit
>> until the GC is finished. This is why HBase RSs typically use very very
>> large (from our, zk, perspective) session timeouts.
>>
>> 50ms is not long, btw. I believe that RSs are using 30sec timeouts.
>>
>> I can't shed direct light on this (ie what's the problem in hbase that
>> could cause your issue). I'll let jd/stack comment on that.
>>
>> Patrick
>
> Thanks for the quick response.
>
> I'm trying to track down the issue of why we're getting a lot of 'partial'
> failures. Unfortunately this is currently a lot like watching a pot boil. :-(
>
> What I am calling a 'partial failure' is that the region servers are spawning
> second or even third instances, where only the last one appears to be live.
>
> From what I can tell, there's a spike of network activity that causes
> one of the processes to think that there is something wrong and spawn a new
> instance.
>
> Is this a good description?
>
> Because some of the failures occur late at night with no load on the system,
> I suspect that we have issues with the network, but I can't definitively say.
>
> Which process is the most sensitive to network latency issues?
>
> Sorry, still relatively new to HBase. I'm trying to track down a nasty
> issue that causes HBase to fail on an almost regular basis. I think it's a
> networking issue, but I can't be sure.
>
> Thx
>
> -Mike
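(For reference, the "stat" four-letter word Patrick mentions above can be sent to a ZooKeeper server like this, assuming the default client port 2181; it prints the min/avg/max request latencies along with the current connections:

  echo stat | nc <zk-host> 2181

If those latencies are high around the time your region servers lose their sessions, the ZK server itself is a good suspect.)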