What version of HBase are you running?  There were some recent fixes related
to DNS issues that caused regionservers to check in to the master under a
different name.  Is there anything unusual about the network or DNS setup of
your cluster?

ZooKeeper is sensitive to pauses and network latency, as is any
fault-tolerant distributed system.  ZK and HBase must determine when
something has "failed", and the primary signal is that it has not responded
within some period of time.  50ms is negligible from a fault-detection
standpoint, but 50 seconds is not.
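
If it helps, the request latencies Patrick mentions below can be pulled with
the "stat" four-letter word.  Here is a rough sketch in Python, assuming a ZK
server on localhost at the default client port 2181.  The function names
zk_stat and parse_latency are made up for this example, and the parsing
assumes the reply contains a "Latency min/avg/max:" line, so treat it as a
starting point and not a tested tool.

```python
import re
import socket


def zk_stat(host="localhost", port=2181, timeout=5.0):
    """Send the 'stat' four-letter word and return the raw text reply.

    host and port are assumptions; point them at one of your ZK servers.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            # The server closes the connection once it has replied.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_latency(stat_text):
    """Return (min, avg, max) request latency in ms, or None if not found."""
    match = re.search(
        r"Latency min/avg/max:\s*([\d.]+)/([\d.]+)/([\d.]+)", stat_text
    )
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


if __name__ == "__main__":
    print(parse_latency(zk_stat()))
```

Running this against each ZK server while the problem is happening should show
whether the max latency is anywhere near your session timeout.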

-----Original Message-----
From: Michael Segel [mailto:michael_se...@hotmail.com] 
Sent: Wednesday, March 03, 2010 9:29 AM
To: hbase-user@hadoop.apache.org
Subject: RE: Trying to understand HBase/ZooKeeper Logs




> Date: Wed, 3 Mar 2010 09:17:06 -0800
> From: ph...@apache.org
> To: hbase-user@hadoop.apache.org
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
[SNIP]
> There are a few issues involved with the ping time:
> 
> 1) the network (obv :-) )
> 2) the zk server - if the server is highly loaded the pings may take 
> longer. The heartbeat is also a "health check" that the client is doing 
> against the server (as much as it is a "health check" for the server 
> that the client is still live). The HB is routed "all the way" through 
> the ZK server, ie through the processing pipeline. So if the server were 
> stalled it would not respond immediately (vs say reading the HB at the 
> thread that reads data from the client). You can see the min/max/avg 
> request latencies on the zk server by using the "stat" 4letter word. See 
> the ZK admin docs on this http://bit.ly/dglVld
> 3) the zk client - clients can only process HB responses if they are 
> running. Say the JVM GC runs in blocking mode, this will block all 
> client threads (incl the zk client thread) and the HB response will sit 
> until the GC is finished. This is why HBase RSs typically use very very 
> large (from our, zk, perspective) session timeouts.
> 
> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
> 
> I can't directly shed light on this (ie what's the problem in hbase that 
> could cause your issue). I'll let jd/stack comment on that.
> 
> Patrick
> 

Thanks for the quick response.

I'm trying to track down the issue of why we're getting a lot of 'partial'
failures. Unfortunately this is currently a lot like watching a pot boil.
:-( 

What I am calling a 'partial failure' is that the region servers are
spawning second or even third instances where only the last one appears to
be live.

From what I can tell, there's a spike of network activity that causes one of
the processes to think that there is something wrong and spawn a new
instance.

Is this a good description?

Because some of the failures occur late at night with no load on the system,
I suspect that we have issues with the network but I can't definitively say.

Which process is the most sensitive to network latency issues?

Sorry, I'm still relatively new to HBase and I'm trying to track down a nasty
issue that causes HBase to fail on an almost regular basis. I think it's a
networking issue, but I can't be sure.

Thx

-Mike




                                          