RE: Trying to understand HBase/ZooKeeper Logs

Michael Segel Wed, 03 Mar 2010 09:29:24 -0800

> Date: Wed, 3 Mar 2010 09:17:06 -0800
> From: ph...@apache.org
> To: hbase-user@hadoop.apache.org
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
[SNIP]
> There are a few issues involved with the ping time:
> 
> 1) the network (obv :-) )
> 2) the zk server - if the server is highly loaded the pings may take 
> longer. The heartbeat is also a "health check" that the client is doing 
> against the server (as much as it is a "health check" for the server 
> that the client is still live). The HB is routed "all the way" through 
> the ZK server, ie through the processing pipeline. So if the server were 
> stalled it would not respond immediately (vs say reading the HB at the 
> thread that reads data from the client). You can see the min/max/avg 
> request latencies on the zk server by using the "stat" 4letter word. See 
> the ZK admin docs on this http://bit.ly/dglVld
> 3) the zk client - clients can only process HB responses if they are 
> running. Say the JVM GC runs in blocking mode, this will block all 
> client threads (incl the zk client thread) and the HB response will sit 
> until the GC is finished. This is why HBase RSs typically use very very 
> large (from our, zk, perspective) session timeouts.
> 
> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
> 
> I can't shed directly light on this (ie what's the problem in hbase that 
> could cause your issue). I'll let jd/stack comment on that.
> 
> Patrick
> 
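
For reference, the "stat" check from point 2 can be scripted. This is only a rough sketch, assuming the ZK server listens on the default client port 2181 and that its "stat" reply contains a "Latency min/avg/max:" line. The helper names four_letter_word and parse_latency are made up for illustration, and newer ZooKeeper releases may need the command enabled in the server config.

```python
import re
import socket


def four_letter_word(host, port=2181, cmd=b"stat", timeout=5.0):
    """Send a four-letter command to a ZK server and return the text reply.

    host/port are assumptions about your setup; 2181 is the usual client port.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_latency(stat_text):
    """Return (min, avg, max) request latency from 'stat' output, or None."""
    num = r"(\d+(?:\.\d+)?)"
    match = re.search(r"Latency min/avg/max:\s*" + num + "/" + num + "/" + num,
                      stat_text)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


# Hypothetical usage against a local server:
#   print(parse_latency(four_letter_word("localhost")))
```

Polling this from each client host and watching the max value around the time of a failure might help separate a slow ZK server from a slow network or a paused client.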

Thanks for the quick response.

I'm trying to track down the issue of why we're getting a lot of 'partial' 
failures. Unfortunately this is currently a lot like watching a pot boil. :-( 

What I am calling a 'partial failure' is that the region servers are spawning 
second or even third instances where only the last one appears to be live.

From what I can tell, a spike of network activity causes one of the 
processes to think that something is wrong and spawn a new instance.

Is this a good description?

Because some of the failures occur late at night with no load on the system, I 
suspect that we have issues with the network but I can't definitively say.

Which process is the most sensitive to network latency issues?

Sorry, I'm still relatively new to HBase and I'm trying to track down a nasty 
issue that causes HBase to fail on an almost regular basis. I think it's a 
networking issue, but I can't be sure.

Thx

-Mike




                                          