Michael,

Grep your master log for "Received report from unknown server". If you find it, it means you have DNS flapping. That may explain why you see a "new instance": in this case it would be the master registering the region server a second or third time. The patch in this JIRA fixes the issue: https://issues.apache.org/jira/browse/HBASE-2174
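Something like this should find it (the log path here is only an example, point it at wherever your master actually writes its log):

  grep "Received report from unknown server" /path/to/hbase/logs/hbase-*-master-*.log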
J-D

On Wed, Mar 3, 2010 at 9:28 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>
>> Date: Wed, 3 Mar 2010 09:17:06 -0800
>> From: ph...@apache.org
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> [SNIP]
>> There are a few issues involved with the ping time:
>>
>> 1) the network (obv :-) )
>> 2) the zk server - if the server is highly loaded the pings may take
>> longer. The heartbeat is also a "health check" that the client is doing
>> against the server (as much as it is a "health check" for the server
>> that the client is still live). The HB is routed "all the way" through
>> the ZK server, ie through the processing pipeline. So if the server were
>> stalled it would not respond immediately (vs say reading the HB at the
>> thread that reads data from the client). You can see the min/max/avg
>> request latencies on the zk server by using the "stat" 4-letter word. See
>> the ZK admin docs on this: http://bit.ly/dglVld
>> 3) the zk client - clients can only process HB responses if they are
>> running. Say the JVM GC runs in blocking mode, this will block all
>> client threads (incl the zk client thread) and the HB response will sit
>> until the GC is finished. This is why HBase RSs typically use very very
>> large (from our, zk, perspective) session timeouts.
>>
>> 50ms is not long, btw. I believe that RSs are using 30sec timeouts.
>>
>> I can't shed direct light on this (ie what's the problem in hbase that
>> could cause your issue). I'll let jd/stack comment on that.
>>
>> Patrick
>
> Thanks for the quick response.
>
> I'm trying to track down the issue of why we're getting a lot of 'partial'
> failures. Unfortunately this is currently a lot like watching a pot boil. :-(
>
> What I am calling a 'partial failure' is that the region servers are spawning
> second or even third instances, where only the last one appears to be live.
>
> From what I can tell, there's a spike of network activity that causes
> one of the processes to think that there is something wrong and spawn a new
> instance.
>
> Is this a good description?
>
> Because some of the failures occur late at night with no load on the system,
> I suspect that we have issues with the network, but I can't definitively say.
>
> Which process is the most sensitive to network latency issues?
>
> Sorry, still relatively new to HBase. I'm trying to track down a nasty
> issue that causes HBase to fail on an almost regular basis. I think it's a
> networking issue, but I can't be sure.
>
> Thx
>
> -Mike
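(For reference, the "stat" four-letter word Patrick mentions above can be sent to a ZooKeeper server like this, assuming the default client port 2181; it prints the min/avg/max request latencies along with the current connections:

  echo stat | nc <zk-host> 2181

If those latencies are high around the time your region servers lose their sessions, the ZK server itself is a good suspect.)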