Michael,

Grep your master log for "Received report from unknown server". If you
find it, it means that you have DNS flapping. This may explain why you
see a "new instance": in this case it would be the master registering
the region server a second or third time. The patch in this JIRA fixes
the issue:
https://issues.apache.org/jira/browse/HBASE-2174
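
A minimal Python sketch of that log check, for anyone who would rather
script it than run grep by hand. The function name and the log path in
the usage note are hypothetical; only the marker string comes from the
message above, and nothing here assumes a particular layout for the rest
of the log line.

```python
# The marker string is the one quoted in the message above.
MARKER = "Received report from unknown server"


def find_unknown_server_reports(lines):
    """Return (line_number, line) pairs for every line containing MARKER.

    `lines` can be any iterable of strings, e.g. an open log file.
    Line numbers are 1-based, like grep -n.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if MARKER in line:
            hits.append((number, line.rstrip("\n")))
    return hits


def summarize(hits):
    """Build a one-line summary of the scan result."""
    if not hits:
        return "no matches: DNS flapping is not indicated by this log"
    return "%d match(es): the master saw reports from unknown servers" % len(hits)
```

To use it, open your master log (the path is site-specific, something
like /path/to/hbase-master.log) and pass the file object to
find_unknown_server_reports, then print summarize() on the result.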

J-D

On Wed, Mar 3, 2010 at 9:28 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>
>
>
>> Date: Wed, 3 Mar 2010 09:17:06 -0800
>> From: ph...@apache.org
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> [SNIP]
>> There are a few issues involved with the ping time:
>>
>> 1) the network (obv :-) )
>> 2) the zk server - if the server is highly loaded the pings may take
>> longer. The heartbeat is also a "health check" that the client is doing
>> against the server (as much as it is a "health check" for the server
>> that the client is still live). The HB is routed "all the way" through
>> the ZK server, ie through the processing pipeline. So if the server were
>> stalled it would not respond immediately (vs say reading the HB at the
>> thread that reads data from the client). You can see the min/max/avg
>> request latencies on the zk server by using the "stat" 4letter word. See
>> the ZK admin docs on this http://bit.ly/dglVld
>> 3) the zk client - clients can only process HB responses if they are
>> running. Say the JVM GC runs in blocking mode, this will block all
>> client threads (incl the zk client thread) and the HB response will sit
>> until the GC is finished. This is why HBase RSs typically use very very
>> large (from our, zk, perspective) session timeouts.
>>
>> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
>>
>> I can't directly shed light on this (ie what's the problem in hbase that
>> could cause your issue). I'll let jd/stack comment on that.
>>
>> Patrick
>>
>
> Thanks for the quick response.
>
> I'm trying to track down the issue of why we're getting a lot of 'partial' 
> failures. Unfortunately this is currently a lot like watching a pot boil. :-(
>
> What I am calling a 'partial failure' is that the region servers are spawning 
> second or even third instances where only the last one appears to be live.
>
> From what I can tell, there's a spike of network activity that causes 
> one of the processes to think that there is something wrong and spawn a new 
> instance.
>
> Is this a good description?
>
> Because some of the failures occur late at night with no load on the system, 
> I suspect that we have issues with the network but I can't definitively say.
>
> Which process is the most sensitive to network latency issues?
>
> Sorry, still relatively new to HBase and I'm trying to track down a nasty 
> issue that causes HBase to fail on an almost regular basis. I think it's a 
> networking issue, but I can't be sure.
>
> Thx
>
> -Mike
