Hi,

I'm trying to debug an issue where I am getting 'partial' failures. For some 
reason the region servers seem to end up with multiple 'live' servers on a 
node. (We start with 3 servers and the next morning we see 4,5 or 6 servers 
where a server has multiple servers 'live'. ) Yet if you do a list or a scan on 
a table, an exception gets thrown. (The next time we have a failure I'll 
include the exception....)

I've set all of the logging to Debug so I should be picking up as much 
information.

The master log shows the following:
2010-03-02 20:05:40,712 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 
.META. region(s) scanned
2010-03-02 20:05:45,000 DEBUG org.apache.zookeeper.ClientCnxn: Got ping 
response for sessionid:0x2720dcdb350000 after 3ms
2010-03-02 20:06:05,032 DEBUG org.apache.zookeeper.ClientCnxn: Got ping 
response for sessionid:0x2720dcdb350000 after 35ms
2010-03-02 20:06:24,998 DEBUG org.apache.zookeeper.ClientCnxn: Got ping 
response for sessionid:0x2720dcdb350000 after 0ms
2010-03-02 20:06:39,563 INFO org.apache.hadoop.hbase.master.ServerManager: 3 
region servers, 0 dead, average load 1.6666666666666667
2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.rootScanner scanning meta region {server: 10.8.237.230:60020, 
regionname: -ROOT-,,0, startKey: <>}
2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.metaScanner scanning meta region {server: 10.8.237.232:60020, 
regionname: .META.,,1, startKey: <>}

(Hopefully this formats ok...)

I'm trying to understand what I'm seeing.
Am I correct when I say that this is where the master node is pinging the lead 
zookeeper as a way to maintain a heartbeat to see if zookeeper is alive?

On the region servers I see every node with roughly the following:
2010-03-03 09:31:52,086 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: 
Cache Stats: Sizes: Total=2.9179459MB (3059688), Free=352.64456MB (369774616), 
Max=355.5625MB (372834304), Counts: Blocks=0, Access=0, Hit=0, Miss=0, 
Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
2010-03-03 09:31:52,222 DEBUG org.apache.zookeeper.ClientCnxn: Got ping 
response for sessionid:0x12722bb961d0001 after 0ms
2010-03-03 09:32:12,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping 
response for sessionid:0x12722bb961d0001 after 0ms
2010-03-03 09:32:32,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping 
response for sessionid:0x12722bb961d0001 after 0ms


Going through the logs, I see 0-1ms response time from the region servers to 
zookeeper.

I'm trying to track down why I'm having partial failures.
That is, on a region server, I see multiple live servers, where one is actually 
alive.
(This problem is intermittent and I haven't seen a failure [yet] since I turned 
on the debugging.)

Is it normal to see pings as long as 50ms when a master pings zookeeper?

Thx

-Mike






                                          
_________________________________________________________________
Your E-mail and More On-the-Go. Get Windows Live Hotmail Free.
http://clk.atdmt.com/GBL/go/201469229/direct/01/

Reply via email to