Hi, I'm trying to debug an issue where I'm getting 'partial' failures. For some reason, a node ends up with multiple region servers listed as 'live'. (We start with 3 region servers, and the next morning we see 4, 5, or 6, with a single node showing multiple 'live' servers.) Yet if you do a list or a scan on a table, an exception gets thrown. (The next time we have a failure I'll include the exception.)
I've set all of the logging to DEBUG, so I should be picking up as much information as possible. The master log shows the following:

2010-03-02 20:05:40,712 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2010-03-02 20:05:45,000 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 3ms
2010-03-02 20:06:05,032 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 35ms
2010-03-02 20:06:24,998 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 0ms
2010-03-02 20:06:39,563 INFO org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0 dead, average load 1.6666666666666667
2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.8.237.230:60020, regionname: -ROOT-,,0, startKey: <>}
2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.8.237.232:60020, regionname: .META.,,1, startKey: <>}

(Hopefully this formats ok...)

I'm trying to understand what I'm seeing. Am I correct in saying that the ping lines show the master pinging the lead zookeeper as a heartbeat, to check that zookeeper is alive?
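To compare ping times across nodes, the "Got ping response" lines can be tallied with a short script. This is only a rough sketch: it assumes the log format is exactly as in the excerpt above, and the names ping_latencies and summarize, and the SAMPLE list, are made up for illustration.

```python
import re
from statistics import mean

# Matches the ZooKeeper client DEBUG lines as they appear in the excerpt above.
# Assumption: every ping line follows this exact layout.
PING_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) DEBUG "
    r"org\.apache\.zookeeper\.ClientCnxn: Got ping response for "
    r"sessionid:(?P<sid>0x[0-9a-fA-F]+) after (?P<ms>\d+)ms"
)


def ping_latencies(lines):
    """Return (timestamp, session id, latency in ms) for each ping line."""
    found = []
    for line in lines:
        match = PING_RE.match(line.strip())
        if match:
            found.append(
                (match.group("ts"), match.group("sid"), int(match.group("ms")))
            )
    return found


def summarize(lines):
    """Return count/min/max/mean of ping latencies, or None if none matched."""
    latencies = [entry[2] for entry in ping_latencies(lines)]
    if not latencies:
        return None
    return {
        "count": len(latencies),
        "min": min(latencies),
        "max": max(latencies),
        "mean": mean(latencies),
    }


# Lines copied from the master log excerpt; non-ping lines are ignored.
SAMPLE = [
    "2010-03-02 20:05:40,712 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned",
    "2010-03-02 20:05:45,000 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 3ms",
    "2010-03-02 20:06:05,032 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 35ms",
    "2010-03-02 20:06:24,998 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 0ms",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running the same function over a whole master or region server log (for example, summarize(open(path)) with path pointing at the log file) would show whether the larger ping times are isolated spikes or a pattern.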
On the region servers, every node shows roughly the following:

2010-03-03 09:31:52,086 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=2.9179459MB (3059688), Free=352.64456MB (369774616), Max=355.5625MB (372834304), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
2010-03-03 09:31:52,222 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
2010-03-03 09:32:12,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
2010-03-03 09:32:32,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms

Going through the logs, I see 0-1ms response times from the region servers to zookeeper. I'm trying to track down why I'm having partial failures. That is, on a region server node, I see multiple 'live' servers listed, where only one is actually alive. (This problem is intermittent, and I haven't seen a failure [yet] since I turned on the debugging.)

Is it normal to see pings as long as 50ms when a master pings zookeeper?

Thx
-Mike