Hi Lucas, from the log snippet you sent earlier it looks like the ZK client session times out (as J-D identified) at 08:40:44. After this happens the client attempts to re-establish the session, but the server has already expired it (meaning that it had not heard from the client within the timeout period). I don't see any big time-delta discontinuities in the logs (client or server) that would indicate GC/swap activity, however I can't be 100% sure given that we only have snippets of the logs to work with.
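
(As background on why the reconnect ends up as a brand-new session: an expired ZK session can never be resumed, so once the server has expired it the application has to close the old handle and construct a fresh ZooKeeper instance. Below is a minimal, illustrative Java sketch of that pattern; the quorum address and timeout are made up, and this is not HBase's actual ZooKeeperWrapper code.)

import java.io.IOException;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

public class ExpiryAwareClient implements Watcher {
    // Hypothetical quorum address and session timeout, for illustration only.
    private static final String QUORUM = "zkhost:2181";
    private static final int SESSION_TIMEOUT_MS = 60000;

    private volatile ZooKeeper zk;

    public ExpiryAwareClient() throws IOException {
        zk = new ZooKeeper(QUORUM, SESSION_TIMEOUT_MS, this);
    }

    public void process(WatchedEvent event) {
        // Disconnected is recoverable (the same session can be resumed once the
        // network comes back), but Expired is final: the only way forward is a
        // brand-new session, i.e. a brand-new ZooKeeper handle.
        if (event.getState() == KeeperState.Expired) {
            try {
                zk.close();
                zk = new ZooKeeper(QUORUM, SESSION_TIMEOUT_MS, this);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            } catch (IOException ioe) {
                // A real client would retry with backoff here.
            }
        }
    }
}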

I see all kinds of network connectivity errors in the log during this period, so I would assume that a network issue caused the client to lose contact with the server for some period of time.

After this the ZK client (HBase) is able to establish a new ZK session successfully; for example, here it is reading a znode:

2010-05-27 08:40:47,129 DEBUG
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master
got 10.251.158.224:60000

Notice that this is a few (~3) seconds after the TIMEOUT occurred.
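
(That DEBUG line is just an ordinary znode read performed on the new session. At the ZooKeeper API level it boils down to a getData call, roughly like the self-contained sketch below; the quorum host:port and timeout are hypothetical, and this is not the actual HBase ZooKeeperWrapper code.)

import org.apache.zookeeper.ZooKeeper;

public class ReadMasterAddress {
    public static void main(String[] args) throws Exception {
        // Hypothetical quorum host:port and timeout; the znode path is the one
        // from the log line above.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 60000, event -> { });
        byte[] data = zk.getData("/hbase/master", false, null); // no watch, no Stat needed
        System.out.println("Read ZNode /hbase/master got " + new String(data));
        zk.close();
    }
}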


The server log shows something similar: a number of sessions expire (which bolsters my guess that a network connectivity issue occurred during this timeframe). Then at 08:41:12 a number of sessions are established (i.e. clients reconnecting to the ZK service after timing out). Then the log snippet stops, so I can't see what happens next.


So from a ZK perspective, as far as I can tell, all looks fine. There was a blip that caused connectivity to fail, during which the sessions timed out and were expired. Subsequently the ZK clients established new sessions with the ZK server and were able to access ZK data.

Patrick


On 05/27/2010 11:28 AM, Lucas Nazário dos Santos wrote:
So I see 2 main issues:

  - Your master's zookeeper session timed out. Why? Hard to tell with
those logs since it happened before what you pasted. Very slow IO?

I'm not swapping, and I doubt that the ZooKeeper session timed out because of
slow IO since my applications aren't even close to stressing the hardware. I've
already followed the instructions in
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9 to avoid this kind of
problem.

Swapping + GC?
  - Your region server seemed to have moved elsewhere, or something
weird like that. DNS blip? Can't tell from the logs.

Maybe a DNS blip. But how can I confirm it? Logs? I didn't move anything, and
as soon as I restarted the cluster things got back on track.

The HBase master log just repeated the text below for the last 8 hours
before the crash. The ZooKeeper and region server logs are free of
errors.

2010-05-27 08:28:15,118 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020,
regionname: -ROOT-,,0, startKey:<>}
2010-05-27 08:28:15,125 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.251.158.224:60020, regionname: -ROOT-,,0, startKey:<>} complete
2010-05-27 08:28:26,379 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.251.158.224:60020,
regionname: .META.,,1, startKey:<>}
2010-05-27 08:28:26,787 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 73 row(s) of meta region {server:
10.251.158.224:60020, regionname: .META.,,1, startKey:<>} complete
2010-05-27 08:28:26,788 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2010-05-27 08:28:32,603 INFO org.apache.hadoop.hbase.master.ServerManager: 1
region servers, 0 dead, average load 75.0
2010-05-27 08:29:15,123 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020,
regionname: -ROOT-,,0, startKey:<>}
2010-05-27 08:29:15,138 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.251.158.224:60020, regionname: -ROOT-,,0, startKey:<>} complete
2010-05-27 08:29:26,380 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.251.158.224:60020,
regionname: .META.,,1, startKey:<>}
2010-05-27 08:29:26,738 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 73 row(s) of meta region {server:
10.251.158.224:60020, regionname: .META.,,1, startKey:<>} complete
2010-05-27 08:29:26,738 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2010-05-27 08:29:32,605 INFO org.apache.hadoop.hbase.master.ServerManager: 1
region servers, 0 dead, average load 75.0





On Thu, May 27, 2010 at 12:49 PM, Jean-Daniel Cryans <[email protected]> wrote:

From what I see, nothing happened to ZooKeeper.

What happened:

1) The master wasn't able to scan the -ROOT- region because the
connection was reset (same with .META.)
2010-05-27 08:40:44,259 WARN org.apache.hadoop.hbase.master.BaseScanner:
Scan ROOT region
java.io.IOException: Call to /10.251.158.224:60020 failed on local
exception: java.io.IOException: Connection reset by peer

2) The master's session with ZooKeeper timed out
2010-05-27 08:40:46,630 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x128c8b303040000 to sun.nio.ch.selectionkeyi...@744e022c
java.io.IOException: Session Expired
   at

org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
   at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

3) The master was kicked out of the cluster and tried to re-enter it
2010-05-27 08:40:46,631 INFO org.apache.hadoop.hbase.master.HMaster: Master
lost its znode, trying to get a new one

4) The master was able to win the race to be the main master again
(easy, since there's only 1 machine in your cluster)
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Wrote master address
10.251.158.224:60000 to ZooKeeper
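
(For what it's worth, that "race" is the standard ZooKeeper leader-election pattern: whoever creates the ephemeral master znode first wins, and everyone else backs off. A hedged, self-contained sketch of the idea follows; the quorum address is made up and this is not the actual HBase master code.)

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class MasterElectionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical quorum host:port and timeout, for illustration only.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 60000, event -> { });
        byte[] myAddress = "10.251.158.224:60000".getBytes();
        try {
            // EPHEMERAL: the znode vanishes when the owning session expires,
            // which is exactly why the master had to re-create it here.
            zk.create("/hbase/master", myAddress, Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Won the race: wrote master address to ZooKeeper");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Lost the race: another master already holds the znode");
        }
        zk.close();
    }
}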

5) This master still isn't able to scan -ROOT-
2010-05-27 08:41:44,270 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server:
10.251.158.224:60020,
regionname: -ROOT-,,0, startKey:<>}

So I see 2 main issues:

  - Your master's zookeeper session timed out. Why? Hard to tell with
those logs since it happened before what you pasted. Very slow IO?
Swapping + GC?
  - Your region server seemed to have moved elsewhere, or something
weird like that. DNS blip? Can't tell from the logs.

Shouldn't ZooKeeper recover nicely? How can I prevent such an error from
happening in the future?

Nothing happened to ZooKeeper. And since you only have 1 machine, even
if the ZK process did die for some reason, how could it even recover?
Reliability with ZK requires 3 machines or more; nothing can be guaranteed
with only 1 machine.
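
(As an aside, with a real 3-node ensemble the client simply lists every quorum member in its connect string and fails over between them, and the service keeps running as long as a majority, i.e. 2 of the 3, is up. A hedged sketch with made-up hostnames, just to show the shape of the connect string:)

import org.apache.zookeeper.ZooKeeper;

public class EnsembleConnectSketch {
    public static void main(String[] args) throws Exception {
        // Made-up hostnames: a 3-member ensemble listed in one connect string.
        // The client picks a member and fails over to the others; the ensemble
        // itself survives the loss of any single member (2 of 3 is a majority).
        String connectString = "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181";
        ZooKeeper zk = new ZooKeeper(connectString, 60000, event -> { });
        System.out.println("Session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}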

Now, on how to prevent it, we need to understand the root cause of the 2
issues I listed.

Also, not sure if you saw that, but the first minute in your log is
out of order. Very apparent with the first two lines.

J-D

