Hi Lucas, from the log snippet you sent earlier it looks like the ZK client session times out (as J-D identified) at 08:40:44. After this happens the client attempts to re-establish the session, but the server has already expired it (meaning that it had not heard from the client within the timeout period). I don't see any big time-delta discontinuities in the logs (client or server) that would indicate GC/swap activity, however I can't be 100% sure given that we only have snippets of the logs to work with.
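
(As background on why the reconnect ends up as a brand-new session: an expired ZK session can never be resumed, so once the server has expired it the application has to close the old handle and construct a fresh ZooKeeper instance. Below is a minimal, illustrative Java sketch of that pattern; the quorum address and timeout are made up, and this is not HBase's actual ZooKeeperWrapper code.)

import java.io.IOException;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

public class ExpiryAwareClient implements Watcher {
    // Hypothetical quorum address and session timeout, for illustration only.
    private static final String QUORUM = "zkhost:2181";
    private static final int SESSION_TIMEOUT_MS = 60000;

    private volatile ZooKeeper zk;

    public ExpiryAwareClient() throws IOException {
        zk = new ZooKeeper(QUORUM, SESSION_TIMEOUT_MS, this);
    }

    public void process(WatchedEvent event) {
        // Disconnected is recoverable (the same session can be resumed once the
        // network comes back), but Expired is final: the only way forward is a
        // brand-new session, i.e. a brand-new ZooKeeper handle.
        if (event.getState() == KeeperState.Expired) {
            try {
                zk.close();
                zk = new ZooKeeper(QUORUM, SESSION_TIMEOUT_MS, this);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            } catch (IOException ioe) {
                // A real client would retry with backoff here.
            }
        }
    }
}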

I see all kinds of network connectivity errors in the log during this period, so I would assume that a network issue caused the client to lose contact with the server for some period of time.

After this the ZK client (HBase) is able to establish a new ZK session successfully; for example, here it is reading a znode:

2010-05-27 08:40:47,129 DEBUG
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master
got 10.251.158.224:60000

Notice that this is a few (~3) seconds after the TIMEOUT occurred.
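
(That DEBUG line is just an ordinary znode read performed on the new session. At the ZooKeeper API level it boils down to a getData call, roughly like the self-contained sketch below; the quorum host:port and timeout are hypothetical, and this is not the actual HBase ZooKeeperWrapper code.)

import org.apache.zookeeper.ZooKeeper;

public class ReadMasterAddress {
    public static void main(String[] args) throws Exception {
        // Hypothetical quorum host:port and timeout; the znode path is the one
        // from the log line above.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 60000, event -> { });
        byte[] data = zk.getData("/hbase/master", false, null); // no watch, no Stat needed
        System.out.println("Read ZNode /hbase/master got " + new String(data));
        zk.close();
    }
}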


The server log shows something similar: a number of sessions expire (which bolsters my guess that a network connectivity issue occurred during this timeframe). Then at 08:41:12 a number of sessions are established (i.e. clients reconnecting to the ZK service after timing out). Then the log snippet stops, so I can't see what happens next.


So from a ZK perspective, as far as I can tell, all looks fine. There was a blip that caused connectivity to fail, during which the sessions timed out and were expired. Subsequently the ZK clients established new sessions with the ZK server and were able to access ZK data.

Patrick


On 05/27/2010 11:28 AM, Lucas Nazário dos Santos wrote:
So I see 2 main issues:

  - Your master's zookeeper session timed out. Why? Hard to tell with
those logs since it happened before what you pasted. Very slow IO?

I'm not swapping, and I doubt that the ZooKeeper session timed out because of
slow IO since my applications aren't even close to stressing the hardware. I've
already followed the instructions in
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9 to avoid this kind of
problem.

Swapping + GC?
  - Your region server seemed to have moved elsewhere, or something
weird like that. DNS blip? Can't tell from the logs.

Maybe a DNS blip. But how can I confirm it? Logs? I didn't move anything, and
as soon as I restarted the cluster things got back on track.

The HBase master log just repeated the text below for the last 8 hours
before the crash. The ZooKeeper and region server logs are free of
errors.

2010-05-27 08:28:15,118 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020,
regionname: -ROOT-,,0, startKey:<>}
2010-05-27 08:28:15,125 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.251.158.224:60020, regionname: -ROOT-,,0, startKey:<>} complete
2010-05-27 08:28:26,379 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.251.158.224:60020,
regionname: .META.,,1, startKey:<>}
2010-05-27 08:28:26,787 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 73 row(s) of meta region {server:
10.251.158.224:60020, regionname: .META.,,1, startKey:<>} complete
2010-05-27 08:28:26,788 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2010-05-27 08:28:32,603 INFO org.apache.hadoop.hbase.master.ServerManager: 1
region servers, 0 dead, average load 75.0
2010-05-27 08:29:15,123 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020,
regionname: -ROOT-,,0, startKey:<>}
2010-05-27 08:29:15,138 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.251.158.224:60020, regionname: -ROOT-,,0, startKey:<>} complete
2010-05-27 08:29:26,380 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.251.158.224:60020,
regionname: .META.,,1, startKey:<>}
2010-05-27 08:29:26,738 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 73 row(s) of meta region {server:
10.251.158.224:60020, regionname: .META.,,1, startKey:<>} complete
2010-05-27 08:29:26,738 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2010-05-27 08:29:32,605 INFO org.apache.hadoop.hbase.master.ServerManager: 1
region servers, 0 dead, average load 75.0





On Thu, May 27, 2010 at 12:49 PM, Jean-Daniel Cryans <[email protected]> wrote:

From what I see, nothing happened to ZooKeeper.

What happened:

1) The master wasn't able to scan the -ROOT- region because the
connection was reset (same with .META.)
2010-05-27 08:40:44,259 WARN org.apache.hadoop.hbase.master.BaseScanner:
Scan ROOT region
java.io.IOException: Call to /10.251.158.224:60020 failed on local
exception: java.io.IOException: Connection reset by peer

2) The master's session with ZooKeeper timed out
2010-05-27 08:40:46,630 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x128c8b303040000 to sun.nio.ch.selectionkeyi...@744e022c
java.io.IOException: Session Expired
   at

org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
   at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

3) The master was kicked out of the cluster and tried to re-enter it
2010-05-27 08:40:46,631 INFO org.apache.hadoop.hbase.master.HMaster: Master
lost its znode, trying to get a new one

4) The master was able to win the race to be the main master again
(easy, since there's only 1 machine in your cluster)
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Wrote master address
10.251.158.224:60000 to ZooKeeper
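
(For what it's worth, that "race" is the standard ZooKeeper leader-election pattern: whoever creates the ephemeral master znode first wins, and everyone else backs off. A hedged, self-contained sketch of the idea follows; the quorum address is made up and this is not the actual HBase master code.)

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class MasterElectionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical quorum host:port and timeout, for illustration only.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 60000, event -> { });
        byte[] myAddress = "10.251.158.224:60000".getBytes();
        try {
            // EPHEMERAL: the znode vanishes when the owning session expires,
            // which is exactly why the master had to re-create it here.
            zk.create("/hbase/master", myAddress, Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Won the race: wrote master address to ZooKeeper");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Lost the race: another master already holds the znode");
        }
        zk.close();
    }
}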

5) This master still isn't able to scan -ROOT-
2010-05-27 08:41:44,270 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server:
10.251.158.224:60020,
regionname: -ROOT-,,0, startKey:<>}

So I see 2 main issues:

  - Your master's zookeeper session timed out. Why? Hard to tell with
those logs since it happened before what you pasted. Very slow IO?
Swapping + GC?
  - Your region server seemed to have moved elsewhere, or something
weird like that. DNS blip? Can't tell from the logs.

Shouldn't ZooKeeper recover nicely? How can I prevent such an error from
happening in the future?

Nothing happened to ZooKeeper. And since you only have 1 machine, even
if the ZK process did die for some reason, how could it even recover?
Reliability with ZK requires 3 machines or more; nothing can be guaranteed
with only 1 machine.
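
(As an aside, with a real 3-node ensemble the client simply lists every quorum member in its connect string and fails over between them, and the service keeps running as long as a majority, i.e. 2 of the 3, is up. A hedged sketch with made-up hostnames, just to show the shape of the connect string:)

import org.apache.zookeeper.ZooKeeper;

public class EnsembleConnectSketch {
    public static void main(String[] args) throws Exception {
        // Made-up hostnames: a 3-member ensemble listed in one connect string.
        // The client picks a member and fails over to the others; the ensemble
        // itself survives the loss of any single member (2 of 3 is a majority).
        String connectString = "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181";
        ZooKeeper zk = new ZooKeeper(connectString, 60000, event -> { });
        System.out.println("Session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}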

Now, on how to prevent it, we need to understand the root cause of the 2
issues I listed.

Also, not sure if you saw that, but the first minute in your log is
out of order. Very apparent with the first two lines.

J-D

