So I see 2 main issues:
- Your master's zookeeper session timed out. Why? Hard to tell with
those logs since it happened before what you pasted. Very slow IO?
I'm not swapping, and I doubt that the ZooKeeper session timed out because of
slow IO, since my applications aren't even close to stressing the hardware. I've
already followed the instructions in
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9 to avoid this kind of
problem.
Swapping + GC?
- Your region server seemed to have moved elsewhere, or something
weird like that. DNS blip? Can't tell from the logs.
Maybe a DNS blip. But how can I confirm it? Logs? I didn't move anything and
as soon as I restarted the cluster things got back on track.
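One way to look for a DNS problem is to compare forward and reverse lookups for the node's hostname at regular intervals. The following is only a minimal sketch, not an HBase tool: the function name `check_dns_consistency` and its injectable `forward`/`reverse` parameters are made up for illustration, and a single successful check does not rule out a transient blip that happened earlier.

```python
import socket


def check_dns_consistency(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, resolve that IP back to a name, and compare.

    Returns (ip, reverse_name, consistent). The resolver functions are
    parameters (an assumption of this sketch) so they can be swapped out.
    """
    ip = forward(hostname)
    try:
        # gethostbyaddr returns (primary_name, alias_list, ip_list)
        reverse_name = reverse(ip)[0]
    except OSError:
        # socket.herror / socket.gaierror are OSError subclasses
        reverse_name = None
    consistent = (
        reverse_name is not None
        and reverse_name.rstrip(".").lower() == hostname.rstrip(".").lower()
    )
    return ip, reverse_name, consistent
```

Calling `check_dns_consistency(socket.getfqdn())` from cron on the node and logging the result would at least show whether the name-to-address mapping changes over time.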
The HBase master log just repeated the text below for the last 8 hours
before the crash. The zookeeper and region server logs are clear of
errors.
2010-05-27 08:28:15,118 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020,
regionname: -ROOT-,,0, startKey:<>}
2010-05-27 08:28:15,125 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.251.158.224:60020, regionname: -ROOT-,,0, startKey:<>} complete
2010-05-27 08:28:26,379 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.251.158.224:60020,
regionname: .META.,,1, startKey:<>}
2010-05-27 08:28:26,787 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 73 row(s) of meta region {server:
10.251.158.224:60020, regionname: .META.,,1, startKey:<>} complete
2010-05-27 08:28:26,788 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2010-05-27 08:28:32,603 INFO org.apache.hadoop.hbase.master.ServerManager: 1
region servers, 0 dead, average load 75.0
2010-05-27 08:29:15,123 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.251.158.224:60020,
regionname: -ROOT-,,0, startKey:<>}
2010-05-27 08:29:15,138 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.251.158.224:60020, regionname: -ROOT-,,0, startKey:<>} complete
2010-05-27 08:29:26,380 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.251.158.224:60020,
regionname: .META.,,1, startKey:<>}
2010-05-27 08:29:26,738 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 73 row(s) of meta region {server:
10.251.158.224:60020, regionname: .META.,,1, startKey:<>} complete
2010-05-27 08:29:26,738 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2010-05-27 08:29:32,605 INFO org.apache.hadoop.hbase.master.ServerManager: 1
region servers, 0 dead, average load 75.0
On Thu, May 27, 2010 at 12:49 PM, Jean-Daniel Cryans <[email protected]> wrote:
From what I see, nothing happened to zookeeper.
What happened:
1) The master wasn't able to scan the -ROOT- region because the
connection was refused (same with .META.)
2010-05-27 08:40:44,259 WARN org.apache.hadoop.hbase.master.BaseScanner:
Scan ROOT region
java.io.IOException: Call to /10.251.158.224:60020 failed on local
exception: java.io.IOException: Connection reset by peer
2) The master's session with zookeeper timed out
2010-05-27 08:40:46,630 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x128c8b303040000 to sun.nio.ch.selectionkeyi...@744e022c
java.io.IOException: Session Expired
at
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
3) The master was kicked out of the cluster and tried to re-enter
2010-05-27 08:40:46,631 INFO org.apache.hadoop.hbase.master.HMaster: Master
lost its znode, trying to get a new one
4) The master was able to win the race to be the main master again
(easy, there's only 1 machine in your cluster)
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Wrote master address
10.251.158.224:60000 to ZooKeeper
5) This master still isn't able to scan -ROOT-
2010-05-27 08:41:44,270 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server:
10.251.158.224:60020,
regionname: -ROOT-,,0, startKey:<>}
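The sequence above was pieced together by reading the master log by hand. A rough sketch of how the same timeline could be pulled out of a log file is shown below. It assumes each entry in the raw (unwrapped) log starts with `timestamp LEVEL logger: message`, as in the excerpts; the function name `extract_events` and the keyword list are illustrative choices, not part of HBase.

```python
import re

# Matches entries such as:
# 2010-05-27 08:40:46,631 INFO org.apache.hadoop.hbase.master.HMaster: Master lost ...
ENTRY_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<logger>[\w.$]+): ?(?P<message>.*)$"
)

# Phrases taken from the excerpts in this thread (illustrative, not exhaustive).
KEYWORDS = ("lost its znode", "Wrote master address")


def extract_events(lines):
    """Return (timestamp, level, message) for WARN+ entries and keyword hits."""
    events = []
    for line in lines:
        match = ENTRY_RE.match(line.rstrip("\n"))
        if not match:
            continue  # continuation line, e.g. part of a stack trace
        level = match.group("level")
        message = match.group("message")
        if level in ("WARN", "ERROR", "FATAL") or any(k in message for k in KEYWORDS):
            events.append((match.group("timestamp"), level, message))
    return events
```

Feeding it an open log file (`extract_events(open(path))`, with `path` pointing at the master log) skips the repetitive scanner INFO lines and leaves only the entries around the failure.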
So I see 2 main issues:
- Your master's zookeeper session timed out. Why? Hard to tell with
those logs since it happened before what you pasted. Very slow IO?
Swapping + GC?
- Your region server seemed to have moved elsewhere, or something
weird like that. DNS blip? Can't tell from the logs.
Shouldn't Zookeeper recover nicely? How can I prevent such errors from
happening in the future?
Nothing happened to zookeeper. And since you have only 1 machine, even
if the ZK process did die for some reason, how could it even recover?
Reliability with ZK requires 3 machines or more; nothing can be guaranteed
with only 1 machine.
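The "3 machines or more" point follows from ZooKeeper's majority quorum: an ensemble stays available only while more than half of its servers are up. The arithmetic can be sketched as below; the helper names `quorum_size` and `tolerated_failures` are invented for this illustration.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 5):
    print(n, "servers: quorum", quorum_size(n),
          "tolerates", tolerated_failures(n), "failure(s)")
```

With 1 server the quorum is 1 and no failure is tolerated; 3 servers is the smallest ensemble that survives the loss of one machine.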
Now, as for how to prevent it, we first need to understand the root cause
of the 2 issues I listed.
Also, not sure if you saw that, but the first minute in your log is
out of order. Very apparent with the first two lines.
J-D
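The out-of-order check mentioned above can also be automated rather than spotted by eye. This is a minimal sketch that assumes every log entry begins with a `yyyy-MM-dd HH:mm:ss,SSS` timestamp, as in the pasted excerpts; `find_out_of_order` is a hypothetical helper name.

```python
import re
from datetime import datetime

TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def find_out_of_order(lines):
    """Return (line_number, timestamp, previous_timestamp) for each entry
    whose timestamp is earlier than that of the entry before it."""
    out_of_order = []
    previous = None
    for number, line in enumerate(lines, start=1):
        match = TIMESTAMP_RE.match(line)
        if not match:
            continue  # wrapped message or stack-trace line without a timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if previous is not None and current < previous:
            out_of_order.append((number, current, previous))
        previous = current
    return out_of_order
```

An empty result means the timestamps never go backwards; any hits point at places where the paste was reordered or where entries from different sources were mixed.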