I also see a number of these warnings in the zookeeper logs, which look quite telling. Zookeeper is running on the slaves in question, and port 3888 is unblocked in the firewall.
2012-12-29 07:23:42,492 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open channel to 1 at election address slave1.analytics-internal.lokistudios.com/10.171.98.247:3888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
        at java.net.Socket.connect(Socket.java:546)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
        at java.lang.Thread.run(Thread.java:679)

--
Marco Gallotta | Mountain View, California
Software Engineer, Infrastructure | Loki Studios
fb.me/marco.gallotta | twitter.com/marcog
ma...@gallotta.co.za | +1 (650) 417-3313

Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Saturday 29 December 2012 at 1:44 PM, Marco Gallotta wrote:

> Hi there
>
> I've been running an hbase cluster for several months, and it recently
> experienced problems as the nodes reached 95% disk capacity. I added an extra
> node, and now the master keeps crashing with the errors below. I also
> increased the disk capacity on each individual node after this, and the
> errors are the same. I tried removing the new node, and that doesn't help.
>
> There are similar errors in the regionserver and zookeeper logs, but they all
> seem to echo from the master logs.
>
> Anything I can look at to help diagnose what the problem here is?
>
> hbase-root-master-analytics.log:
> Sat Dec 29 03:14:22 PST 2012 Starting master on analytics
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size (blocks, -f) unlimited
> pending signals (-i) 59480
> max locked memory (kbytes, -l) 64
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority (-r) 0
> stack size (kbytes, -s) 8192
> cpu time (seconds, -t) unlimited
> max user processes (-u) 59480
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
> 2012-12-29 03:14:24,601 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,614 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,622 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,631 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,636 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,643 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,651 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,665 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,675 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,698 INFO org.apache.hadoop.ipc.HBaseServer: Starting IPC Server listener on 60000
> 2012-12-29 03:14:25,322 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:28,735 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:32,797 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:41,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:41,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 3 retries
> 2012-12-29 03:14:41,428 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
>         at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1740)
>         at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:146)
>         at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:103)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
>         at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1754)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1021)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1049)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:176)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:896)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.createBaseZNodes(ZooKeeperWatcher.java:161)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:154)
>         at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:281)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
>         at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1735)
>         ... 5 more
>
> --
> Marco Gallotta | Mountain View, California
> Software Engineer, Infrastructure | Loki Studios
> fb.me/marco.gallotta (http://fb.me/marco.gallotta) | twitter.com/marcog (http://twitter.com/marcog)
> ma...@gallotta.co.za (mailto:ma...@gallotta.co.za) | +1 (650) 417-3313
>
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
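
One way to double-check the claim that port 3888 is reachable is a small TCP connect probe run from the master host. The sketch below is only an illustration, not anything taken from HBase or ZooKeeper: the `probe` and `report` helpers are hypothetical names, and port 2181 is included on the assumption that the ensemble uses ZooKeeper's default client port. A "refused" result, as in the ConnectException in the ZooKeeper log, usually means the host answered but nothing is listening on that port (or a firewall is actively rejecting), whereas a firewall that silently drops packets tends to show up as a timeout.

```python
import socket


def probe(host, port, timeout=3.0):
    """Attempt one TCP connect and classify the outcome.

    Returns 'open', 'refused', 'timeout', or 'error: <details>'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied with a reset: typically no listener on this port.
        return "refused"
    except socket.timeout:
        # No reply at all: often a packet-dropping firewall or a dead host.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. DNS resolution failure or no route to host.
        return "error: %s" % exc


def report(hosts, ports=(2181, 3888)):
    """Probe every host/port pair and return (host, port, status) tuples.

    2181 is assumed to be the client port (ZooKeeper's default);
    3888 is the election port that appears in the log above.
    """
    return [(host, port, probe(host, port)) for host in hosts for port in ports]
```

For example, calling `report(["slave1.analytics-internal.lokistudios.com"])` on the master and printing the tuples would show whether the election port on that slave is refusing connections or timing out.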