I also see a number of these warnings in the ZooKeeper logs, which look quite
telling. ZooKeeper is running on the slaves in question, and port 3888 is
unblocked in the firewall.

2012-12-29 07:23:42,492 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open channel to 1 at election address slave1.analytics-internal.lokistudios.com/10.171.98.247:3888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
        at java.net.Socket.connect(Socket.java:546)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
        at java.lang.Thread.run(Thread.java:679)
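
"Connection refused" generally means the host was reached but nothing accepted
the connection on that port (or something actively rejected it), so an open
firewall rule alone does not show that the election port is being served. A
minimal sketch of a plain TCP check that could be run from the node logging the
warning, assuming Python 3 is available there; the function name can_connect
and the timeout value are illustrative choices, not part of ZooKeeper or HBase:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

For instance, can_connect("slave1.analytics-internal.lokistudios.com", 3888)
returning False from another quorum member would be consistent with the
warning above, though it would not by itself say why the port is not accepting
connections.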


-- 
Marco Gallotta | Mountain View, California
Software Engineer, Infrastructure | Loki Studios
fb.me/marco.gallotta | twitter.com/marcog
ma...@gallotta.co.za | +1 (650) 417-3313



On Saturday 29 December 2012 at 1:44 PM, Marco Gallotta wrote:

> Hi there 
> 
> I've been running an hbase cluster for several months, and it recently 
> experienced problems as the nodes reached 95% disk capacity. I added an extra 
> node, and now the master keeps crashing with the errors below. I also 
> increased the disk capacity on each individual node after this, and the 
> errors are the same. I tried removing the new node, and that doesn't help.
> 
> There are similar errors in the regionserver and ZooKeeper logs, but they all 
> seem to echo the errors in the master logs.
> 
> Anything I can look at to help diagnose what the problem here is?
> 
> hbase-root-master-analytics.log:
> Sat Dec 29 03:14:22 PST 2012 Starting master on analytics
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 59480
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 59480
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> 2012-12-29 03:14:24,601 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,614 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,622 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,631 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,636 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,643 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,651 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,665 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,675 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,698 INFO org.apache.hadoop.ipc.HBaseServer: Starting IPC Server listener on 60000
> 2012-12-29 03:14:25,322 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:28,735 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:32,797 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:41,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:41,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 3 retries
> 2012-12-29 03:14:41,428 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
>         at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1740)
>         at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:146)
>         at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:103)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
>         at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1754)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1021)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1049)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:176)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:896)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.createBaseZNodes(ZooKeeperWatcher.java:161)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:154)
>         at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:281)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
>         at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1735)
>         ... 5 more
> 
> 
> 
