We see this line frequently, and the timeout gradually increases to 60000. It
appears that all of our servers that seem to be acting normally are
experiencing the CPU issue I mentioned earlier
(https://issues.apache.org/jira/browse/ZOOKEEPER-427). Perhaps that is causing
the timeout in responding?
Also, to answer your other questions, Patrick: we aren't really storing a large
amount of data, and network latency appears fine.
Thanks for the help,
Nick
-----Original Message-----
From: "Nick Bailey" <nicholas.bai...@rackspace.com>
Sent: Tuesday, January 12, 2010 6:03pm
To: zookeeper-user@hadoop.apache.org
Subject: Re: Killing a zookeeper server
12 was just to keep uniformity across our servers; our clients connect from
the same 12 machines. That's easily modifiable, and perhaps we should look into
changing it.
The logs just seem to indicate that the servers that claim to have no server
running are continually attempting to elect a leader. A sample is provided
below. The initial exception is something we see regularly in our logs, and the
DEBUG and INFO lines that follow simply repeat throughout the log.
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN -
Exception causing close of session 0x0 due to java.io.IOException: Read error
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG -
IOException stack trace
java.io.IOException: Read error
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
        at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO -
closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected
local=/172.20.36.9:2181 remote=/172.20.36.9:50367]
2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG -
ignoring exception during input shutdown
java.net.SocketException: Transport endpoint is not connected
        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
        at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
        at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
        at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO -
Adding vote
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO -
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 9
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG -
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO -
Adding vote
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO -
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 11
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG -
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO -
Adding vote
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO -
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG -
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO -
Adding vote
2010-01-12 17:56:52,190 [WorkerReceiver Thread]
(FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
2010-01-12 17:56:52,190 [WorkerReceiver Thread]
(FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO -
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 8
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG -
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO -
Adding vote
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO -
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG -
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO -
Adding vote
2010-01-12 17:57:52,200 [WorkerReceiver Thread]
(FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
2010-01-12 17:57:52,201 [WorkerReceiver Thread]
(FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
-----Original Message-----
From: "Patrick Hunt" <ph...@apache.org>
Sent: Tuesday, January 12, 2010 5:40pm
To: zookeeper-user@hadoop.apache.org, nicholas.bai...@rackspace.com
Subject: Re: Killing a zookeeper server
12 servers? That's a lot - if you don't mind my asking, why so many?
Typically we recommend 5 - that way you can have one down for
maintenance and still tolerate a failure without bringing down the cluster.
The "electing a leader" is probably the restarted machine attempting to
re-join the ensemble (it should join as a follower if you have a leader
already elected, given that it's xid is behind the existing leader.)
Hard to tell though without the logs.
You might also be exceeding initLimit. Is the data you are storing in ZK
large? Or perhaps network connectivity is slow?
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
Again, the logs would give some insight on this.
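For reference, here's a minimal zoo.cfg sketch showing the timing options that
page describes; the values are illustrative examples from the docs, not your
actual settings, and the hostnames are placeholders:

# illustrative zoo.cfg - example values, not a recommendation
tickTime=2000             # base time unit, in milliseconds
initLimit=5               # followers get initLimit*tickTime (10s here) to connect and sync to the leader
syncLimit=2               # followers may lag the leader by at most syncLimit*tickTime (4s here)
dataDir=/var/zookeeper
clientPort=2181
server.1=zoo1:2888:3888   # list every ensemble member: quorum port, election port
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888

If snapshots are large or the network is slow, raising initLimit is usually the
first knob to try.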
Patrick
Nick Bailey wrote:
We are running ZooKeeper 3.1.0.
Recently we noticed the CPU usage on our machines becoming
increasingly high, and we believe the cause is
https://issues.apache.org/jira/browse/ZOOKEEPER-427
However, our solution when we noticed the problem was to kill the
ZooKeeper process and restart it.
After doing that, though, it looks like the newly restarted ZooKeeper
server is continually attempting to elect a leader even though one
already exists.
The process responds to 'ruok' with 'imok', but the 'stat' command
returns 'ZooKeeperServer not running'.
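In other words (a shell sketch of the check we're running - the host here is
one of our servers from the logs above, and any ensemble member would do):

$ echo ruok | nc 172.20.36.9 2181
imok
$ echo stat | nc 172.20.36.9 2181
ZooKeeperServer not running

'ruok' only says the process is alive; 'stat' reflects whether the server has
actually joined a quorum, which is presumably why the two disagree here.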
I believe that killing the current leader should trigger all servers
to do an election and solve the problem, but I'm not sure. Should
that be the course of action in this situation?
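If so, presumably we'd first find the leader with something like the sketch
below (hostnames are placeholders; this assumes 'stat' reports a 'Mode:' line
on healthy servers) and then restart only that process:

$ for h in zk1 zk2 zk3; do echo -n "$h: "; echo stat | nc $h 2181 | grep Mode; done
zk1: Mode: follower
zk2: Mode: leader
zk3: Mode: follower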
Also, we have 12 servers, but 5 are currently not running according to
'stat'. The remaining 7 are still a majority of 12, so I guess this isn't a
problem unless we lose another one.
We have plans to upgrade ZooKeeper to solve the CPU issue but haven't
been able to do that yet.
Any help appreciated,
Nick Bailey