Re: Killing a zookeeper server

2010-01-25 Thread Jean-Daniel Cryans
Everything is here http://people.apache.org/~jdcryans/zk_election_bug.tar.gz The server we are trying to start is sv4borg222 (myid is 2) and we started it around 10:03:21 Thx! J-D On Mon, Jan 25, 2010 at 10:49 AM, Patrick Hunt ph...@apache.org wrote: 1) Capture the logs from all 5 servers 2)

Re: Killing a zookeeper server

2010-01-25 Thread Patrick Hunt
According to the log for 222, it can't open a connection to the election port (3888) for any of the other servers. This seems very unusual. Can you verify that there's connectivity on that port between 222 and all the other servers? Also, can you re-run the netstat with the -a option? We can see the
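
The connectivity check requested above can also be scripted. This is a minimal sketch, not part of the original thread: it only tests whether a TCP connection to the election port can be opened, much like telnet would. The function names and the default timeout are illustrative assumptions; the port number 3888 comes from the message above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_peers(peers, port: int = 3888) -> dict:
    """Map each peer hostname to whether its election port is reachable.

    `peers` is a caller-supplied list of hostnames (hypothetical input);
    3888 is the election port mentioned in the thread.
    """
    return {host: can_connect(host, port) for host in peers}
```

Run from the affected server, a call such as check_peers(["sv4borg224"]) would report whether the election port on that peer accepts connections. A False result does not say why the connection failed (firewall, process not listening, DNS), so the server logs and netstat output are still needed.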

Re: Killing a zookeeper server

2010-01-25 Thread Jean-Daniel Cryans
According to the log for 222, it can't open a connection to the election port (3888) for any of the other servers. This seems very unusual. Can you verify that there's connectivity on that port between 222 and all the other servers? jdcry...@sv4borg222:~$ telnet sv4borg224 3888 Trying

Re: Killing a zookeeper server

2010-01-25 Thread Patrick Hunt
JD, there's something _very_ unusual in your setup. Are you running official released ZooKeeper code or something else? Either there is a misconfiguration on the other servers (the configs for the other servers are exactly the same as 222, right?), or perhaps some patches to ZK codebase that

Re: Killing a zookeeper server

2010-01-14 Thread Patrick Hunt
: Killing a zookeeper server 12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers. Easily modifiable and perhaps we should look into changing that. The logs just seem to indicate that the servers that claim to have no server running are continually attempting

Re: Killing a zookeeper server

2010-01-14 Thread Patrick Hunt
-Original Message- From: Nick Bailey nicholas.bai...@rackspace.com Sent: Tuesday, January 12, 2010 6:03pm To: zookeeper-user@hadoop.apache.org Subject: Re: Killing a zookeeper server 12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers. Easily

Re: Killing a zookeeper server

2010-01-13 Thread Flavio Junqueira
Hi Nick, Your assessment sounds correct; the issue seems to be caused by the bug described in ZOOKEEPER-427. Can't you upgrade to a newer release? Killing the leader should do it, but the bug will still be there, so I recommend upgrading. Thanks, -Flavio On Jan 12, 2010, at 10:52 PM, Nick

Re: Killing a zookeeper server

2010-01-13 Thread Nick Bailey
@hadoop.apache.org, nicholas.bai...@rackspace.com Subject: Re: Killing a zookeeper server 12 servers? That's a lot; if you don't mind my asking, why so many? Typically we recommend 5 - that way you can have one down for maintenance and still have a failure that doesn't bring down the cluster. The electing

Re: Killing a zookeeper server

2010-01-13 Thread Adam Rosien
Subject: Re: Killing a zookeeper server 12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers.  Easily modifiable and perhaps we should look into changing that. The logs just seem to indicate that the servers that claim to have no server running

Re: Killing a zookeeper server

2010-01-13 Thread Mahadev Konar
amount of data really and network latency appears fine. Thanks for the help, Nick -Original Message- From: Nick Bailey nicholas.bai...@rackspace.com Sent: Tuesday, January 12, 2010 6:03pm To: zookeeper-user@hadoop.apache.org Subject: Re: Killing a zookeeper server 12 was just

Re: Killing a zookeeper server

2010-01-13 Thread Adam Rosien
nicholas.bai...@rackspace.com Sent: Tuesday, January 12, 2010 6:03pm To: zookeeper-user@hadoop.apache.org Subject: Re: Killing a zookeeper server 12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers.  Easily modifiable and perhaps we should look

Re: Killing a zookeeper server

2010-01-13 Thread Nick Bailey
a large amount of data really and network latency appears fine. Thanks for the help, Nick -Original Message- From: Nick Bailey nicholas.bai...@rackspace.com Sent: Tuesday, January 12, 2010 6:03pm To: zookeeper-user@hadoop.apache.org Subject: Re: Killing a zookeeper server 12 was just

Killing a zookeeper server

2010-01-12 Thread Nick Bailey
We are running ZooKeeper 3.1.0. Recently we noticed the CPU usage on our machines becoming increasingly high, and we believe the cause is https://issues.apache.org/jira/browse/ZOOKEEPER-427 However, our solution when we noticed the problem was to kill the zookeeper process and restart it.

Re: Killing a zookeeper server

2010-01-12 Thread Patrick Hunt
12 servers? That's a lot; if you don't mind my asking, why so many? Typically we recommend 5 - that way you can have one down for maintenance and still have a failure that doesn't bring down the cluster. The electing a leader is probably the restarted machine attempting to re-join the ensemble
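
The sizing advice above follows from majority-quorum arithmetic. This is a small sketch added for illustration, assuming the default majority quorum; the function names are hypothetical and not ZooKeeper APIs.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 5, 11, 12):
        print(n, "servers: quorum", quorum_size(n),
              "tolerates", tolerated_failures(n), "down")
```

With 5 servers the quorum is 3, so two can be down at once: one for maintenance plus one unexpected failure. With 12 servers the quorum is 7, which tolerates 5 failures, the same number as an 11-server ensemble would.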

Re: Killing a zookeeper server

2010-01-12 Thread Adam Rosien
I have a related question: what's the behavior of a cluster of 3 when one is down? I've tried it and a leader is elected, but are there any other caveats for this situation? .. Adam On Tue, Jan 12, 2010 at 2:40 PM, Patrick Hunt ph...@apache.org wrote: 12 servers? That's a lot; if you don't mind

Re: Killing a zookeeper server

2010-01-12 Thread Nick Bailey
@hadoop.apache.org Subject: Re: Killing a zookeeper server 12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers. Easily modifiable and perhaps we should look into changing that. The logs just seem to indicate that the servers that claim to have no server

Re: Killing a zookeeper server

2010-01-12 Thread Adam Rosien
Doh - that makes total sense. For whatever reason I thought with 2 servers you couldn't get a majority :P On Tue, Jan 12, 2010 at 3:17 PM, Henry Robinson he...@cloudera.com wrote: Hi Adam - As long as a quorum of servers is running, ZK will be live. With majority quorums, 2/3 is enough to
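
The point in the quoted reply, that 2 of 3 servers is a majority, can be written as a one-line predicate. This is an illustrative sketch assuming majority quorums as described there; has_quorum is a hypothetical name, not a ZooKeeper function.

```python
def has_quorum(servers_up: int, ensemble_size: int) -> bool:
    """True when strictly more than half of the ensemble is running."""
    return 2 * servers_up > ensemble_size
```

For example, has_quorum(2, 3) is True and has_quorum(1, 3) is False. Exactly half is not a majority, so has_quorum(6, 12) is also False.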

Re: Killing a zookeeper server

2010-01-12 Thread Patrick Hunt
-Original Message- From: Nick Bailey nicholas.bai...@rackspace.com Sent: Tuesday, January 12, 2010 6:03pm To: zookeeper-user@hadoop.apache.org Subject: Re: Killing a zookeeper server 12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers