Build failed in Hudson: ZooKeeper-trunk #915

2010-08-26 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/915/

--
[...truncated 510 lines...]
  [javadoc] Javadoc execution
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/AsyncCallback.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/CreateMode.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/KeeperException.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/ServerAdminClient.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/Watcher.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/ZooDefs.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/ZooKeeper.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/ZooKeeperMain.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/server/LogFormatter.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/server/PurgeTxnLog.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServerMain.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerMain.java...
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/server/upgrade/UpgradeMain.java...
  [javadoc] Loading source files for package org.apache.zookeeper.data...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.6.0_11
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...

javadoc-jar:
  [jar] Building jar: https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/build/zookeeper-3.4.0-javadoc.jar

ivy-retrieve-jdiff:
[mkdir] Created dir: https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/build/jdiff/lib
[ivy:retrieve] :: resolving dependencies :: org.apache.zookeeper#zookeeper;3.4.0
[ivy:retrieve]  confs: [jdiff]
[ivy:retrieve]  found jdiff#jdiff;1.0.9 in default
[ivy:retrieve]  found xerces#xerces;1.4.4 in default
[ivy:retrieve] :: resolution report :: resolve 122ms :: artifacts dl 4ms
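---------------------------------------------------------------------
|                  |            modules            ||   artifacts   |
|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
|       jdiff      |   2   |   0   |   0   |   0   ||   2   |   0   |
---------------------------------------------------------------------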
[ivy:retrieve] :: retrieving :: org.apache.zookeeper#zookeeper
[ivy:retrieve]  confs: [jdiff]
[ivy:retrieve]  2 artifacts copied, 0 already retrieved (1896kB/11ms)

write-null:

api-xml:
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.zookeeper...
  [javadoc] Constructing Javadoc information...
  [javadoc] JDiff: doclet started ...
  [javadoc] JDiff: writing the API to file 'https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/lib/jdiff/zookeeper_3.4.0.xml'...
  [javadoc] https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/ClientWatchManager.java:38: warning - @return tag has no arguments.
  [javadoc] JDiff: finished (took 0s, not including scanning the source files).
  [javadoc] 1 warning

api-report:
[mkdir] Created dir: https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/build/docs/jdiff
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/build/jdiff/lib/Null.java...
  [javadoc] Loading source files for package org.apache.jute.compiler...
  [javadoc] Loading source files for package org.apache.jute.compiler.generated...
  [javadoc] Loading source files for package org.apache.zookeeper...
  [javadoc] Loading source files for package org.apache.zookeeper.common...
  [javadoc] Loading 

[jira] Updated: (ZOOKEEPER-855) clientPortBindAddress should be clientPortAddress

2010-08-26 Thread Jared Cantwell (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jared Cantwell updated ZOOKEEPER-855:
-

Priority: Trivial  (was: Major)

 clientPortBindAddress should be clientPortAddress
 -

 Key: ZOOKEEPER-855
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-855
 Project: Zookeeper
  Issue Type: Bug
  Components: documentation
Affects Versions: 3.3.0, 3.3.1
Reporter: Jared Cantwell
Priority: Trivial

 The server documentation states that the configuration parameter for binding 
 to a specific IP address is clientPortBindAddress.  The code expects the 
 parameter to be named clientPortAddress.  The documentation for 3.3.X 
 versions needs to be changed to reflect the correct parameter.  This 
 parameter was added in ZOOKEEPER-635.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (ZOOKEEPER-855) clientPortBindAddress should be clientPortAddress

2010-08-26 Thread Jared Cantwell (JIRA)
clientPortBindAddress should be clientPortAddress
-

 Key: ZOOKEEPER-855
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-855
 Project: Zookeeper
  Issue Type: Bug
  Components: documentation
Affects Versions: 3.3.1, 3.3.0
Reporter: Jared Cantwell


The server documentation states that the configuration parameter for binding to 
a specific IP address is clientPortBindAddress.  The code expects the 
parameter to be named clientPortAddress.  The documentation for 3.3.X versions 
needs to be changed to reflect the correct parameter.  This parameter was 
added in ZOOKEEPER-635.
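
To illustrate why the key name matters, here is a minimal Java sketch of how a server could turn these settings into a bind address. This is not ZooKeeper's actual parsing code: the class and method names (ClientBindAddress, fromConfig) are hypothetical, and only the clientPort and clientPortAddress keys come from this issue.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.InetSocketAddress;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of ZooKeeper.
class ClientBindAddress {
    // Builds the client bind address from clientPort and the optional clientPortAddress.
    static InetSocketAddress fromConfig(Properties cfg) {
        int port = Integer.parseInt(cfg.getProperty("clientPort"));
        String host = cfg.getProperty("clientPortAddress");
        // Key absent or empty: fall back to the wildcard address (all interfaces).
        if (host == null || host.trim().isEmpty()) {
            return new InetSocketAddress(port);
        }
        return new InetSocketAddress(host.trim(), port);
    }

    public static void main(String[] args) throws IOException {
        Properties cfg = new Properties();
        cfg.load(new StringReader("clientPort=2181\nclientPortAddress=127.0.0.1\n"));
        System.out.println(fromConfig(cfg));
    }
}
```

In a parser shaped like this sketch, a config file that uses the documented name clientPortBindAddress would not raise an error; the unknown key would simply be ignored and the server would bind to all interfaces.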




[jira] Commented: (ZOOKEEPER-366) Session timeout detection can go wrong if the leader system time changes

2010-08-26 Thread Patrick Hunt (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902937#action_12902937 ]

Patrick Hunt commented on ZOOKEEPER-366:


One thing we should do - add sufficient logging (warn level or higher I would 
say) to ensure if this does happen in production we have a record of it in the 
log.

 Session timeout detection can go wrong if the leader system time changes
 

 Key: ZOOKEEPER-366
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-366
 Project: Zookeeper
  Issue Type: Bug
Reporter: Benjamin Reed
Assignee: Benjamin Reed
 Attachments: ZOOKEEPER-366.patch


 The leader tracks session expirations by calculating when a session will 
 time out and then periodically checking what needs to be timed out 
 based on the current time. This works well as long as the leader's clock 
 progresses at a steady pace. The problem comes when there are big changes 
 in the clock (on the order of a session timeout), by NTP for example. If 
 time gets adjusted forward, all the sessions could time out immediately. If 
 time goes backward, sessions that should time out may take a lot longer to 
 actually expire.
 This is really just a leader issue. The easiest way to deal with it is to 
 have the leader relinquish leadership if it detects a big jump forward in 
 time. When a new leader gets elected, it will recalculate timeouts of active 
 sessions.
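
One possible way to detect such a jump, sketched below under stated assumptions, is to compare the wall clock against a clock that is not affected by wall-clock adjustments (System.nanoTime() in Java). The class name ClockJumpDetector is hypothetical; this is not taken from the attached patch.

```java
// Hypothetical sketch for illustration only; not ZooKeeper code.
class ClockJumpDetector {
    private long lastWallMs;
    private long lastMonoNs;

    ClockJumpDetector(long wallMs, long monoNs) {
        this.lastWallMs = wallMs;
        this.lastMonoNs = monoNs;
    }

    // Returns how far the wall clock moved beyond the elapsed monotonic time, in ms.
    // Positive: the wall clock jumped forward. Negative: it jumped backward.
    long sample(long wallMs, long monoNs) {
        long wallDeltaMs = wallMs - lastWallMs;
        long monoDeltaMs = (monoNs - lastMonoNs) / 1_000_000L;
        lastWallMs = wallMs;
        lastMonoNs = monoNs;
        return wallDeltaMs - monoDeltaMs;
    }

    public static void main(String[] args) throws InterruptedException {
        ClockJumpDetector d =
            new ClockJumpDetector(System.currentTimeMillis(), System.nanoTime());
        Thread.sleep(100);
        long jumpMs = d.sample(System.currentTimeMillis(), System.nanoTime());
        // A leader could log at WARN level, or step down, when jumpMs is large.
        System.out.println("wall clock moved " + jumpMs + " ms relative to monotonic time");
    }
}
```

A caller would sample this on each expiration check and treat a result near the session timeout as the "big jump" described above; the threshold itself is left open here.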




Proposed: Leader communication should listen on specified IP, not wildcard address

2010-08-26 Thread Jared Cantwell
Hello,

My project currently has the need to specify the local address that is used
for leader communication (and not use the default of listening on all
interfaces).  This is similar to the clientPortAddress parameter that was
recently added.  After reviewing the code, we can't think of a reason why
only the port would be used with the wildcard interface, when servers are
already connecting specifically to that interface anyway.  Is binding to the
wildcard interface for leader communication intentional?

I believe the change would be straightforward-- one change for each leader
port used.  Note: this doesn't account for all leader election algorithms,
only the default.

Index: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
===================================================================
--- src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java (revision 989805)
+++ src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java (working copy)
@@ -434,7 +434,7 @@
             ss = ServerSocketChannel.open();
             int port = self.quorumPeers.get(self.getId()).electionAddr.getPort();
             ss.socket().setReuseAddress(true);
-            InetSocketAddress addr = new InetSocketAddress(port);
+            InetSocketAddress addr = self.quorumPeers.get(self.getId()).electionAddr;
             LOG.info("My election bind port: " + addr.toString());
             setName(addr.toString());
             ss.socket().bind(addr);
Index: src/java/main/org/apache/zookeeper/server/quorum/Leader.java
===================================================================
--- src/java/main/org/apache/zookeeper/server/quorum/Leader.java (revision 989805)
+++ src/java/main/org/apache/zookeeper/server/quorum/Leader.java (working copy)
@@ -128,10 +128,11 @@
     Leader(QuorumPeer self,LeaderZooKeeperServer zk) throws IOException {
         this.self = self;
         try {
-            ss = new ServerSocket(self.getQuorumAddress().getPort());
+            ss = new ServerSocket();
+            ss.bind(self.getQuorumAddress());
         } catch (BindException e) {
-            LOG.error("Couldn't bind to port "
-                    + self.getQuorumAddress().getPort(), e);
+            LOG.error("Couldn't bind to address "
+                    + self.getQuorumAddress().getAddress() + ":" + self.getQuorumAddress().getPort(), e);
             throw e;
         }
         this.zk=zk;
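
For reference, the difference between the two bind styles can be shown outside of ZooKeeper with a small standalone Java sketch. The class and method names (BindDemo, bindWildcard, bindSpecific) are made up for illustration, and port 0 is used so the OS picks a free port.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Standalone illustration; names are hypothetical and unrelated to ZooKeeper classes.
class BindDemo {
    // Port-only constructor: listens on the wildcard address (all interfaces).
    static ServerSocket bindWildcard(int port) throws IOException {
        return new ServerSocket(port);
    }

    // Create an unbound socket, then bind it to one specific local address.
    static ServerSocket bindSpecific(InetSocketAddress addr) throws IOException {
        ServerSocket ss = new ServerSocket();
        ss.setReuseAddress(true);
        ss.bind(addr);
        return ss;
    }

    public static void main(String[] args) throws IOException {
        InetSocketAddress loopback =
            new InetSocketAddress(InetAddress.getLoopbackAddress(), 0);
        try (ServerSocket any = bindWildcard(0);
             ServerSocket lo = bindSpecific(loopback)) {
            System.out.println("wildcard: " + any.getLocalSocketAddress());
            System.out.println("specific: " + lo.getLocalSocketAddress());
        }
    }
}
```

The second form is what the patch above switches to: the socket only accepts connections arriving on the configured address.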


Does this seem like a reasonable change? Thoughts?

~Jared


[jira] Commented: (ZOOKEEPER-822) Leader election taking a long time to complete

2010-08-26 Thread Vishal K (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902988#action_12902988 ]

Vishal K commented on ZOOKEEPER-822:


The fix for problems 1 and 2 above eliminates the bug. I will have a patch out 
soon.

 Leader election taking a long time to complete
 ---

 Key: ZOOKEEPER-822
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum
Affects Versions: 3.3.0
Reporter: Vishal K
Priority: Blocker
 Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, 
 test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz


 Created a 3 node cluster.
 1. Fail the ZK leader
 2. Let leader election finish. Restart the leader and let it join the 
 3. Repeat 
 After a few rounds leader election takes anywhere from 25 to 60 seconds to finish. 
 Note- we didn't have any ZK clients and no new znodes were created.
 zoo.cfg is shown below:
 #Mon Jul 19 12:15:10 UTC 2010
 server.1=192.168.4.12\:2888\:3888
 server.0=192.168.4.11\:2888\:3888
 clientPort=2181
 dataDir=/var/zookeeper
 syncLimit=2
 server.2=192.168.4.13\:2888\:3888
 initLimit=5
 tickTime=2000
 I have attached logs from two nodes that took a long time to form the cluster 
 after failing the leader. The leader was down anyway, so logs from that node 
 shouldn't matter.
 Look for START HERE. Logs after that point are the ones of interest.




[jira] Created: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-26 Thread Travis Crawford (JIRA)
Connection imbalance leads to overloaded ZK instances
-

 Key: ZOOKEEPER-856
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856
 Project: Zookeeper
  Issue Type: Bug
Reporter: Travis Crawford


We've experienced a number of issues lately where ruok requests would take 
upwards of 10 seconds to return, and ZooKeeper instances were extremely 
sluggish. The sluggish instance requires a restart to make it responsive again.

I believe the issue is connections are very imbalanced, leading to certain 
instances having many thousands of connections, while other instances are 
largely idle.

A potential solution is periodically disconnecting/reconnecting to balance 
connections over time; this seems fine because sessions should not be affected, 
and therefore ephemeral nodes and watches should not be affected.




[jira] Updated: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-26 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-856:
--

Attachment: zk_open_file_descriptor_count_total.gif
zk_open_file_descriptor_count_members.gif

Attached are two graphs showing:

- Total ZooKeeper connections to a 3 node cluster
- Connections per member in the cluster

In the totals graph, notice how it's largely unchanged over time. This period 
represents a steady-state period of usage.

In the members graph, notice how the number of connections is significantly 
different between machines. This cluster allows the leader to service reads, so 
that's not something to factor in when interpreting the number of connections.

These graphs look very similar to an issue I had with another service (scribe) 
and we solved the issue by disconnecting every N+-K messages. We tried getting 
fancy by publishing load metrics and using a smart selection algorithm. Turns 
out in practice though the periodic disconnect/reconnect was easier to 
implement and worked better, so I'm tossing that idea out as a potential 
solution here.
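
A minimal Java sketch of the "disconnect every N +/- K messages" idea follows. It is an illustration under assumptions, not the scribe implementation or a ZooKeeper patch: the class name JitteredDisconnectPolicy and its method are hypothetical, and the threshold is assumed to be drawn uniformly from [N - K, N + K].

```java
import java.util.Random;

// Hypothetical policy object for illustration; not scribe or ZooKeeper code.
class JitteredDisconnectPolicy {
    private final int n;
    private final int k;
    private final Random rng;
    private int remaining;

    JitteredDisconnectPolicy(int n, int k, Random rng) {
        if (n <= 0 || k < 0 || k >= n) {
            throw new IllegalArgumentException("need n > 0 and 0 <= k < n");
        }
        this.n = n;
        this.k = k;
        this.rng = rng;
        this.remaining = nextThreshold();
    }

    // Picks the next message count uniformly from [n - k, n + k].
    private int nextThreshold() {
        return n - k + rng.nextInt(2 * k + 1);
    }

    // Call once per message; returns true when the connection should be recycled.
    boolean onMessage() {
        if (--remaining > 0) {
            return false;
        }
        remaining = nextThreshold();
        return true;
    }

    public static void main(String[] args) {
        JitteredDisconnectPolicy policy =
            new JitteredDisconnectPolicy(1000, 100, new Random());
        int sent = 0;
        do {
            sent++;
        } while (!policy.onMessage());
        System.out.println("first reconnect after " + sent + " messages");
    }
}
```

The jitter keeps clients from reconnecting in lockstep, so the reconnects themselves do not arrive as one burst.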




[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-26 Thread Mahadev konar (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903017#action_12903017 ]

Mahadev konar commented on ZOOKEEPER-856:
-

Travis, 
we have had a lot of discussion on load balancing. I'd really like to try and 
see how the disconnect and reconnect works for load balancing. I also agree 
with you that it might be a good enough solution for load balancing. I can 
upload a simple patch for this. Would you have some bandwidth to try it out 
and report how well it works?




[jira] Updated: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-26 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-856:


Fix Version/s: 3.4.0




[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-26 Thread Travis Crawford (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903065#action_12903065 ]

Travis Crawford commented on ZOOKEEPER-856:
---

@mahadev - I would love to help test a patch :) I'm currently using 3.3.1 + 
ZOOKEEPER-744 + ZOOKEEPER-790, applied in that order.

If there's a knob for how frequently to disconnect/reconnect I can try out 
different settings to see what a sensible default would be.

Do you think this should be a client or server setting? I'm thinking a server 
setting because otherwise it's not possible to enforce the policy.




Putting copyright notices in ZK?

2010-08-26 Thread Vishal K
Hi All,

I work for VMware. My company tells me that any contribution that I make to
ZK needs to have a line saying "Copyright [year of creation - year of last
modification] VMware, Inc. All Rights Reserved."
If portions of a file are modified, then I could identify only those
portions of the file, if needed. No change to license is required.

Needless to say, I am personally ok with making contributions without any such
notices. What is ZK's policy on this? What would be a good solution in this
case that satisfies both parties (ZK and my company's legal dept.)?
Thanks.
-Vishal


Re: Putting copyright notices in ZK?

2010-08-26 Thread Henry Robinson
Hi Vishal -

I'm afraid we don't allow author or copyright information in source files.
Putting one's own copyright notice in a file is against Apache policy (and we
are guided by the rules of the ASF). The SVN logs will keep track of ownership
details, but it's not at all clear what copyright notices even mean once you
have granted license to the ASF by virtue of submitting your patch. To avoid
any confusion, we just disallow author-specific information in the source.

I hope you can find some compromise with your legal department - I'm pretty
sure I know of other contributions from VMware employees to open source
projects that don't have this restriction, so I'm hopeful that you can
resolve this issue.

Best,
Henry


On 26 August 2010 14:58, Vishal K <vishalm...@gmail.com> wrote:

> Hi All,
>
> I work for VMware. My company tells me that any contribution that I make to
> ZK needs to have a line saying "Copyright [year of creation - year of last
> modification] VMware, Inc. All Rights Reserved."
> If portions of a file are modified, then I could identify only those
> portions of the file, if needed. No change to license is required.
>
> Needless to say, I am personally ok with making contributions without any such
> notices. What is ZK's policy on this? What would be a good solution in this
> case that satisfies both parties (ZK and my company's legal dept.)?
> Thanks.
> -Vishal




-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-26 Thread Patrick Hunt (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903146#action_12903146 ]

Patrick Hunt commented on ZOOKEEPER-856:


Have you monitored the JVMs for GC activity? Are you using CMS/incremental GC 
rather than the default GC setup? I'm all for adding balancing, but it would be 
good to rule out GC/swap/IO as an issue.




[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-26 Thread Travis Crawford (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903170#action_12903170 ]

Travis Crawford commented on ZOOKEEPER-856:
---

@patrick - We're using these settings, which I believe are based on what's 
recommended in the troubleshooting guide.

-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-XX:+UseConcMarkSweepGC

Looking at the logs I do see lots of GC activity. For example:

Total time for which application threads were stopped: 0.5599050 seconds
Application time: 0.0056590 seconds

I only see this on the hosts that became unresponsive after acquiring lots of 
connections.
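
To put a number on this, the two kinds of log lines shown above can be summed to estimate what fraction of time the application threads were stopped. The Java sketch below is illustrative only: the class name GcPauseSummary is hypothetical, and it assumes the log lines have exactly the wording quoted above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log summarizer for illustration; assumes the line format quoted above.
class GcPauseSummary {
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: ([0-9.]+) seconds");
    private static final Pattern RUNNING = Pattern.compile(
        "Application time: ([0-9.]+) seconds");

    // Returns stopped time divided by (stopped + running) time, or 0 if nothing matched.
    static double stoppedFraction(List<String> lines) {
        double stopped = 0.0;
        double running = 0.0;
        for (String line : lines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                stopped += Double.parseDouble(m.group(1));
                continue;
            }
            m = RUNNING.matcher(line);
            if (m.find()) {
                running += Double.parseDouble(m.group(1));
            }
        }
        double total = stopped + running;
        return total == 0.0 ? 0.0 : stopped / total;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "Total time for which application threads were stopped: 0.5599050 seconds",
            "Application time: 0.0056590 seconds");
        System.out.printf("stopped fraction: %.4f%n", stoppedFraction(sample));
    }
}
```

Run over a whole log rather than a single pair of lines, a value close to 1 would suggest the JVM is spending almost all of its time paused.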

Any suggestions for the GC flags? If there's something better I can experiment, 
and update the wiki if we discover something interesting.

http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting
