Build failed in Hudson: ZooKeeper-trunk #915
See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/915/ -- [...truncated 510 lines...] [javadoc] Javadoc execution [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/AsyncCallback.java... [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/CreateMode.java... [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/KeeperException.java... [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/ServerAdminClient.java... [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/Watcher.java... [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/ZooDefs.java... [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/ZooKeeper.java... [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/ZooKeeperMain.java... [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/server/LogFormatter.java... [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/server/PurgeTxnLog.java... [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServerMain.java... [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeerMain.java... 
[javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/server/upgrade/UpgradeMain.java... [javadoc] Loading source files for package org.apache.zookeeper.data... [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.6.0_11 [javadoc] Building tree for all the packages and classes... [javadoc] Building index for all the packages and classes... [javadoc] Building index for all classes... javadoc-jar: [jar] Building jar: https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/build/zookeeper-3.4.0-javadoc.jar ivy-retrieve-jdiff: [mkdir] Created dir: https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/build/jdiff/lib [ivy:retrieve] :: resolving dependencies :: org.apache.zookeeper#zookeeper;3.4.0 [ivy:retrieve] confs: [jdiff] [ivy:retrieve] found jdiff#jdiff;1.0.9 in default [ivy:retrieve] found xerces#xerces;1.4.4 in default [ivy:retrieve] :: resolution report :: resolve 122ms :: artifacts dl 4ms - | |modules|| artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | jdiff | 2 | 0 | 0 | 0 || 2 | 0 | - [ivy:retrieve] :: retrieving :: org.apache.zookeeper#zookeeper [ivy:retrieve] confs: [jdiff] [ivy:retrieve] 2 artifacts copied, 0 already retrieved (1896kB/11ms) write-null: api-xml: [javadoc] Generating Javadoc [javadoc] Javadoc execution [javadoc] Loading source files for package org.apache.zookeeper... [javadoc] Constructing Javadoc information... [javadoc] JDiff: doclet started ... [javadoc] JDiff: writing the API to file 'https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/lib/jdiff/zookeeper_3.4.0.xml'... [javadoc] https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/src/java/main/org/apache/zookeeper/ClientWatchManager.java:38: warning - @return tag has no arguments. [javadoc] JDiff: finished (took 0s, not including scanning the source files). 
[javadoc] 1 warning api-report: [mkdir] Created dir: https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/build/docs/jdiff [javadoc] Generating Javadoc [javadoc] Javadoc execution [javadoc] Loading source file https://hudson.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/build/jdiff/lib/Null.java... [javadoc] Loading source files for package org.apache.jute.compiler... [javadoc] Loading source files for package org.apache.jute.compiler.generated... [javadoc] Loading source files for package org.apache.zookeeper... [javadoc] Loading source files for package org.apache.zookeeper.common... [javadoc] Loading
[jira] Updated: (ZOOKEEPER-855) clientPortBindAddress should be clientPortAddress
[ https://issues.apache.org/jira/browse/ZOOKEEPER-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jared Cantwell updated ZOOKEEPER-855:
Priority: Trivial (was: Major)

clientPortBindAddress should be clientPortAddress
Key: ZOOKEEPER-855
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-855
Project: Zookeeper
Issue Type: Bug
Components: documentation
Affects Versions: 3.3.0, 3.3.1
Reporter: Jared Cantwell
Priority: Trivial

The server documentation states that the configuration parameter for binding to a specific IP address is clientPortBindAddress. The code expects the parameter to be named clientPortAddress. The documentation for the 3.3.x versions needs to be changed to reflect the correct parameter. This parameter was added in ZOOKEEPER-635.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (ZOOKEEPER-855) clientPortBindAddress should be clientPortAddress
clientPortBindAddress should be clientPortAddress
Key: ZOOKEEPER-855
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-855
Project: Zookeeper
Issue Type: Bug
Components: documentation
Affects Versions: 3.3.1, 3.3.0
Reporter: Jared Cantwell

The server documentation states that the configuration parameter for binding to a specific IP address is clientPortBindAddress. The code expects the parameter to be named clientPortAddress. The documentation for the 3.3.x versions needs to be changed to reflect the correct parameter. This parameter was added in ZOOKEEPER-635.
[jira] Commented: (ZOOKEEPER-366) Session timeout detection can go wrong if the leader system time changes
[ https://issues.apache.org/jira/browse/ZOOKEEPER-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902937#action_12902937 ]

Patrick Hunt commented on ZOOKEEPER-366:

One thing we should do: add sufficient logging (warn level or higher, I would say) to ensure that if this does happen in production we have a record of it in the log.

Session timeout detection can go wrong if the leader system time changes
Key: ZOOKEEPER-366
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-366
Project: Zookeeper
Issue Type: Bug
Reporter: Benjamin Reed
Assignee: Benjamin Reed
Attachments: ZOOKEEPER-366.patch

The leader tracks session expirations by calculating when each session will time out and then periodically checking, based on the current time, which sessions need to be timed out. This works well as long as the leader's clock progresses at a steady pace. The problem comes when there are big changes in the clock (on the order of a session timeout), made by NTP for example. If time gets adjusted forward, all the sessions could time out immediately. If time goes backward, sessions that should time out may take a lot longer to actually expire. This is really just a leader issue. The easiest way to deal with it is to have the leader relinquish leadership if it detects a big jump forward in time. When a new leader gets elected, it will recalculate the timeouts of active sessions.
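The detection step described in the issue above can be sketched in Java. This is only an illustration under stated assumptions, not the attached ZOOKEEPER-366.patch: the class name ClockJumpDetector and its methods are hypothetical, and the sketch assumes that comparing wall-clock time (System.currentTimeMillis) against the JVM's elapsed-time source (System.nanoTime) is an acceptable way to notice a clock step. It could also feed the warn-level logging suggested in the comment.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not ZooKeeper code): flags large wall-clock jumps. */
final class ClockJumpDetector {
    private final long thresholdMs;
    private long lastWallMs;
    private long lastMonoNs;

    ClockJumpDetector(long thresholdMs, long wallMs, long monoNs) {
        this.thresholdMs = thresholdMs;
        this.lastWallMs = wallMs;
        this.lastMonoNs = monoNs;
    }

    /** Convenience factory that samples the real clocks. */
    static ClockJumpDetector startingNow(long thresholdMs) {
        return new ClockJumpDetector(
                thresholdMs, System.currentTimeMillis(), System.nanoTime());
    }

    /**
     * Returns how far the wall clock moved relative to elapsed time since the
     * previous sample, in milliseconds. Positive means the wall clock jumped
     * forward; negative means it went backward.
     */
    long sample(long wallMs, long monoNs) {
        long wallElapsedMs = wallMs - lastWallMs;
        long monoElapsedMs = TimeUnit.NANOSECONDS.toMillis(monoNs - lastMonoNs);
        lastWallMs = wallMs;
        lastMonoNs = monoNs;
        return wallElapsedMs - monoElapsedMs;
    }

    /** True when the drift is at least the configured threshold, either way. */
    boolean isBigJump(long driftMs) {
        return Math.abs(driftMs) >= thresholdMs;
    }
}
```

A caller would sample on each expiration check, log at warn level when isBigJump returns true, and then decide whether to give up leadership. A sensible threshold is presumably related to the session timeout, but that choice is an assumption here.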
Proposed: Leader communication should listen on specified IP, not wildcard address
Hello,

My project currently has the need to specify the local address that is used for leader communication (and not use the default of listening on all interfaces). This is similar to the clientPortAddress parameter that was recently added. After reviewing the code, we can't think of a reason why only the port would be used with the wildcard interface, when servers are already connecting specifically to that interface anyway. Is binding to the wildcard interface for leader communication intentional?

I believe the change would be straightforward -- one change for each leader port used. Note: this doesn't account for all leader election algorithms, only the default.

Index: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
===================================================================
--- src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java (revision 989805)
+++ src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java (working copy)
@@ -434,7 +434,7 @@
 ss = ServerSocketChannel.open();
 int port = self.quorumPeers.get(self.getId()).electionAddr.getPort();
 ss.socket().setReuseAddress(true);
-InetSocketAddress addr = new InetSocketAddress(port);
+InetSocketAddress addr = self.quorumPeers.get(self.getId()).electionAddr;
 LOG.info("My election bind port: " + addr.toString());
 setName(addr.toString());
 ss.socket().bind(addr);
Index: src/java/main/org/apache/zookeeper/server/quorum/Leader.java
===================================================================
--- src/java/main/org/apache/zookeeper/server/quorum/Leader.java (revision 989805)
+++ src/java/main/org/apache/zookeeper/server/quorum/Leader.java (working copy)
@@ -128,10 +128,11 @@
 Leader(QuorumPeer self,LeaderZooKeeperServer zk) throws IOException {
     this.self = self;
     try {
-        ss = new ServerSocket(self.getQuorumAddress().getPort());
+        ss = new ServerSocket();
+        ss.bind(self.getQuorumAddress());
     } catch (BindException e) {
-        LOG.error("Couldn't bind to port "
-                + self.getQuorumAddress().getPort(), e);
+        LOG.error("Couldn't bind to address "
+                + self.getQuorumAddress().getAddress() + ":" + self.getQuorumAddress().getPort(), e);
         throw e;
     }
     this.zk=zk;

Does this seem like a reasonable change? Thoughts?

~Jared
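The difference the patch relies on can be shown with plain java.net calls, outside ZooKeeper. The sketch below is illustrative only: the class name BindAddressDemo and its two helper methods are hypothetical, and the loopback address stands in for whatever address a server entry would name. An InetSocketAddress built from a port alone carries the wildcard address, while one built from an address and a port restricts the listener to that interface.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical demo class (not ZooKeeper code). */
final class BindAddressDemo {

    /** True if the socket address would listen on all local interfaces. */
    static boolean isWildcard(InetSocketAddress addr) {
        return addr.getAddress() != null && addr.getAddress().isAnyLocalAddress();
    }

    /** Creates an unbound server socket, then binds it to exactly addr. */
    static ServerSocket bindTo(InetSocketAddress addr) throws IOException {
        ServerSocket ss = new ServerSocket();
        ss.setReuseAddress(true);
        ss.bind(addr);
        return ss;
    }

    public static void main(String[] args) throws IOException {
        // Port only: the wildcard address, as in the code being replaced.
        InetSocketAddress portOnly = new InetSocketAddress(0);
        // Address plus port: a single interface, as in the proposed change.
        InetSocketAddress specific =
                new InetSocketAddress(InetAddress.getLoopbackAddress(), 0);

        System.out.println("port only is wildcard: " + isWildcard(portOnly));
        System.out.println("specific is wildcard:  " + isWildcard(specific));

        // Port 0 asks the OS for any free port, so this does not collide.
        try (ServerSocket ss = bindTo(specific)) {
            System.out.println("bound to " + ss.getLocalSocketAddress());
        }
    }
}
```

With the specific binding, connections arriving on other interfaces are refused, which is the behavior the proposal asks for on the quorum and election ports.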
[jira] Commented: (ZOOKEEPER-822) Leader election taking a long time to complete
[ https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902988#action_12902988 ]

Vishal K commented on ZOOKEEPER-822:

The fix for problems 1 and 2 above eliminates the bug. I will have a patch out soon.

Leader election taking a long time to complete
Key: ZOOKEEPER-822
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
Project: Zookeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.3.0
Reporter: Vishal K
Priority: Blocker
Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz

Created a 3 node cluster.
1. Fail the ZK leader.
2. Let leader election finish. Restart the leader and let it join the cluster.
3. Repeat.

After a few rounds, leader election takes anywhere from 25 to 60 seconds to finish. Note: we didn't have any ZK clients and no new znodes were created.

zoo.cfg is shown below:

#Mon Jul 19 12:15:10 UTC 2010
server.1=192.168.4.12\:2888\:3888
server.0=192.168.4.11\:2888\:3888
clientPort=2181
dataDir=/var/zookeeper
syncLimit=2
server.2=192.168.4.13\:2888\:3888
initLimit=5
tickTime=2000

I have attached logs from two nodes that took a long time to form the cluster after failing the leader. The leader was down anyway, so logs from that node shouldn't matter. Look for START HERE; the logs after that point are the ones of interest.
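The backslash-escaped colons in the server entries above are ordinary Java properties-file escapes, which suggests the file was written by java.util.Properties (that is an inference, not something the report states). The sketch below shows how such entries read back; the class name ZooCfgServers and its methods are hypothetical, and only the standard Properties API is used. In each server.N value the first port is the one followers use to connect to the leader and the second is used for leader election.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper (not ZooKeeper code) for reading server.N entries. */
final class ZooCfgServers {

    /** Loads zoo.cfg-style text; the escape "\:" is read back as ":". */
    static Properties parse(String cfgText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(cfgText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns {host, quorumPort, electionPort}, or null if the id is absent. */
    static String[] serverParts(Properties props, int id) {
        String value = props.getProperty("server." + id);
        return value == null ? null : value.split(":");
    }

    public static void main(String[] args) {
        // Java string literals need "\\" to produce one backslash in the text.
        String cfg = "#Mon Jul 19 12:15:10 UTC 2010\n"
                + "server.1=192.168.4.12\\:2888\\:3888\n"
                + "clientPort=2181\n"
                + "tickTime=2000\n";
        String[] parts = serverParts(parse(cfg), 1);
        System.out.println(String.join(" | ", parts));
    }
}
```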
[jira] Created: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
Connection imbalance leads to overloaded ZK instances
Key: ZOOKEEPER-856
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856
Project: Zookeeper
Issue Type: Bug
Reporter: Travis Crawford

We've experienced a number of issues lately where ruok requests would take upwards of 10 seconds to return, and ZooKeeper instances were extremely sluggish. The sluggish instance requires a restart to make it responsive again.

I believe the issue is that connections are very imbalanced, leading to certain instances having many thousands of connections while other instances are largely idle.

A potential solution is periodically disconnecting and reconnecting to balance connections over time; this seems fine because sessions should not be affected, and therefore ephemeral nodes and watches should not be affected.
[jira] Updated: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Travis Crawford updated ZOOKEEPER-856:
Attachment: zk_open_file_descriptor_count_total.gif, zk_open_file_descriptor_count_members.gif

Attached are two graphs showing:
- Total ZooKeeper connections to a 3 node cluster
- Connections per member in the cluster

In the totals graph, notice how it's largely unchanged over time. This period represents a steady-state period of usage. In the members graph, notice how the number of connections is significantly different between machines. This cluster allows the leader to service reads, so that's not something to factor in when interpreting the number of connections.

These graphs look very similar to an issue I had with another service (scribe), and we solved the issue by disconnecting every N +/- K messages. We tried getting fancy by publishing load metrics and using a smart selection algorithm. It turns out that in practice the periodic disconnect/reconnect was easier to implement and worked better, so I'm tossing that idea out as a potential solution here.
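The "disconnect every N +/- K messages" idea can be sketched as a small counter with a randomized threshold. Everything here is hypothetical: the class name ReconnectPolicy, the choice of a uniform distribution over [N - K, N + K], and the assumption that the policy lives in the client are illustrative, and no patch for this issue existed at the time of this thread. The jitter is what keeps clients from all reconnecting at the same moment.

```java
import java.util.Random;

/** Hypothetical policy (not ZooKeeper code): reconnect every N +/- K messages. */
final class ReconnectPolicy {
    private final int n;
    private final int k;
    private final Random rng;
    private int remaining;

    ReconnectPolicy(int n, int k, Random rng) {
        if (k < 0 || k >= n) {
            throw new IllegalArgumentException("need 0 <= k < n");
        }
        this.n = n;
        this.k = k;
        this.rng = rng;
        this.remaining = nextThreshold();
    }

    /** Picks uniformly from the inclusive range [n - k, n + k]. */
    private int nextThreshold() {
        return n - k + rng.nextInt(2 * k + 1);
    }

    /**
     * Call once per message sent. Returns true when the caller should drop
     * the connection and reconnect; a fresh threshold is drawn at that point.
     */
    boolean onMessage() {
        remaining--;
        if (remaining > 0) {
            return false;
        }
        remaining = nextThreshold();
        return true;
    }
}
```

On a true result the caller would close the connection and connect to another server from its list, relying on the session surviving the move as the issue description expects.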
[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903017#action_12903017 ]

Mahadev konar commented on ZOOKEEPER-856:

Travis, we have had a lot of discussion on load balancing. I'd really like to try and see how the disconnect and reconnect works for load balancing. I am also with you that it might be a good enough solution for load balancing. I can upload a simple patch for this. Would you have some bandwidth for trying it out and reporting how well it works?
[jira] Updated: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-856:
Fix Version/s: 3.4.0
[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903065#action_12903065 ]

Travis Crawford commented on ZOOKEEPER-856:

@mahadev - I would love to help test a patch :) I'm currently using 3.3.1 + ZOOKEEPER-744 + ZOOKEEPER-790, applied in that order. If there's a knob for how frequently to disconnect/reconnect, I can try out different settings to see what a sensible default would be.

Do you think this should be a client or server setting? I'm thinking a server setting, because otherwise it's not possible to enforce the policy.
Putting copyright notices in ZK?
Hi All,

I work for VMware. My company tells me that any contribution that I make to ZK needs to have a line saying "Copyright [year of creation - year of last modification] VMware, Inc. All Rights Reserved." If portions of a file are modified, then I could identify only those portions of the file, if needed. No change to the license is required.

Needless to say, I am personally OK with making contributions without any such notices. What is ZK's policy on this? What would be a good solution in this case that satisfies both parties (ZK and my company's legal dept.)?

Thanks.
-Vishal
Re: Putting copyright notices in ZK?
Hi Vishal -

I'm afraid we don't allow author or copyright information in source files. Putting one's own copyright notice in a file is against Apache policy (and we are guided by the rules of the ASF). The SVN logs will keep track of ownership details, but it's not at all clear what copyright notices even mean once you have granted license to the ASF by virtue of submitting your patch. To avoid any confusion, we just disallow author-specific information in the source.

I hope you can find some compromise with your legal department - I'm pretty sure I know of other contributions from VMware employees to open source projects that don't have this restriction, so I'm hopeful that you can resolve this issue.

Best,
Henry

--
Henry Robinson
Software Engineer
Cloudera
415-994-6679
[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903146#action_12903146 ]

Patrick Hunt commented on ZOOKEEPER-856:

Have you monitored the JVMs for GC activity? Are you using CMS/incremental GC rather than the default GC setup? I'm all for adding balancing, but it would be good to rule out GC/swap/IO as an issue.
[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903170#action_12903170 ]

Travis Crawford commented on ZOOKEEPER-856:

@patrick - We're using these settings, which I believe are based on what's recommended in the troubleshooting guide:

-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-XX:+UseConcMarkSweepGC

Looking at the logs, I do see lots of GC activity. For example:

Total time for which application threads were stopped: 0.5599050 seconds
Application time: 0.0056590 seconds

I only see this on the hosts that became unresponsive after acquiring lots of connections. Any suggestions for the GC flags? If there's something better I can experiment, and update the wiki if we discover something interesting.

http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting
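To put a number on "lots of GC activity", the two kinds of log entries quoted above can be totaled and compared. This is a minimal sketch: the class name GcPauseScanner and its methods are hypothetical, and it assumes the log contains entries in exactly the wording shown in the comment. For the single pair quoted there, the stopped share of wall time comes out at roughly 99 percent.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not ZooKeeper code) for summarizing GC stop times. */
final class GcPauseScanner {
    private static final Pattern STOPPED = Pattern.compile(
            "Total time for which application threads were stopped: "
            + "([0-9]+\\.[0-9]+) seconds");
    private static final Pattern RUNNING = Pattern.compile(
            "Application time: ([0-9]+\\.[0-9]+) seconds");

    /** Adds up the captured seconds of every match of p in the log text. */
    private static double sum(Pattern p, String log) {
        double total = 0.0;
        Matcher m = p.matcher(log);
        while (m.find()) {
            total += Double.parseDouble(m.group(1));
        }
        return total;
    }

    /** Total seconds the application threads were stopped. */
    static double stoppedSeconds(String log) {
        return sum(STOPPED, log);
    }

    /** Stopped time divided by stopped plus running time; 0 if no entries. */
    static double stoppedFraction(String log) {
        double stopped = sum(STOPPED, log);
        double all = stopped + sum(RUNNING, log);
        return all == 0.0 ? 0.0 : stopped / all;
    }
}
```

Running this over the log of a sluggish host and an idle host would show whether the pauses track the connection count, which is the question raised in the preceding comment.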