[jira] Updated: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-804: --- Attachment: ZOOKEEPER-804-1.patch Updated patch to apply against latest trunk (hopefully branch too). c unit tests failing due to assertion cptr failed --- Key: ZOOKEEPER-804 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.4.0 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel) Reporter: Patrick Hunt Assignee: Michi Mutsuzaki Priority: Critical Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-804-1.patch, ZOOKEEPER-804-1.patch, ZOOKEEPER-804.patch I'm seeing this frequently: [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : elapsed 25687 : OK [exec] zktest-mt: /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: zookeeper_process: Assertion `cptr' failed. [exec] make: *** [run-check] Aborted [exec] Zookeeper_simpleSystem::testHangingClient Mahadev can you take a look? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-804: --- Resolution: Fixed Status: Resolved (was: Patch Available) +1 on the second patch. Tested and it seems fine, committed to trunk/branch33 both.
Re: Restarting discussion on ZooKeeper as a TLP
It's been a few days, any thoughts? Acceptable? I'd like to keep moving the ball forward. Thanks. Patrick On Sun, Oct 17, 2010 at 8:43 PM, 明珠刘 redis...@gmail.com wrote: +1 2010/10/14 Patrick Hunt ph...@apache.org In March of this year we discussed a request from the Apache Board, and Hadoop PMC, that we become a TLP rather than a subproject of Hadoop: Original discussion http://markmail.org/thread/42cobkpzlgotcbin I originally voted against this move, my primary concern being that we were not ready to move to tlp status given our small contributor base and limited contributor diversity. However I'd now like to revisit that discussion/decision. Since that time the team has been working hard to attract new contributors, and we've seen significant new contributions come in. There has also been feedback from board/pmc addressing many of these concerns (both on the list and in private). I am now less concerned about this issue and don't see it as a blocker for us to move to TLP status. A second concern was that by becoming a TLP the project would lose its connection with Hadoop, a big source of new users for us. I've been assured (and you can see with the other projects that have moved to tlp status; pig/hive/hbase/etc...) that this connection will be maintained. The Hadoop ZooKeeper tab for example will redirect to our new homepage. Other Apache members also pointed out to me that we are essentially operating as a TLP within the Hadoop PMC. Most of the other PMC members have little or no experience with ZooKeeper and this makes it difficult for them to monitor and advise us. By moving to TLP status we'll be able to govern ourselves and better set our direction. I believe we are ready to become a TLP. Please respond to this email with your thoughts and any issues. I will call a vote in a few days, once discussion settles. Regards, Patrick
Re: Fix release 3.3.2 planning, status.
https://issues.apache.org/jira/browse/ZOOKEEPER-794 I've done a bunch of testing on a number of machines, could someone take a look at this and +1 it? (or not) I'd like to get 3.3.2 moving. Regards, Patrick On Mon, Oct 18, 2010 at 9:19 AM, Patrick Hunt ph...@apache.org wrote: Hi Camille, unfortunately there's a blocker on 3.3.2 at the moment. http://bit.ly/asOSNl I just updated that patch to fix a build issue, hopefully one of the committers can review asap. Additionally there are a number of other patch-available patches attached to 3.3.2. I'd like to get those included given everyone's done a bunch of work on them. Again, committers need to review/commit/reject appropriately. What do ppl think, are we pretty close? Ben/Flavio/Henry/Mahadev please review some of the outstanding patches. Coordinate with me if you have issues/questions. Regards, Patrick On Mon, Oct 18, 2010 at 7:56 AM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: Hi guys, Any updates on the 3.3.2 release schedule? Trying to plan a release myself and wondering if I'll have to go to production with patched 3.3.1 or have time to QA with the 3.3.2 release. Thanks, Camille -Original Message- From: Patrick Hunt [mailto:ph...@apache.org] Sent: Thursday, September 23, 2010 12:45 PM To: zookeeper-dev@hadoop.apache.org Subject: Fix release 3.3.2 planning, status. Looking at the JIRA queue for 3.3.2 I see that there are two blockers, one is currently PA and the other is pretty close (it has a patch that should go in soon). There are a few JIRAs that already went into the branch that are important to get out there ASAP, esp ZOOKEEPER-846 (fix close issue found by hbase). One issue that's been slowing us down is hudson. The trunk was not passing its hudson validation, which was causing a slowdown in patch review. Mahadev and I fixed this. 
However with recent changes to the hudson hw/security environment the automated patch testing process is broken. Giri is working on this. In the meantime we'll have to test ourselves. Committers -- be sure to verify RAT, Findbugs, etc... in addition to verifying via test. I've set up an additional Hudson environment inside Cloudera that also verifies the trunk/branch. If issues are found I will report them (unfortunately I can't provide access to Cloudera's hudson env to non-cloudera employees at this time). I'd like to clear out the PAs asap and get a release candidate built. Anyone see a problem with shooting for an RC mid next week? Patrick
Re: Restarting discussion on ZooKeeper as a TLP
+1. On Wed, Oct 20, 2010 at 1:50 PM, Patrick Hunt ph...@apache.org wrote: It's been a few days, any thoughts? Acceptable? I'd like to keep moving the ball forward. Thanks. Patrick
Re: implications of netty on client connections
It may just be the case that we haven't tested sufficiently for this case (running out of fds) and we need to handle this better even in nio. Probably by cutting off op_connect in the selector. We should be able to do similar in netty. Btw, on unix one can access the open/max fd count using this: http://download.oracle.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html Secondly, are you running into a kernel limit or a zk limit? Take a look at this post describing 1 million concurrent connections to a box: http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3 specifically: -- During various tests with lots of connections, I ended up making some additional changes to my sysctl.conf. This was part trial-and-error, I don't really know enough about the internals to make especially informed decisions about which values to change. My policy was to wait for things to break, check /var/log/kern.log and see what mysterious error was reported, then increase stuff that sounded sensible after a spot of googling. Here are the settings in place during the above test: net.core.rmem_max = 33554432 net.core.wmem_max = 33554432 net.ipv4.tcp_rmem = 4096 16384 33554432 net.ipv4.tcp_wmem = 4096 16384 33554432 net.ipv4.tcp_mem = 786432 1048576 26777216 net.ipv4.tcp_max_tw_buckets = 36 net.core.netdev_max_backlog = 2500 vm.min_free_kbytes = 65536 vm.swappiness = 0 net.ipv4.ip_local_port_range = 1024 65535 -- I'm guessing that even with this, at some point you'll run into a limit in our server implementation. In particular I suspect that we may start to respond more slowly to pings, eventually getting so bad it would time out. We'd have to debug that and address (optimize). Patrick On Tue, Oct 19, 2010 at 7:16 AM, Fournier, Camille F. 
[Tech] camille.fourn...@gs.com wrote: Hi everyone, I'm curious what the implications of using netty are going to be for the case where a server gets close to its max available file descriptors. Right now our somewhat limited testing has shown that a ZK server performs fine up to the point when it runs out of available fds, at which point performance degrades sharply and new connections get into a somewhat bad state. Is netty going to enable the server to handle this situation more gracefully (or is there a way to do this already that I haven't found)? Limiting connections from the same client is not enough since we can potentially have far more clients wanting to connect than available fds for certain use cases we might consider. Thanks, Camille
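The MXBean Patrick links above can be queried directly from a running JVM. A minimal sketch (the bean is only a UnixOperatingSystemMXBean on unix-like JVMs; the 90% threshold used here is an arbitrary illustration, not anything ZooKeeper does):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdMonitor {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // On Linux/macOS JVMs the platform bean implements the Sun unix extension,
        // which exposes per-process open/max file descriptor counts.
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            long open = unix.getOpenFileDescriptorCount();
            long max = unix.getMaxFileDescriptorCount();
            System.out.println("fds open=" + open + " max=" + max);
            // Illustrative only: a server could stop accepting new connections
            // once fd usage approaches the limit, instead of failing hard.
            if (open > max * 0.9) {
                System.err.println("fd usage critical, deferring new connections");
            }
        } else {
            System.out.println("fd counts unavailable on this platform");
        }
    }
}
```

A server thread could poll these counts and, as suggested above, deregister OP_ACCEPT/OP_CONNECT interest before the kernel limit is actually hit.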
Re: Restarting discussion on ZooKeeper as a TLP
+1, thanks for following through with the protocol. On 20 October 2010 11:02, Vishal K vishalm...@gmail.com wrote: +1. -- Henry Robinson Software Engineer Cloudera 415-994-6679
[jira] Created: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
Spurious KeeperErrorCode = Session moved messages --- Key: ZOOKEEPER-907 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.1 Reporter: Vishal K The sync request does not set the session owner in Request. As a result, the leader keeps printing: 2010-07-01 10:55:36,733 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
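For context on why an unset owner produces this message: the server compares the owner recorded for a session against the owner carried on each request, and a request that never sets its owner cannot match. A simplified, hypothetical sketch of such a check (names like SessionOwnerCheck are illustrative, not the actual ZooKeeper source):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of a session-owner check of the kind described above.
// If a request type (e.g. sync) never sets its owner, the comparison fails
// and a "Session moved" error is raised spuriously.
public class SessionOwnerCheck {
    private final Map<Long, Object> sessionOwners = new HashMap<>();

    /** Record the owner (e.g. the connection/server that created the session). */
    public void register(long sessionId, Object owner) {
        sessionOwners.put(sessionId, owner);
    }

    /** Throws if the request's owner does not match the session's recorded owner. */
    public void checkSession(long sessionId, Object requestOwner) {
        Object owner = sessionOwners.get(sessionId);
        // A null requestOwner (owner never set on the request, as with sync here)
        // is treated as a mismatch, i.e. as if the session had moved.
        if (owner != null && !owner.equals(requestOwner)) {
            throw new IllegalStateException("KeeperErrorCode = Session moved");
        }
    }
}
```

The fix described in this issue is correspondingly small: set the owner on the sync request the same way other request types do, so the comparison passes.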
[jira] Updated: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-907: --- Priority: Blocker (was: Major) Fix Version/s: 3.4.0 3.3.2 sounds like a blocker to me. can this be easily addressed? I'd still like to get 3.3.2 out asap.
[jira] Updated: (ZOOKEEPER-820) update c unit tests to ensure zombie java server processes don't cause failure
[ https://issues.apache.org/jira/browse/ZOOKEEPER-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-820: --- Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) +1, committed to trunk/branch33. Thanks Michi! update c unit tests to ensure zombie java server processes don't cause failure Key: ZOOKEEPER-820 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-820 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.1 Reporter: Patrick Hunt Assignee: Michi Mutsuzaki Priority: Critical Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-820-1.patch, ZOOKEEPER-820.patch, ZOOKEEPER-820.patch, ZOOKEEPER-820.patch When the c unit tests are run sometimes the server doesn't shutdown at the end of the test, this causes subsequent tests (hudson esp) to fail. 1) we should try harder to make the server shut down at the end of the test, I suspect this is related to test failing/cleanup 2) before the tests are run we should see if the old server is still running and try to shut it down -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vishal K updated ZOOKEEPER-907: --- Attachment: ZOOKEEPER-907.patch attaching patch.
[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12923095#action_12923095 ] Patrick Hunt commented on ZOOKEEPER-907: Nice. Thanks! Can you include a test? Something like this really should have had a test already... great if you could add one. (also hate to mention but the path on the patch is messed up (long prefix), can you address that as well?)
[jira] Assigned: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt reassigned ZOOKEEPER-907: -- Assignee: Vishal K
[jira] Updated: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client
[ https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radu Marin updated ZOOKEEPER-906: - Attachment: (was: ZOOKEEPER.patch) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client --- Key: ZOOKEEPER-906 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-906 Project: Zookeeper Issue Type: Improvement Components: c client Affects Versions: 3.3.1 Reporter: Radu Marin Assignee: Radu Marin Fix For: 3.4.0 Original Estimate: 24h Remaining Estimate: 24h Currently, when a C client gets disconnected, it retries a couple of hosts (not all) with no delay between attempts, and then if it doesn't succeed it sleeps for 1/3 of the session expiration timeout before trying again. In the worst case the disconnect event can occur after 2/3 of the session expiration timeout has passed, and sleeping for a further 1/3 of the session timeout will cause a session loss most of the time. A better approach is to check all hosts, but with a random delay between reconnect attempts. The delay must also be independent of the session timeout, so that increasing the session timeout increases the number of available attempts. This improvement covers the case when the C client experiences network problems for a short period of time and is not able to reach any zookeeper hosts. The Java client already uses this logic and it works very well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
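The retry policy described above can be sketched as follows. This is an illustration of the logic, not the actual client code (the issue concerns the C client; names such as maxReconnectDelayMs and tryConnect are hypothetical):

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

// Sketch of the proposed policy: try every host in the list, sleeping a small
// random delay between attempts, with the delay bounded by a configurable
// maximum that is independent of the session timeout.
public class ReconnectPolicy {
    private final Random random = new Random();
    private final long maxReconnectDelayMs;

    public ReconnectPolicy(long maxReconnectDelayMs) {
        this.maxReconnectDelayMs = maxReconnectDelayMs;
    }

    /** Random delay in [0, maxReconnectDelayMs), independent of session timeout. */
    public long nextDelayMs() {
        return (long) (random.nextDouble() * maxReconnectDelayMs);
    }

    /** Returns true once any host accepts a connection; tries all hosts, not a couple. */
    public boolean connectAnyHost(List<String> hosts) throws InterruptedException {
        for (String host : hosts) {
            if (tryConnect(host)) {
                return true;
            }
            // Random, bounded delay between attempts, rather than sleeping
            // 1/3 of the session timeout only after giving up.
            TimeUnit.MILLISECONDS.sleep(nextDelayMs());
        }
        return false; // caller may loop again, delaying up to the configured max
    }

    private boolean tryConnect(String host) {
        return false; // placeholder; a real client would open a socket here
    }
}
```

The design point is that the delay bound is configured separately (the patch adds zoo_set_max_reconnect_delay for this), so session survival no longer depends on how the session timeout happens to divide.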
[jira] Updated: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client
[ https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radu Marin updated ZOOKEEPER-906: - Attachment: ZOOKEEPER-906.patch + update last_connect_index when a new successful connection is established. + api for configuring max_reconnect_delay (zoo_set_max_reconnect_delay).
[jira] Updated: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client
[ https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radu Marin updated ZOOKEEPER-906: - Attachment: ZOOKEEPER-906.patch called svn diff from trunk
[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12923107#action_12923107 ] Vishal K commented on ZOOKEEPER-907: sure, I will write a test. What do you think is the effect of this bug? In PrepRequestProcessor.pRequest(), the leader will not pass the sync request to nextProcessor. Does that mean that the sync did not succeed?
[jira] Commented: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client
[ https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12923109#action_12923109 ] Radu Marin commented on ZOOKEEPER-906: -- @Jared Cantwell: Yes, you got it right. The last_connect_index is intended to detect a complete unsuccessful loop through all hosts so the client can delay more (for the max_reconnect_delay period). It also represents the index of the last successful host connection, and indeed it was not updated on connection establishment. I have fixed that in the new patch. Huge thanks for reviewing and pointing that out!
[jira] Work stopped: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client
[ https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on ZOOKEEPER-906 stopped by Radu Marin.
[jira] Commented: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12923116#action_12923116 ] Hudson commented on ZOOKEEPER-804: -- Integrated in ZooKeeper-trunk #973 (See [https://hudson.apache.org/hudson/job/ZooKeeper-trunk/973/]) ZOOKEEPER-804. c unit tests failing due to assertion cptr failed (second patch)
RE: implications of netty on client connections
Thanks Patrick, I'll look and see if I can figure out a clean change for this. The problem showed up at the kernel limit on the max number of open fds for the process (not a zk limit). FWIW, we tested with a process fd limit of 16K, and ZK performed reasonably well until the fd limit was reached, at which point it choked. There was a throughput degradation, but mostly going from 0 to 4000 connections; 4000 to 16000 was mostly flat until the sharp drop. For our use case it is fine to have a bit of performance loss with huge numbers of connections, so long as we can handle the choke, which for initial rollout I'm planning on just monitoring for. C -Original Message- From: Patrick Hunt [mailto:ph...@apache.org] Sent: Wednesday, October 20, 2010 2:06 PM To: zookeeper-dev@hadoop.apache.org Subject: Re: implications of netty on client connections It may just be the case that we haven't tested sufficiently for this case (running out of fds) and we need to handle this better even in nio, probably by cutting off op_connect in the selector. We should be able to do something similar in netty. Btw, on unix one can access the open/max fd count using this: http://download.oracle.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html Secondly, are you running into a kernel limit or a zk limit? Take a look at this post describing 1 million concurrent connections to a box: http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3 specifically: -- During various tests with lots of connections, I ended up making some additional changes to my sysctl.conf. This was part trial-and-error; I don't really know enough about the internals to make especially informed decisions about which values to change. My policy was to wait for things to break, check /var/log/kern.log and see what mysterious error was reported, then increase stuff that sounded sensible after a spot of googling. 
Here are the settings in place during the above test:

net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 16384 33554432
net.ipv4.tcp_wmem = 4096 16384 33554432
net.ipv4.tcp_mem = 786432 1048576 26777216
net.ipv4.tcp_max_tw_buckets = 36
net.core.netdev_max_backlog = 2500
vm.min_free_kbytes = 65536
vm.swappiness = 0
net.ipv4.ip_local_port_range = 1024 65535

-- I'm guessing that even with this, at some point you'll run into a limit in our server implementation. In particular I suspect that we may start to respond more slowly to pings, eventually getting so bad it would time out. We'd have to debug that and address (optimize). Patrick On Tue, Oct 19, 2010 at 7:16 AM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: Hi everyone, I'm curious what the implications of using netty are going to be for the case where a server gets close to its max available file descriptors. Right now our somewhat limited testing has shown that a ZK server performs fine up to the point when it runs out of available fds, at which point performance degrades sharply and new connections get into a somewhat bad state. Is netty going to enable the server to handle this situation more gracefully (or is there a way to do this already that I haven't found)? Limiting connections from the same client is not enough, since we can potentially have far more clients wanting to connect than available fds for certain use cases we might consider. Thanks, Camille
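The UnixOperatingSystemMXBean that Patrick links above can be used to watch the process's fd usage and alert before the server hits the choke point Camille describes. A minimal sketch (the 80% warning threshold is an arbitrary choice for illustration; the MXBean is only available on Sun/Oracle/OpenJDK JVMs on Unix-like systems):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Sketch: read the process's open/max file-descriptor counts via the
// com.sun.management.UnixOperatingSystemMXBean extension referenced above.
public class FdMonitor {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unixOs =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            long open = unixOs.getOpenFileDescriptorCount();
            long max = unixOs.getMaxFileDescriptorCount();
            System.out.println("open fds: " + open + " / max: " + max);
            // Alert well before the limit, before new connections start failing.
            if (open > max * 0.8) { // 80% threshold is an assumed policy
                System.err.println("WARNING: above 80% of the process fd limit");
            }
        } else {
            System.out.println("UnixOperatingSystemMXBean not available on this JVM/OS");
        }
    }
}
```

Polling this periodically (or exposing it via JMX) gives the "just monitoring for it" approach mentioned above a concrete signal to watch.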
Re: (ZOOKEEPER-905) enhance zkServer.sh for easier zookeeper automation-izing
Hi there. I submitted a patch/jira issue for zkServer.sh (ZOOKEEPER-905). I'm not sure what else to say about it that's not covered in the comments. p.s. Thanks for the great software - I'm enjoying building my applications around it. -- nicholas harteau n...@ikami.com
[jira] Updated: (ZOOKEEPER-906) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client
[ https://issues.apache.org/jira/browse/ZOOKEEPER-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radu Marin updated ZOOKEEPER-906: - Attachment: (was: ZOOKEEPER-906.patch) Improve C client connection reliability by making it sleep between reconnect attempts as in Java Client --- Key: ZOOKEEPER-906 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-906 Project: Zookeeper Issue Type: Improvement Components: c client Affects Versions: 3.3.1 Reporter: Radu Marin Assignee: Radu Marin Fix For: 3.4.0 Attachments: ZOOKEEPER-906.patch Original Estimate: 24h Remaining Estimate: 24h Currently, when a C client gets disconnected, it retries a couple of hosts (not all) with no delay between attempts, and if that doesn't succeed it sleeps for 1/3 of the session expiration timeout before trying again. In the worst case the disconnect event can occur after 2/3 of the session expiration timeout has passed, and sleeping for another 1/3 of the session timeout will cause session loss most of the time. A better approach is to try all hosts, but with a random delay between reconnect attempts. The delay must also be independent of the session timeout, so that increasing the session timeout also increases the number of available attempts. This improvement covers the case when the C client experiences network problems for a short period of time and is not able to reach any ZooKeeper hosts. The Java client already uses this logic and it works very well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923200#action_12923200 ] Benjamin Reed commented on ZOOKEEPER-907: - yes, this will fail the sync. it will not get passed through the pipeline. it will give you a partial sync though :) Spurious KeeperErrorCode = Session moved messages --- Key: ZOOKEEPER-907 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.1 Reporter: Vishal K Assignee: Vishal K Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-907.patch The sync request does not set the session owner in Request. As a result, the leader keeps printing: 2010-07-01 10:55:36,733 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-794) Callbacks are not invoked when the client is closed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923251#action_12923251 ] Alexis Midon commented on ZOOKEEPER-794: no pb, thanks for your close review and testing. Callbacks are not invoked when the client is closed --- Key: ZOOKEEPER-794 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-794 Project: Zookeeper Issue Type: Bug Components: java client Affects Versions: 3.3.1 Reporter: Alexis Midon Assignee: Alexis Midon Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-794.patch.txt, ZOOKEEPER-794.txt, ZOOKEEPER-794_2.patch, ZOOKEEPER-794_3.patch, ZOOKEEPER-794_4.patch.txt, ZOOKEEPER-794_5.patch.txt, ZOOKEEPER-794_5_br33.patch I noticed that ZooKeeper behaves differently when calling synchronous or asynchronous actions on a closed ZooKeeper client: a synchronous call will throw a session expired exception, while an asynchronous call will do nothing. No exception, no callback invocation. Even if the EventThread receives the Packet with the session expired error code, the packet is never processed, since the thread has been killed by the eventOfDeath. So the callback is not invoked. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-794) Callbacks are not invoked when the client is closed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-794: --- Resolution: Fixed Status: Resolved (was: Patch Available) committed to trunk/branch33, thanks Alexis! Callbacks are not invoked when the client is closed --- Key: ZOOKEEPER-794 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-794 Project: Zookeeper Issue Type: Bug Components: java client Affects Versions: 3.3.1 Reporter: Alexis Midon Assignee: Alexis Midon Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-794.patch.txt, ZOOKEEPER-794.txt, ZOOKEEPER-794_2.patch, ZOOKEEPER-794_3.patch, ZOOKEEPER-794_4.patch.txt, ZOOKEEPER-794_5.patch.txt, ZOOKEEPER-794_5_br33.patch I noticed that ZooKeeper behaves differently when calling synchronous or asynchronous actions on a closed ZooKeeper client: a synchronous call will throw a session expired exception, while an asynchronous call will do nothing. No exception, no callback invocation. Even if the EventThread receives the Packet with the session expired error code, the packet is never processed, since the thread has been killed by the eventOfDeath. So the callback is not invoked. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
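The ZOOKEEPER-794 failure mode is an instance of a general bug pattern: a dispatcher thread that exits on a poison-pill "event of death" without draining its queue silently drops the callbacks still queued behind it. The sketch below is a generic illustration of that pattern and its fix, not ZooKeeper's actual ClientCnxn/EventThread code; all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Generic sketch of a callback dispatcher with a poison-pill shutdown.
// With drainOnDeath=false it exhibits the bug described above (queued
// callbacks after the pill are never invoked); with true, it drains them.
public class EventThreadSketch {
    static final Runnable EVENT_OF_DEATH = () -> {};

    // Processes queued events; returns how many callbacks actually ran.
    public static int process(Queue<Runnable> queue, boolean drainOnDeath) {
        int invoked = 0;
        Runnable r;
        while ((r = queue.poll()) != null) {
            if (r == EVENT_OF_DEATH) {
                if (!drainOnDeath) {
                    return invoked; // bug: pending callbacks silently dropped
                }
                continue; // fix: drain remaining callbacks before exiting
            }
            r.run();
            invoked++;
        }
        return invoked;
    }
}
```

In the real issue, the session-expired packet for an asynchronous call arrives after close() has already queued the event of death, so its callback lands in the dropped portion of the queue; the fix on the branch ensures such callbacks still fire.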