[jira] Commented: (ZOOKEEPER-272) getChildren can fail for large numbers of children
[ https://issues.apache.org/jira/browse/ZOOKEEPER-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668371#action_12668371 ]

Nitay Joffe commented on ZOOKEEPER-272:
---------------------------------------

What about having getChildren() return an Iterator<String> instead of a List<String>? Such an abstraction lets us keep the details of walking the child list inside ZooKeeper instead of in client code. With an Iterator we can make the next() call handle the multiple offset getChildren() calls under the hood, rather than put this burden on the client. Furthermore, if we have, say, some compressed protocol, a lot of space/CPU can be saved by decompressing/storing only the current child being worked on.

getChildren can fail for large numbers of children
--------------------------------------------------

Key: ZOOKEEPER-272
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-272
Project: Zookeeper
Issue Type: Bug
Reporter: Joshua Tuberville
Assignee: Mahadev konar
Fix For: 3.1.0
Attachments: ZOOKEEPER-272.patch

ZooKeeper allows creation of an arbitrary number of children, yet if the String array of child names exceeds 4,194,304 bytes, getChildren will fail because ClientCnxn$SendThread.readLength() throws an exception on line 490. Mahadev Konar questioned the need for this byte limit. In any case, create children and get children should behave consistently.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
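A rough sketch of the Iterator idea, under stated assumptions: ZOOKEEPER-272 defines no paging protocol, so the private fetchNextPage() below merely stands in for one bounded, offset-based getChildren() round trip to the server. All names here are illustrative, not ZooKeeper API.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical client-side iterator that hides multiple bounded
// "getChildren(offset, max)" round trips behind Iterator<String>.
public class ChildIterator implements Iterator<String> {
    private final List<String> all;   // stand-in for the server-side child list
    private final int pageSize;       // stand-in for the per-response byte cap
    private List<String> page;        // children returned by the last "round trip"
    private int offset;               // offset of the next page to fetch
    private int pos;                  // position within the current page

    public ChildIterator(List<String> all, int pageSize) {
        this.all = all;
        this.pageSize = pageSize;
        fetchNextPage();
    }

    // Stand-in for one bounded getChildren() call against the server.
    private void fetchNextPage() {
        int end = Math.min(offset + pageSize, all.size());
        page = all.subList(offset, end);
        offset = end;
        pos = 0;
    }

    @Override
    public boolean hasNext() {
        if (pos < page.size()) return true;
        if (offset >= all.size()) return false;
        fetchNextPage();                 // lazily pull the next page on demand
        return pos < page.size();
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        return page.get(pos++);
    }
}
```

The client simply calls hasNext()/next(); how many underlying round trips occur, and how large each page is, stays an implementation detail.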
[jira] Commented: (ZOOKEEPER-272) getChildren can fail for large numbers of children
[ https://issues.apache.org/jira/browse/ZOOKEEPER-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668373#action_12668373 ]

Mahadev konar commented on ZOOKEEPER-272:
-----------------------------------------

Sorry, I should have mentioned this earlier. I planned to open a JIRA for exactly what you mentioned above. This fix is a quick fix for the upcoming release.
Build failed in Hudson: ZooKeeper-trunk #213
See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/213/changes

--
[...truncated 62500 lines...]
    [junit] 2009-01-29 11:43:48,322 - INFO [main:clientb...@300] - STOPPING server
    [junit] 2009-01-29 11:43:48,322 - INFO [main:nioserverc...@732] - closing session:0x11f22339740 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/127.0.0.1:33221 remote=/127.0.0.1:46688]
    [junit] 2009-01-29 11:43:48,323 - WARN [main-SendThread:clientcnxn$sendthr...@895] - Exception closing session 0x11f22339740 to sun.nio.ch.selectionkeyi...@7976c1
    [junit] java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
    [junit] 	at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:628)
    [junit] 	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:873)
    [junit] 2009-01-29 11:43:48,323 - INFO [NIOServerCxn.Factory:33221:nioservercnxn$fact...@171] - NIOServerCnxn factory exited run method
    [junit] 2009-01-29 11:43:48,323 - INFO [main:finalrequestproces...@265] - shutdown of request processor complete
    [junit] 2009-01-29 11:43:48,323 - INFO [SyncThread:0:syncrequestproces...@117] - SyncRequestProcessor exited!
    [junit] 2009-01-29 11:43:48,323 - INFO [ProcessThread:0:preprequestproces...@104] - PrepRequestProcessor exited loop!
    [junit] 2009-01-29 11:43:48,423 - INFO [main:clientb...@306] - STARTING server
    [junit] 2009-01-29 11:43:48,424 - INFO [main:zookeeperser...@157] - Created server
    [junit] 2009-01-29 11:43:48,425 - INFO [main:files...@70] - Reading snapshot http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/ws/trunk/build/test/tmp/test4650600035400512335.junit.dir/version-2/snapshot.5
    [junit] 2009-01-29 11:43:48,427 - INFO [main:filetxnsnap...@197] - Snapshotting: 6
    [junit] 2009-01-29 11:43:48,429 - INFO [NIOServerCxn.Factory:33221:nioserverc...@604] - Processing stat command from /127.0.0.1:46690
    [junit] 2009-01-29 11:43:48,430 - WARN [NIOServerCxn.Factory:33221:nioserverc...@402] - Exception causing close of session 0x0 due to java.io.IOException: Responded to info probe
    [junit] 2009-01-29 11:43:48,430 - INFO [NIOServerCxn.Factory:33221:nioserverc...@732] - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/127.0.0.1:33221 remote=/127.0.0.1:46690]
    [junit] 2009-01-29 11:43:50,123 - INFO [main-SendThread:clientcnxn$sendthr...@797] - Attempting connection to server /127.0.0.1:33221
    [junit] 2009-01-29 11:43:50,124 - INFO [main-SendThread:clientcnxn$sendthr...@712] - Priming connection to java.nio.channels.SocketChannel[connected local=/127.0.0.1:46691 remote=/127.0.0.1:33221]
    [junit] 2009-01-29 11:43:50,124 - INFO [main-SendThread:clientcnxn$sendthr...@865] - Server connection successful
    [junit] 2009-01-29 11:43:50,124 - INFO [NIOServerCxn.Factory:33221:nioserverc...@488] - Connected to /127.0.0.1:46691 lastZxid 6
    [junit] 2009-01-29 11:43:50,125 - INFO [NIOServerCxn.Factory:33221:nioserverc...@860] - Finished init of 0x11f22339740 valid:true
    [junit] 2009-01-29 11:43:50,125 - INFO [NIOServerCxn.Factory:33221:nioserverc...@516] - Renewing session 0x11f22339740
    [junit] 2009-01-29 11:43:51,000 - INFO [SessionTracker:sessiontrackeri...@142] - SessionTrackerImpl exited loop!
    [junit] 2009-01-29 11:43:51,132 - INFO [main:zookee...@434] - Closing session: 0x11f22339740
    [junit] 2009-01-29 11:43:51,132 - INFO [main:clientc...@996] - Closing ClientCnxn for session: 0x11f22339740
    [junit] 2009-01-29 11:43:51,133 - INFO [ProcessThread:0:preprequestproces...@344] - Processed session termination request for id: 0x11f22339740
    [junit] 2009-01-29 11:43:51,134 - INFO [SyncThread:0:nioserverc...@732] - closing session:0x11f22339740 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/127.0.0.1:33221 remote=/127.0.0.1:46691]
    [junit] 2009-01-29 11:43:51,134 - INFO [main-SendThread:clientcnxn$sendthr...@889] - Exception while closing send thread for session 0x11f22339740 : Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
    [junit] 2009-01-29 11:43:51,234 - INFO [main:clientc...@982] - Disconnecting ClientCnxn for session: 0x11f22339740
    [junit] 2009-01-29 11:43:51,234 - INFO [main:zookee...@442] - Session: 0x11f22339740 closed
    [junit] 2009-01-29 11:43:51,234 - INFO [main-EventThread:clientcnxn$eventthr...@449] - EventThread shut down
    [junit] 2009-01-29 11:43:51,235 - INFO [main:clientb...@312] - tearDown starting
    [junit] 2009-01-29 11:43:51,235 - INFO [NIOServerCxn.Factory:33221:nioservercnxn$fact...@171] - NIOServerCnxn factory exited run method
    [junit] 2009-01-29 11:43:51,235 - INFO [main:finalrequestproces...@265] - shutdown of request processor complete
    [junit] 2009-01-29 11:43:51,235 - INFO
[jira] Updated: (ZOOKEEPER-281) autoreconf fails for /zookeeper-3.0.1/src/c/
[ https://issues.apache.org/jira/browse/ZOOKEEPER-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim P. Dementiev updated ZOOKEEPER-281:
-----------------------------------------

    Attachment: autoreconf.log

Result of autoreconf -i -f -v -v -v > autoreconf.log 2>&1.

autoreconf fails for /zookeeper-3.0.1/src/c/
--------------------------------------------

Key: ZOOKEEPER-281
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-281
Project: Zookeeper
Issue Type: Bug
Components: c client
Affects Versions: 3.0.1
Environment: Linux dememax-laptop 2.6.27-gentoo-r8 #2 SMP Fri Jan 23 13:42:35 MSK 2009 i686 Intel(R) Core(TM)2 CPU T5600 @ 1.83GHz GenuineIntel GNU/Linux
autoconf (GNU Autoconf) 2.63
automake (GNU automake) 1.10.2
m4 (GNU M4) 1.4.11
aclocal (GNU automake) 1.10.2
ltmain.sh (GNU libtool) 1.5.26 (1.1220.2.493 2008/02/01 16:58:18)
basename (GNU coreutils) 6.10
gettext (GNU gettext-runtime) 0.17
GNU ld (GNU Binutils) 2.18
Reporter: Maxim P. Dementiev
Attachments: autoreconf.log

autoreconf -i -f -v
autoreconf-2.63: Entering directory `.'
autoreconf-2.63: configure.ac: not using Gettext
autoreconf-2.63: running: aclocal --force
configure.ac:21: error: AC_SUBST: `DX_FLAG_[]DX_CURRENT_FEATURE' is not a valid shell variable name
acinclude.m4:77: DX_REQUIRE_PROG is expanded from...
acinclude.m4:117: DX_ARG_ABLE is expanded from...
acinclude.m4:178: DX_INIT_DOXYGEN is expanded from...
configure.ac:21: the top level
autom4te-2.63: /usr/bin/m4 failed with exit status: 1
aclocal-1.10: autom4te failed with exit status: 1
autoreconf-2.63: aclocal failed with exit status: 1

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-276) Bookkeeper contribution
[ https://issues.apache.org/jira/browse/ZOOKEEPER-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Paiva Junqueira updated ZOOKEEPER-276:
---------------------------------------------

    Attachment: ZOOKEEPER-276.patch

Bookkeeper contribution
-----------------------

Key: ZOOKEEPER-276
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-276
Project: Zookeeper
Issue Type: New Feature
Reporter: Luca Telloli
Assignee: Flavio Paiva Junqueira
Fix For: 3.1.0
Attachments: ZOOKEEPER-276.patch, ZOOKEEPER-276.patch, ZOOKEEPER-276.patch, ZOOKEEPER-276.patch

BookKeeper is a system to reliably log streams of records. In BookKeeper, servers are bookies, log streams are ledgers, and each unit of a log (aka record) is a ledger entry. BookKeeper is designed to be reliable; bookies, the servers that store ledgers, can be byzantine, which means that some subset of the bookies can fail, corrupt data, or discard data, but as long as there are enough correctly behaving servers the service as a whole behaves correctly. The metadata for BookKeeper is stored in ZooKeeper.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-272) getChildren can fail for large numbers of children
[ https://issues.apache.org/jira/browse/ZOOKEEPER-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668532#action_12668532 ]

Joshua Tuberville commented on ZOOKEEPER-272:
---------------------------------------------

Consider returning Iterable<String> instead of Iterator<String> so you can take advantage of the Java 5 for-each idiom:

    for (String child : foo.getChildren()) {
        // Something done with child
    }

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Iterable.html
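A minimal illustration of the distinction being suggested: the for-each loop compiles only against an Iterable, not a bare Iterator. The Children wrapper below is a hypothetical result type (not a real ZooKeeper class) whose iterator() method is all the for-each idiom needs.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical getChildren() result type: exposing iterator() via
// Iterable<String> is what enables "for (String child : ...)".
public class Children implements Iterable<String> {
    private final List<String> names;

    public Children(List<String> names) {
        this.names = names;
    }

    @Override
    public Iterator<String> iterator() {
        // A paging implementation could return a lazy iterator here;
        // this sketch just delegates to an in-memory list.
        return names.iterator();
    }

    public static void main(String[] args) {
        Children children = new Children(Arrays.asList("node1", "node2"));
        for (String child : children) {  // legal because Children is Iterable
            System.out.println(child);
        }
    }
}
```

Had getChildren() returned Iterator<String> instead, callers would be stuck with an explicit while (it.hasNext()) loop.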
[jira] Commented: (ZOOKEEPER-281) autoreconf fails for /zookeeper-3.0.1/src/c/
[ https://issues.apache.org/jira/browse/ZOOKEEPER-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668546#action_12668546 ]

Patrick Hunt commented on ZOOKEEPER-281:
----------------------------------------

A web search seems to indicate that there are problems with doxygen's autoconf integration:
http://www.nabble.com/aclocal-1.9-and-aclocal-1.10-are-failing-while-aclocal-1.7-is-not-td18647659.html

Can you try editing acinclude.m4 line 79 to look like:

    if test $DX_FLAG_$[DX_CURRENT_FEATURE$$1] = 1; then

and retry?
[jira] Updated: (ZOOKEEPER-215) expand system test environment
[ https://issues.apache.org/jira/browse/ZOOKEEPER-215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-215:
------------------------------------

    Attachment: ZOOKEEPER-215.patch

Added a way to run the system unit test more easily and fixed a couple of bugs.

expand system test environment
------------------------------

Key: ZOOKEEPER-215
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-215
Project: Zookeeper
Issue Type: New Feature
Components: tests
Reporter: Patrick Hunt
Assignee: Benjamin Reed
Fix For: 3.1.0
Attachments: ZOOKEEPER-215.patch, ZOOKEEPER-215.patch, ZOOKEEPER-215.patch

Currently our system tests are lumped in with our unit tests. It would be great to have a system test environment where we could run larger-scale testing. Say you have 20 hosts, and you would like to test a serving ensemble with 7 servers and 100 clients running particular operations; it should be easy to test this scenario. Additionally, during the test it should be possible to simulate serving node failure, etc. I've had a brief conversation with Ben about this and he's going to take this JIRA.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection
[ https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-275:
------------------------------------

    Status: Open  (was: Patch Available)

The patch does not apply anymore. I tried creating a new patch, but it's a little tricky and I didn't want to mess up the patch. Can you regenerate the patch, Flavio?

Bug in FastLeaderElection
-------------------------

Key: ZOOKEEPER-275
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-275
Project: Zookeeper
Issue Type: Bug
Components: leaderElection
Affects Versions: 3.0.0, 3.0.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
Fix For: 3.1.0
Attachments: ZOOKEEPER-275.patch

I found an execution in which leader election does not make progress. Here is the problematic scenario:
- We have an ensemble of 3 servers, and we start only 2;
- We let them elect a leader, and then crash the one with the lowest id, say S_1 (call the other S_2);
- We restart the crashed server.

Upon restarting S_1, S_2 has its logical clock more advanced, and S_1 has its logical clock set to 1. Once S_1 receives a notification from S_2, it notices that it is in the wrong round and advances its logical clock to the same value as S_2's. The problem comes exactly at this point, because in the current code S_1 resets its vote to its initial vote (its own id and zxid). Since S_2 has already notified S_1, it won't do it again, and we are stuck. The patch I'm submitting fixes this problem by setting the vote of S_1 to the one received if it satisfies the total order predicate (received zxid is higher, or received zxid is the same and received id is higher).

Related to this problem, I noticed that by trying to avoid unnecessary notification duplicates, there could be scenarios in which a server fails before electing a leader and restarts before leader election succeeds. This could happen, for example, when there aren't enough servers available and one available server crashes and restarts.

I fixed this problem in the attached patch by allowing a server to send a new batch of notifications if at least one outgoing queue of pending notifications is empty. This is OK because we space out consecutive batches of notifications.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
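The total order predicate described above can be sketched as a pure function. This is an illustration of the rule as stated in the issue (method and parameter names are mine, not FastLeaderElection's actual code): prefer the received vote when its zxid is higher, or when the zxids tie and its server id is higher.

```java
// Sketch of the vote-comparison rule from ZOOKEEPER-275: a received vote
// supersedes the current one iff it is greater in (zxid, id) lexicographic order.
public class VoteOrder {
    public static boolean totalOrderPredicate(long newId, long newZxid,
                                              long curId, long curZxid) {
        // Higher zxid wins outright; on a zxid tie, the higher server id wins.
        return newZxid > curZxid || (newZxid == curZxid && newId > curId);
    }
}
```

Under this rule the restarted S_1, seeing S_2's vote with a higher (or equal-with-higher-id) zxid, adopts it instead of resetting to its own initial vote, so the election can converge even though S_2 sends no further notifications.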
[jira] Updated: (ZOOKEEPER-215) expand system test environment
[ https://issues.apache.org/jira/browse/ZOOKEEPER-215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-215:
------------------------------------

    Status: Patch Available  (was: Open)
[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection
[ https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Paiva Junqueira updated ZOOKEEPER-275:
---------------------------------------------

    Attachment: ZOOKEEPER-275.patch
[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection
[ https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-275:
------------------------------------

    Attachment: ZOOKEEPER-275.patch

Fixed some log.info calls to be log.debug, and fixed the duplicated log messages as well.
[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection
[ https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-275:
------------------------------------

    Attachment: ZOOKEEPER-275.patch

The last patches fail with compile-test since one of the tests uses an old API.
[jira] Updated: (ZOOKEEPER-272) getChildren can fail for large numbers of children
[ https://issues.apache.org/jira/browse/ZOOKEEPER-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-272:
------------------------------------

    Hadoop Flags: [Reviewed]
[jira] Commented: (ZOOKEEPER-272) getChildren can fail for large numbers of children
[ https://issues.apache.org/jira/browse/ZOOKEEPER-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668652#action_12668652 ]

Benjamin Reed commented on ZOOKEEPER-272:
-----------------------------------------

This is a good comment, but it should be tracked in a separate JIRA.
[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection
[ https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-275:
------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Flavio.
[jira] Updated: (ZOOKEEPER-269) connectionloss - add more documentation to detail
[ https://issues.apache.org/jira/browse/ZOOKEEPER-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-269:
------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Flavio and Pat.

connectionloss - add more documentation to detail
-------------------------------------------------

Key: ZOOKEEPER-269
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-269
Project: Zookeeper
Issue Type: Improvement
Components: documentation
Affects Versions: 3.0.0, 3.0.1
Reporter: Patrick Hunt
Assignee: Flavio Paiva Junqueira
Priority: Minor
Fix For: 3.1.0
Attachments: ZOOKEEPER-269.patch, ZOOKEEPER-269.patch, ZOOKEEPER-269.patch

Discussion with a user; this should be better documented:

There are basically 2 cases where you can see connectionloss:
1) you call an operation on a session that is no longer alive
2) you are disconnected from a server when there are pending async operations to that server (you made an async request which has not yet completed)

Patrick

Kevin Burton wrote:
Can this be thrown when using multiple servers as long as 1 of them is online? Trying to figure out if I should try some type of reconnect if a single machine fails instead of failing altogether.
Kevin

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-262) unnecessarily complex reentrant zookeeper_close() logic
[ https://issues.apache.org/jira/browse/ZOOKEEPER-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-262:

Fix Version/s: (was: 4.0.0) (was: 3.1.0)

Moving this jira to the next release; this is not a blocker for the 3.1 release.

unnecessarily complex reentrant zookeeper_close() logic
Key: ZOOKEEPER-262
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-262
Project: Zookeeper
Issue Type: Improvement
Components: c client
Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0, 4.0.0
Reporter: Chris Darroch
Priority: Minor
Fix For: 3.2.0
Attachments: zookeeper-close.patch

While working on a wrapper for the C API I puzzled over the problem of how to determine when the multi-threaded adaptor's IO and completion threads had exited. Looking at the code in api_epilog() and adaptor_finish() it seemed clear that any thread could be the last one out the door, and whichever was last would turn out the lights by calling zookeeper_close(). However, on further examination I found that in fact the close_requested flag guards entry to zookeeper_close() in api_epilog(), and close_requested can only be set non-zero within zookeeper_close(). Thus only the user's main thread can invoke zookeeper_close() and kick off the shutdown process.

When that happens, zookeeper_close() invokes adaptor_finish() and returns ZOK immediately afterward. Since adaptor_finish() is only called in this one context, all the code in that function that checks pthread_self() and calls pthread_detach() if the current thread is the IO or completion thread is redundant. Because it can only be called by the user's main thread, adaptor_finish() always signals and then waits to join with the IO and completion threads. After joining the two internal threads, adaptor_finish() calls api_epilog(), which might seem like a trivial final action.

However, this is actually where all the work gets done: in this one case, api_epilog() sees a non-zero close_requested flag value and invokes zookeeper_close(). Note that zookeeper_close() is already on the stack; this is a re-entrant invocation. This time around, zookeeper_close() skips the call to adaptor_finish() -- assuming the reference count has been properly decremented to zero! -- and performs the actual final cleanup steps, including deallocating the zh structure. Fortunately, none of the callers on the stack (api_epilog(), adaptor_finish(), and the first zookeeper_close()) touches zh after this.

This all works, and in particular the fact that I can be certain the IO and completion threads have exited after zookeeper_close() returns is great. So too is the fact that those threads can't invoke zookeeper_close() without my knowing about it. However, the actual mechanics of the shutdown seem unnecessarily complex. I'd worry a bit about a new maintainer looking at adaptor_finish() and reasonably concluding that it can be called by any thread, including the IO and completion ones. Or thinking that the zh handle can still be used after that innocuous-looking call to adaptor_finish() in zookeeper_close() -- the one that actually causes all the work to be done and the handle to be deallocated! I'll attach a patch which I think simplifies the code a bit, makes the shutdown mechanics a little clearer, and might prevent unintentional errors in the future.
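The safe shape Chris argues for -- only the caller's thread signals shutdown, joins the internal threads, and then performs cleanup exactly once, with no re-entrant second call to close -- can be sketched as follows. The sketch is in Java for brevity (the actual code is the C client's multi-threaded adaptor), and all names here are illustrative, not the real API:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative analogue of the C client's handle: close() is called only by
// the owning thread, and when it returns, the internal IO and completion
// threads are guaranteed to have exited.
class Handle {
    private final AtomicBoolean closeRequested = new AtomicBoolean(false);
    private final Thread ioThread;
    private final Thread completionThread;

    Handle() {
        ioThread = new Thread(() -> {
            while (!closeRequested.get()) { /* process socket IO */ Thread.onSpinWait(); }
        });
        completionThread = new Thread(() -> {
            while (!closeRequested.get()) { /* dispatch completions */ Thread.onSpinWait(); }
        });
        ioThread.start();
        completionThread.start();
    }

    // Analogue of a simplified zookeeper_close(): signal, join both workers,
    // then clean up once -- no api_epilog()/adaptor_finish() re-entry dance.
    void close() throws InterruptedException {
        if (!closeRequested.compareAndSet(false, true)) return; // already closing
        ioThread.join();
        completionThread.join();
        // final cleanup happens here, after both joins (the C code frees zh)
    }

    boolean workersAlive() {
        return ioThread.isAlive() || completionThread.isAlive();
    }
}
```

With this structure there is a single cleanup path, so a maintainer cannot mistake the join-and-free step for an innocuous helper call or assume the handle survives it.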
[jira] Commented: (ZOOKEEPER-276) Bookkeeper contribution
[ https://issues.apache.org/jira/browse/ZOOKEEPER-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668718#action_12668718 ] Mahadev konar commented on ZOOKEEPER-276:

Also, ant tar on the top-level dir gives the following error:

{noformat}
compile:
    [echo] contrib: bookkeeper
    [javac] Compiling 26 source files to /Users/mahadev/workspace/zookeeper-commit-trunk/build/contrib/bookkeeper/classes
    [javac] /Users/mahadev/workspace/zookeeper-commit-trunk/src/contrib/bookkeeper/src/java/org/apache/bookkeeper/bookie/Bookie.java:302: method does not override a method from its superclass
    [javac] @Override
    [javac] ^
    [javac] /Users/mahadev/workspace/zookeeper-commit-trunk/src/contrib/bookkeeper/src/java/org/apache/bookkeeper/proto/BookieClient.java:317: method does not override a method from its superclass
    [javac] @Override
    [javac] ^
    [javac] /Users/mahadev/workspace/zookeeper-commit-trunk/src/contrib/bookkeeper/src/java/org/apache/bookkeeper/proto/BookieServer.java:87: method does not override a method from its superclass
    [javac] @Override
    [javac] ^
    [javac] /Users/mahadev/workspace/zookeeper-commit-trunk/src/contrib/bookkeeper/src/java/org/apache/bookkeeper/proto/BookieServer.java:164: method does not override a method from its superclass
    [javac] @Override
    [javac] ^
    [javac] /Users/mahadev/workspace/zookeeper-commit-trunk/src/contrib/bookkeeper/src/java/org/apache/bookkeeper/util/LocalBookKeeper.java:143: method does not override a method from its superclass
    [javac] @Override
    [javac] ^
    [javac] 5 errors
{noformat}

Bookkeeper contribution
Key: ZOOKEEPER-276
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-276
Project: Zookeeper
Issue Type: New Feature
Reporter: Luca Telloli
Assignee: Flavio Paiva Junqueira
Fix For: 3.1.0
Attachments: ZOOKEEPER-276.patch, ZOOKEEPER-276.patch, ZOOKEEPER-276.patch, ZOOKEEPER-276.patch

BookKeeper is a system to reliably log streams of records. In BookKeeper, servers are bookies, log streams are ledgers, and each unit of a log (aka record) is a ledger entry. BookKeeper is designed to be reliable: the bookies that store ledgers can be byzantine, meaning some subset of the bookies can fail, corrupt data, or discard data, yet as long as there are enough correctly behaving servers the service as a whole behaves correctly. The metadata for BookKeeper is stored in ZooKeeper.
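The "method does not override a method from its superclass" errors above are the classic symptom of compiling with a Java 5 javac: under Java 5, @Override was not permitted on a method that implements an interface method (only on true superclass overrides), a restriction Java 6 relaxed. That the failing build used a 1.5 compiler is an inference from the error text, not something stated in the comment. A minimal reproduction:

```java
interface Greeter {
    String greet(String name);
}

class EnglishGreeter implements Greeter {
    // Compiles cleanly with javac 6 and later; with javac 1.5 this exact
    // annotation fails with "method does not override a method from its
    // superclass", because greet() implements an interface method rather
    // than overriding a superclass method.
    @Override
    public String greet(String name) {
        return "hello, " + name;
    }
}
```

The usual fixes are either to drop @Override from interface implementations or to require -source/-target 1.6 or later in the build.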
[jira] Issue Comment Edited: (ZOOKEEPER-276) Bookkeeper contribution
[ https://issues.apache.org/jira/browse/ZOOKEEPER-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668718#action_12668718 ] mahadev edited comment on ZOOKEEPER-276 at 1/29/09 6:50 PM:

Also, ant tar on the top-level dir fails with the following error:

{noformat}
compile:
    [echo] contrib: bookkeeper
    [javac] Compiling 26 source files to /Users/mahadev/workspace/zookeeper-commit-trunk/build/contrib/bookkeeper/classes
    [javac] /Users/mahadev/workspace/zookeeper-commit-trunk/src/contrib/bookkeeper/src/java/org/apache/bookkeeper/bookie/Bookie.java:302: method does not override a method from its superclass
    [javac] @Override
    [javac] ^
    [javac] /Users/mahadev/workspace/zookeeper-commit-trunk/src/contrib/bookkeeper/src/java/org/apache/bookkeeper/proto/BookieClient.java:317: method does not override a method from its superclass
    [javac] @Override
    [javac] ^
    [javac] /Users/mahadev/workspace/zookeeper-commit-trunk/src/contrib/bookkeeper/src/java/org/apache/bookkeeper/proto/BookieServer.java:87: method does not override a method from its superclass
    [javac] @Override
    [javac] ^
    [javac] /Users/mahadev/workspace/zookeeper-commit-trunk/src/contrib/bookkeeper/src/java/org/apache/bookkeeper/proto/BookieServer.java:164: method does not override a method from its superclass
    [javac] @Override
    [javac] ^
    [javac] /Users/mahadev/workspace/zookeeper-commit-trunk/src/contrib/bookkeeper/src/java/org/apache/bookkeeper/util/LocalBookKeeper.java:143: method does not override a method from its superclass
    [javac] @Override
    [javac] ^
    [javac] 5 errors
{noformat}

was (Author: mahadev): also ant tar on top level dir gives the following error - followed by the same {noformat} output as above.
[jira] Commented: (ZOOKEEPER-16) Need to do path validation
[ https://issues.apache.org/jira/browse/ZOOKEEPER-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668760#action_12668760 ] Benjamin Reed commented on ZOOKEEPER-16:

The patch looks good. There are two issues that I see:
1) we don't validate the path at the server. In some sense that is the most important place to do it. We need to put checks into PrepRequestProcessor and FinalRequestProcessor.
2) the character-range check may give false positives for unicode characters. If the server check is in place, we can probably just make the client do the obvious check (c > 0x00 && c <= 0x1f) and then let the server catch the rest.

Need to do path validation
Key: ZOOKEEPER-16
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-16
Project: Zookeeper
Issue Type: Bug
Components: c client, java client, server
Affects Versions: 3.0.0, 3.0.1
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Fix For: 3.1.0
Attachments: ZOOKEEPER-16.patch, ZOOKEEPER-16.patch, ZOOKEEPER-16.patch

Moved from SourceForge to Apache. http://sourceforge.net/tracker/index.php?func=detail&aid=1963141&group_id=209147&atid=1008544
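Ben's suggestion -- a cheap client-side check that rejects control characters and leaves the full rules to the server -- might look like the sketch below. The class name and the exact rule set are illustrative, not the committed ZOOKEEPER-16 patch:

```java
// Illustrative client-side path validation in the spirit of Ben's comment:
// catch obviously bad paths cheaply and let the server enforce the rest.
class PathValidator {
    static void validate(String path) {
        if (path == null || path.isEmpty()) {
            throw new IllegalArgumentException("Path cannot be null or empty");
        }
        if (path.charAt(0) != '/') {
            throw new IllegalArgumentException("Path must start with /: " + path);
        }
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            // The "obvious check" from the comment: control characters
            // 0x00-0x1f are never valid anywhere in a znode path.
            if (c <= 0x1f) {
                throw new IllegalArgumentException(
                    "Invalid character 0x" + Integer.toHexString(c) + " at index " + i);
            }
        }
    }
}
```

Pairing this with the same check in PrepRequestProcessor (point 1 above) means even hand-rolled clients that skip the client-side check cannot create malformed znodes.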
[jira] Updated: (ZOOKEEPER-16) Need to do path validation
[ https://issues.apache.org/jira/browse/ZOOKEEPER-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Reed updated ZOOKEEPER-16:

Status: Open (was: Patch Available)

Need to do path validation
Key: ZOOKEEPER-16
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-16
Project: Zookeeper
Issue Type: Bug
Components: c client, java client, server
Affects Versions: 3.0.1, 3.0.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Fix For: 3.1.0
Attachments: ZOOKEEPER-16.patch, ZOOKEEPER-16.patch, ZOOKEEPER-16.patch

Moved from SourceForge to Apache. http://sourceforge.net/tracker/index.php?func=detail&aid=1963141&group_id=209147&atid=1008544