[jira] [Commented] (ZOOKEEPER-2469) infinite loop in ZK re-login
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367064#comment-15367064 ] Mahadev konar commented on ZOOKEEPER-2469: -- [~sershe] done. > infinite loop in ZK re-login > Key: ZOOKEEPER-2469 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2469 > Project: ZooKeeper > Issue Type: Bug > Reporter: Sergey Shelukhin > Assignee: Sergey Shelukhin
> {noformat}
> int retry = 1;
> while (retry >= 0) {
>     try {
>         reLogin();
>         break;
>     } catch (LoginException le) {
>         if (retry > 0) {
>             --retry;
>             // sleep for 10 seconds.
>             try {
>                 Thread.sleep(10 * 1000);
>             } catch (InterruptedException e) {
>                 LOG.error("Interrupted during login retry after LoginException:", le);
>                 throw le;
>             }
>         } else {
>             LOG.error("Could not refresh TGT for principal: " + principal + ".", le);
>         }
>     }
> }
> {noformat}
> will retry forever. Should return like the one above
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
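The quoted loop never exits on repeated failure: once retry reaches 0, the else branch only logs, retry is never decremented below 0, and the while condition stays true. A minimal sketch of a bounded version follows; reLogin() and the logging here are simulated stand-ins, not the actual ZooKeeper Login code, and the 10-second sleep is shortened so the flow can be exercised.

```java
// Sketch of a bounded retry loop for TGT refresh, modeled on the snippet in
// this issue. reLogin() is a simulated stand-in that always fails, so the
// exit path can be demonstrated.
public class ReloginRetrySketch {
    static int attempts = 0;

    static void reLogin() throws Exception {
        attempts++;
        throw new Exception("simulated LoginException");
    }

    // Returns true if login succeeded. Note the 'return' in the else branch:
    // without it (as in the quoted code) the loop spins forever once retry
    // reaches 0, because retry is never decremented below 0.
    static boolean refreshWithRetry() throws Exception {
        int retry = 1;
        while (retry >= 0) {
            try {
                reLogin();
                return true;
            } catch (Exception le) {
                if (retry > 0) {
                    --retry;
                    Thread.sleep(10); // 10 seconds in the real code
                } else {
                    System.err.println("Could not refresh TGT: " + le);
                    return false; // the missing exit path
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        boolean ok = refreshWithRetry();
        System.out.println(ok + " after " + attempts + " attempts");
    }
}
```

With the always-failing simulated login, the loop stops after two attempts instead of spinning.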
[jira] [Updated] (ZOOKEEPER-2469) infinite loop in ZK re-login
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-2469: - Assignee: Sergey Shelukhin > infinite loop in ZK re-login > Key: ZOOKEEPER-2469 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2469 > Project: ZooKeeper > Issue Type: Bug > Reporter: Sergey Shelukhin > Assignee: Sergey Shelukhin
> {noformat}
> int retry = 1;
> while (retry >= 0) {
>     try {
>         reLogin();
>         break;
>     } catch (LoginException le) {
>         if (retry > 0) {
>             --retry;
>             // sleep for 10 seconds.
>             try {
>                 Thread.sleep(10 * 1000);
>             } catch (InterruptedException e) {
>                 LOG.error("Interrupted during login retry after LoginException:", le);
>                 throw le;
>             }
>         } else {
>             LOG.error("Could not refresh TGT for principal: " + principal + ".", le);
>         }
>     }
> }
> {noformat}
> will retry forever. Should return like the one above
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-1848) [WINDOWS] Java NIO socket channels does not work with Windows ipv6 on JDK6
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954961#comment-13954961 ] Mahadev konar commented on ZOOKEEPER-1848: -- +1 for the patch. Rerunning it through jenkins again. [WINDOWS] Java NIO socket channels does not work with Windows ipv6 on JDK6 -- Key: ZOOKEEPER-1848 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1848 Project: ZooKeeper Issue Type: Bug Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 3.5.0 Attachments: zookeeper-1848_v1.patch, zookeeper-1848_v2.patch ZK uses Java NIO to create ServerSockets from ServerSocketChannels. Under Windows, ipv4 and ipv6 are implemented independently, and Java seems unable to reuse the same socket channel for both ipv4 and ipv6 sockets. We are getting java.net.SocketException: Address family not supported by protocol family exceptions. When the ZK client resolves localhost, it gets both the v4 127.0.0.1 and the v6 ::1 address, but the socket channel cannot bind to both v4 and v6. The problem is reported as: http://bugs.sun.com/view_bug.do?bug_id=6230761 http://stackoverflow.com/questions/1357091/binding-an-ipv6-server-socket-on-windows Although the JDK bug is reported as resolved, I have tested with jdk1.6.0_33 without any success. JDK7, however, seems to have fixed this problem. See HBASE-6825 for reference. -- This message was sent by Atlassian JIRA (v6.2#6252)
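As an illustration of working around the address-family issue above, a program can resolve the IPv4 loopback literal explicitly before binding, so the channel never has to serve both families. This is a generic, hedged NIO sketch, not ZooKeeper's actual server code; port 0 simply asks the OS for an ephemeral port.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

// Minimal sketch: bind a ServerSocketChannel to an explicit IPv4 loopback
// address instead of the name "localhost", which on Windows/JDK6 may resolve
// to both 127.0.0.1 and ::1 and trigger "Address family not supported by
// protocol family".
public class BindSketch {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel ch = ServerSocketChannel.open();
        // Pin the address family by resolving the v4 literal up front.
        InetAddress v4 = InetAddress.getByName("127.0.0.1");
        ch.socket().bind(new InetSocketAddress(v4, 0)); // port 0 = ephemeral
        System.out.println("bound to " + ch.socket().getLocalSocketAddress());
        ch.close();
    }
}
```

On JDK6 one can also set -Djava.net.preferIPv4Stack=true to force the v4 stack process-wide; the explicit literal above is the narrower fix.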
[jira] [Commented] (ZOOKEEPER-1667) Watch event isn't handled correctly when a client reestablish to a server
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801581#comment-13801581 ] Mahadev konar commented on ZOOKEEPER-1667: -- +1 - the patch looks good to me. Watch event isn't handled correctly when a client reestablish to a server - Key: ZOOKEEPER-1667 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1667 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.3.6, 3.4.5 Reporter: Jacky007 Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.6, 3.5.0 Attachments: ZOOKEEPER-1667-b3.4.patch, ZOOKEEPER-1667-b3.4.patch, ZOOKEEPER-1667.patch, ZOOKEEPER-1667-r34.patch, ZOOKEEPER-1667-trunk.patch When a client reestablishes to a server, it will send the watches which have not been triggered. But the code in DataTree does not handle this correctly. It is obvious, we just do not notice it :) scenario: 1) Client a sets a data watch on /d, then disconnects; client b deletes /d and creates it again. When client a reestablishes to zk, it will receive a NodeCreated rather than a NodeDataChanged. 2) Client a sets an exists watch on /e (not existing), then disconnects; client b creates /e. When client a reestablishes to zk, it will receive a NodeDataChanged rather than a NodeCreated. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1646) mt c client tests fail on Ubuntu Raring
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798082#comment-13798082 ] Mahadev konar commented on ZOOKEEPER-1646: -- +1 for the patch. Nice catch Pat! mt c client tests fail on Ubuntu Raring --- Key: ZOOKEEPER-1646 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1646 Project: ZooKeeper Issue Type: Bug Components: c client Affects Versions: 3.4.5, 3.5.0 Environment: Ubuntu 13.04 (raring), glibc 2.17 Reporter: James Page Assignee: Patrick Hunt Priority: Blocker Fix For: 3.4.6, 3.5.0 Attachments: ZOOKEEPER-1646.patch Misc tests fail in the c client binding under the current Ubuntu development release: ./zktest-mt ZooKeeper server startedRunning Zookeeper_clientretry::testRetry ZooKeeper server started ZooKeeper server started : elapsed 9315 : OK Zookeeper_operations::testAsyncWatcher1 : assertion : elapsed 1054 Zookeeper_operations::testAsyncGetOperation : assertion : elapsed 1055 Zookeeper_operations::testOperationsAndDisconnectConcurrently1 : assertion : elapsed 1066 Zookeeper_operations::testOperationsAndDisconnectConcurrently2 : elapsed 0 : OK Zookeeper_operations::testConcurrentOperations1 : assertion : elapsed 1055 Zookeeper_init::testBasic : elapsed 1 : OK Zookeeper_init::testAddressResolution : elapsed 0 : OK Zookeeper_init::testMultipleAddressResolution : elapsed 0 : OK Zookeeper_init::testNullAddressString : elapsed 0 : OK Zookeeper_init::testEmptyAddressString : elapsed 0 : OK Zookeeper_init::testOneSpaceAddressString : elapsed 0 : OK Zookeeper_init::testTwoSpacesAddressString : elapsed 0 : OK Zookeeper_init::testInvalidAddressString1 : elapsed 0 : OK Zookeeper_init::testInvalidAddressString2 : elapsed 175 : OK Zookeeper_init::testNonexistentHost : elapsed 92 : OK Zookeeper_init::testOutOfMemory_init : elapsed 0 : OK Zookeeper_init::testOutOfMemory_getaddrs1 : elapsed 0 : OK Zookeeper_init::testOutOfMemory_getaddrs2 : elapsed 1 : OK 
Zookeeper_init::testPermuteAddrsList : elapsed 0 : OK Zookeeper_close::testIOThreadStoppedOnExpire : assertion : elapsed 1056 Zookeeper_close::testCloseUnconnected : elapsed 0 : OK Zookeeper_close::testCloseUnconnected1 : elapsed 91 : OK Zookeeper_close::testCloseConnected1 : assertion : elapsed 1056 Zookeeper_close::testCloseFromWatcher1 : assertion : elapsed 1076 Zookeeper_simpleSystem::testAsyncWatcherAutoReset ZooKeeper server started : elapsed 12155 : OK Zookeeper_simpleSystem::testDeserializeString : elapsed 0 : OK Zookeeper_simpleSystem::testNullData : elapsed 1031 : OK Zookeeper_simpleSystem::testIPV6 : elapsed 1005 : OK Zookeeper_simpleSystem::testPath : elapsed 1024 : OK Zookeeper_simpleSystem::testPathValidation : elapsed 1053 : OK Zookeeper_simpleSystem::testPing : elapsed 17287 : OK Zookeeper_simpleSystem::testAcl : elapsed 1019 : OK Zookeeper_simpleSystem::testChroot : elapsed 3052 : OK Zookeeper_simpleSystem::testAuth : assertion : elapsed 7010 Zookeeper_simpleSystem::testHangingClient : elapsed 1015 : OK Zookeeper_simpleSystem::testWatcherAutoResetWithGlobal ZooKeeper server started ZooKeeper server started ZooKeeper server started : elapsed 20556 : OK Zookeeper_simpleSystem::testWatcherAutoResetWithLocal ZooKeeper server started ZooKeeper server started ZooKeeper server started : elapsed 20563 : OK Zookeeper_simpleSystem::testGetChildren2 : elapsed 1041 : OK Zookeeper_multi::testCreate : elapsed 1017 : OK Zookeeper_multi::testCreateDelete : elapsed 1007 : OK Zookeeper_multi::testInvalidVersion : elapsed 1011 : OK Zookeeper_multi::testNestedCreate : elapsed 1009 : OK Zookeeper_multi::testSetData : elapsed 6019 : OK Zookeeper_multi::testUpdateConflict : elapsed 1014 : OK Zookeeper_multi::testDeleteUpdateConflict : elapsed 1007 : OK Zookeeper_multi::testAsyncMulti : elapsed 2001 : OK Zookeeper_multi::testMultiFail : elapsed 1006 : OK Zookeeper_multi::testCheck : elapsed 1020 : OK Zookeeper_multi::testWatch : elapsed 2013 : OK 
Zookeeper_watchers::testDefaultSessionWatcher1 zktest-mt: tests/ZKMocks.cc:271: SyncedBoolCondition DeliverWatchersWrapper::isDelivered() const: Assertion `i<1000' failed. Aborted (core dumped) It would appear that the zookeeper connection does not transition to connected within the required time; I increased the time allowed but no change. Ubuntu raring has glibc 2.17; the test suite works fine on previous Ubuntu releases and this is the only difference that stood out. Interestingly, the cli_mt worked just fine connecting to the same zookeeper instance that the tests left lying around, so I'm assuming this is a test error rather than an actual bug. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-442) need a way to remove watches that are no longer of interest
[ https://issues.apache.org/jira/browse/ZOOKEEPER-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791557#comment-13791557 ] Mahadev konar commented on ZOOKEEPER-442: - Thanks Rakesh. Good to see the initiative. I'll read through the doc and get back to you. need a way to remove watches that are no longer of interest --- Key: ZOOKEEPER-442 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-442 Project: ZooKeeper Issue Type: New Feature Reporter: Benjamin Reed Assignee: Daniel Gómez Ferro Priority: Critical Fix For: 3.5.0 Attachments: Remove Watch API.pdf, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch Currently the only way a watch is cleared is to trigger it. We need a way to enumerate the outstanding watch objects, find which watch events the objects are watching for, and remove interest in an event. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (ZOOKEEPER-1791) ZooKeeper package includes unnecessary jars that are part of the package.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1791: - Attachment: ZOOKEEPER-1791.patch ZooKeeper package includes unnecessary jars that are part of the package. - Key: ZOOKEEPER-1791 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1791 Project: ZooKeeper Issue Type: Bug Components: build Affects Versions: 3.5.0 Reporter: Mahadev konar Assignee: Mahadev konar Fix For: 3.5.0 Attachments: ZOOKEEPER-1791.patch ZooKeeper package includes unnecessary jars that are part of the package. Packages like fatjar and
{code}
maven-ant-tasks-2.1.3.jar
maven-artifact-2.2.1.jar
maven-artifact-manager-2.2.1.jar
maven-error-diagnostics-2.2.1.jar
maven-model-2.2.1.jar
maven-plugin-registry-2.2.1.jar
maven-profile-2.2.1.jar
maven-project-2.2.1.jar
maven-repository-metadata-2.2.1.jar
{code}
are part of the zookeeper package and rpm (via bigtop). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (ZOOKEEPER-1791) ZooKeeper package includes unnecessary jars that are part of the package.
Mahadev konar created ZOOKEEPER-1791: Summary: ZooKeeper package includes unnecessary jars that are part of the package. Key: ZOOKEEPER-1791 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1791 Project: ZooKeeper Issue Type: Bug Components: build Affects Versions: 3.5.0 Reporter: Mahadev konar Assignee: Mahadev konar Fix For: 3.5.0 Attachments: ZOOKEEPER-1791.patch ZooKeeper package includes unnecessary jars that are part of the package. Packages like fatjar and
{code}
maven-ant-tasks-2.1.3.jar
maven-artifact-2.2.1.jar
maven-artifact-manager-2.2.1.jar
maven-error-diagnostics-2.2.1.jar
maven-model-2.2.1.jar
maven-plugin-registry-2.2.1.jar
maven-profile-2.2.1.jar
maven-project-2.2.1.jar
maven-repository-metadata-2.2.1.jar
{code}
are part of the zookeeper package and rpm (via bigtop). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-900) FLE implementation should be improved to use non-blocking sockets
[ https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790067#comment-13790067 ] Mahadev konar commented on ZOOKEEPER-900: - [~phunt] I think we can close this one in favor of another jira. FLE implementation should be improved to use non-blocking sockets - Key: ZOOKEEPER-900 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900 Project: ZooKeeper Issue Type: Bug Reporter: Vishal Kher Assignee: Vishal Kher Priority: Critical Fix For: 3.5.0 Attachments: ZOOKEEPER-900.patch, ZOOKEEPER-900.patch1, ZOOKEEPER-900.patch2 From earlier email exchanges: 1. Blocking connects and accepts: a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnect. AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires; otherwise, when the timer expires it interrupts the AsyncConnect thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is a big blocker for us). Note that it is OK to do blocking IO in SenderWorker and RecvWorker threads since they block IO to the respective peer. b) The blocking IO problem is not restricted to connectOne(); it also occurs in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get the peer's info (s.read(msgBuffer)). 
Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connectOne(). Also, the code has an inherent cycle. initiateConnection() and receiveConnection() will have to be very carefully synchronized; otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify. Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822 -- This message was sent by Atlassian JIRA (v6.1#6144)
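The Selector-based approach the report argues for can be sketched as follows. This is illustrative code, not the actual FLE/QuorumCnxManager implementation: a non-blocking connect is registered for OP_CONNECT and the wait is bounded by select(timeout), so a dead peer cannot stall the calling thread the way a blocking connect can.

```java
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hedged sketch of a bounded, Selector-based connect; names and structure
// are illustrative, not ZooKeeper's actual election code.
public class NonBlockingConnectSketch {

    // Attempt a connect but give up after timeoutMs, using a Selector
    // instead of a helper thread plus timer. Returns true on success.
    static boolean connectWithTimeout(InetSocketAddress addr, long timeoutMs)
            throws Exception {
        SocketChannel ch = SocketChannel.open();
        try {
            ch.configureBlocking(false);
            if (ch.connect(addr)) {
                return true; // completed immediately (possible on loopback)
            }
            try (Selector sel = Selector.open()) {
                ch.register(sel, SelectionKey.OP_CONNECT);
                if (sel.select(timeoutMs) == 0) {
                    return false; // timed out; the caller is never stuck
                }
                return ch.finishConnect();
            }
        } finally {
            ch.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Connect to a local listener so the sketch is self-contained.
        try (ServerSocketChannel srv = ServerSocketChannel.open()) {
            srv.socket().bind(new InetSocketAddress("127.0.0.1", 0));
            InetSocketAddress addr =
                (InetSocketAddress) srv.socket().getLocalSocketAddress();
            System.out.println("connected=" + connectWithTimeout(addr, 2000));
        }
    }
}
```

The same pattern extends to accepts and reads (OP_ACCEPT, OP_READ), which is what would keep the Listener thread responsive.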
[jira] [Updated] (ZOOKEEPER-1147) Add support for local sessions
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1147: - Attachment: ZOOKEEPER-1147.patch The current patch has a minor conflict and fails to apply on QuorumPeerMain.java - attaching a new one which fixes the conflict. Add support for local sessions -- Key: ZOOKEEPER-1147 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1147 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.3.3 Reporter: Vishal Kathuria Assignee: Thawan Kooburat Labels: api-change, scaling Fix For: 3.5.0 Attachments: ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch Original Estimate: 840h Remaining Estimate: 840h This improvement is in the bucket of making ZooKeeper work at a large scale. We are planning on having about 1 million clients connect to a ZooKeeper ensemble through a set of 50-100 observers. The majority of these clients are read-only, i.e. they do not do any updates or create ephemeral nodes. In ZooKeeper today, the client creates a session and the session creation is handled like any other update. In the above use case, the session create/drop workload can easily overwhelm an ensemble. The following is a proposal for a local session, to support a larger number of connections. 1. The idea is to introduce a new type of session - the local session. A local session doesn't have the full functionality of a normal session. 2. Local sessions cannot create ephemeral nodes. 3. Once a local session is lost, you cannot re-establish it using the session-id/password. The session and its watches are gone for good. 4. When a local session connects, the session info is only maintained on the zookeeper server (in this case, an observer) that it is connected to. 
The leader is not aware of the creation of such a session and there is no state written to disk. 5. Pings and expiration are handled by the server that the session is connected to. With the above changes, we can make ZooKeeper scale to a much larger number of clients without making the core ensemble a bottleneck. In terms of API, there are two options being considered: 1. Let the client specify at connect time which kind of session they want. 2. All sessions connect as local sessions and automatically get promoted to global sessions when they do an operation that requires a global session (e.g. creating an ephemeral node). Chubby took the approach of lazily promoting all sessions to global, but I don't think that would work in our case, where we want to keep sessions which never create ephemeral nodes as always local. Option 2 would make it more broadly usable but option 1 would be easier to implement. We are thinking of implementing option 1 as the first cut. There would be a client flag, IsLocalSession (much like the current readOnly flag), that would be used to determine whether to create a local session or a global session. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1147) Add support for local sessions
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788998#comment-13788998 ] Mahadev konar commented on ZOOKEEPER-1147: -- [~fpj] looks like the patch is ready to get in. You want to look through before we commit? Add support for local sessions -- Key: ZOOKEEPER-1147 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1147 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.3.3 Reporter: Vishal Kathuria Assignee: Thawan Kooburat Labels: api-change, scaling Fix For: 3.5.0 Attachments: ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch Original Estimate: 840h Remaining Estimate: 840h This improvement is in the bucket of making ZooKeeper work at a large scale. We are planning on having about 1 million clients connect to a ZooKeeper ensemble through a set of 50-100 observers. The majority of these clients are read-only, i.e. they do not do any updates or create ephemeral nodes. In ZooKeeper today, the client creates a session and the session creation is handled like any other update. In the above use case, the session create/drop workload can easily overwhelm an ensemble. The following is a proposal for a local session, to support a larger number of connections. 1. The idea is to introduce a new type of session - the local session. A local session doesn't have the full functionality of a normal session. 2. Local sessions cannot create ephemeral nodes. 3. Once a local session is lost, you cannot re-establish it using the session-id/password. The session and its watches are gone for good. 4. When a local session connects, the session info is only maintained on the zookeeper server (in this case, an observer) that it is connected to. 
The leader is not aware of the creation of such a session and there is no state written to disk. 5. Pings and expiration are handled by the server that the session is connected to. With the above changes, we can make ZooKeeper scale to a much larger number of clients without making the core ensemble a bottleneck. In terms of API, there are two options being considered: 1. Let the client specify at connect time which kind of session they want. 2. All sessions connect as local sessions and automatically get promoted to global sessions when they do an operation that requires a global session (e.g. creating an ephemeral node). Chubby took the approach of lazily promoting all sessions to global, but I don't think that would work in our case, where we want to keep sessions which never create ephemeral nodes as always local. Option 2 would make it more broadly usable but option 1 would be easier to implement. We are thinking of implementing option 1 as the first cut. There would be a client flag, IsLocalSession (much like the current readOnly flag), that would be used to determine whether to create a local session or a global session. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-442) need a way to remove watches that are no longer of interest
[ https://issues.apache.org/jira/browse/ZOOKEEPER-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788503#comment-13788503 ] Mahadev konar commented on ZOOKEEPER-442: - [~eribeiro] if you are interested, feel free to take it up. I'd be happy to provide guidance/other help on this. Thanks need a way to remove watches that are no longer of interest --- Key: ZOOKEEPER-442 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-442 Project: ZooKeeper Issue Type: New Feature Reporter: Benjamin Reed Assignee: Daniel Gómez Ferro Priority: Critical Fix For: 3.5.0 Attachments: ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch Currently the only way a watch is cleared is to trigger it. We need a way to enumerate the outstanding watch objects, find which watch events the objects are watching for, and remove interest in an event. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (ZOOKEEPER-1696) Fail to run zookeeper client on Weblogic application server
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1696: - Fix Version/s: 3.4.6 Fail to run zookeeper client on Weblogic application server --- Key: ZOOKEEPER-1696 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1696 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.4.5 Environment: Java version: jdk170_06 WebLogic Server Version: 10.3.6.0 Reporter: Dmitry Konstantinov Assignee: Jeffrey Zhong Priority: Critical Fix For: 3.4.6 Attachments: zookeeper-1696.patch The problem in details is described here: http://comments.gmane.org/gmane.comp.java.zookeeper.user/2897 The provided link also contains a reference to fix implementation. {noformat} Apr 24, 2013 1:03:28 PM MSK Warning org.apache.zookeeper.ClientCnxn devapp090 clust2 [ACTIVE] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (devapp090:2182) internal 1366794208810 BEA-00 WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.lang.IllegalArgumentException: No Configuration was registered that can handle the configuration named Client at com.bea.common.security.jdkutils.JAASConfiguration.getAppConfigurationEntry(JAASConfiguration.java:130) at org.apache.zookeeper.client.ZooKeeperSaslClient.init(ZooKeeperSaslClient.java:97) at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:943) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:993) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (ZOOKEEPER-1733) FLETest#testLE is flaky on windows boxes
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1733: - Fix Version/s: 3.4.6 FLETest#testLE is flaky on windows boxes Key: ZOOKEEPER-1733 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1733 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.4.5 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Fix For: 3.4.6 Attachments: zookeeper-1733.patch FLETest#testLE fails intermittently on windows boxes. The reason is that in LEThread#run() we have:
{code}
if (leader == i) {
    synchronized (finalObj) {
        successCount++;
        if (successCount > (count/2))
            finalObj.notify();
    }
    break;
}
{code}
Basically, once we have a confirmed leader, the leader thread dies due to the break out of the while loop. Then, in the verification step, we check whether the leader thread is alive as follows:
{code}
if (threads.get((int) leader).isAlive()) {
    Assert.fail("Leader hasn't joined: " + leader);
}
{code}
On windows boxes, the above verification step fails frequently because the leader thread has most likely already exited. Do we know why we have the leader-alive verification step, given that only the lead thread can bump successCount above count/2? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
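The timing dependence described above can be seen in isolation: a point-in-time isAlive() check on a thread that exits on its own races with the thread's shutdown, while a bounded join() waits for it deterministically. A small illustrative sketch (not the FLETest code):

```java
// Sketch of the race behind the flaky check: the leader thread breaks out
// of its loop after notifying, so sampling isAlive() at an arbitrary moment
// is timing-dependent, while join(timeout) is not. Illustrative only.
public class JoinInsteadOfIsAlive {
    public static void main(String[] args) throws Exception {
        Thread leader = new Thread(() -> {
            // simulate "confirmed leader, then break out of the loop"
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
        });
        leader.start();
        // Flaky: the result depends on when we sample it.
        boolean aliveNow = leader.isAlive();
        // Deterministic: wait (bounded) for the thread to terminate.
        leader.join(2000);
        System.out.println("aliveNow=" + aliveNow
                + " afterJoin=" + leader.isAlive());
    }
}
```

After a successful join the thread is guaranteed not alive, which is the property a verification step actually wants to assert.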
[jira] [Updated] (ZOOKEEPER-1733) FLETest#testLE is flaky on windows boxes
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1733: - Fix Version/s: (was: 3.4.6) 3.5.0 FLETest#testLE is flaky on windows boxes Key: ZOOKEEPER-1733 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1733 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.4.5 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Fix For: 3.5.0 Attachments: zookeeper-1733.patch FLETest#testLE fails intermittently on windows boxes. The reason is that in LEThread#run() we have:
{code}
if (leader == i) {
    synchronized (finalObj) {
        successCount++;
        if (successCount > (count/2))
            finalObj.notify();
    }
    break;
}
{code}
Basically, once we have a confirmed leader, the leader thread dies due to the break out of the while loop. Then, in the verification step, we check whether the leader thread is alive as follows:
{code}
if (threads.get((int) leader).isAlive()) {
    Assert.fail("Leader hasn't joined: " + leader);
}
{code}
On windows boxes, the above verification step fails frequently because the leader thread has most likely already exited. Do we know why we have the leader-alive verification step, given that only the lead thread can bump successCount above count/2? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1733) FLETest#testLE is flaky on windows boxes
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770309#comment-13770309 ] Mahadev konar commented on ZOOKEEPER-1733: -- Running this through jenkins. FLETest#testLE is flaky on windows boxes Key: ZOOKEEPER-1733 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1733 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.4.5 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Fix For: 3.5.0 Attachments: zookeeper-1733.patch FLETest#testLE fails intermittently on windows boxes. The reason is that in LEThread#run() we have:
{code}
if (leader == i) {
    synchronized (finalObj) {
        successCount++;
        if (successCount > (count/2))
            finalObj.notify();
    }
    break;
}
{code}
Basically, once we have a confirmed leader, the leader thread dies due to the break out of the while loop. Then, in the verification step, we check whether the leader thread is alive as follows:
{code}
if (threads.get((int) leader).isAlive()) {
    Assert.fail("Leader hasn't joined: " + leader);
}
{code}
On windows boxes, the above verification step fails frequently because the leader thread has most likely already exited. Do we know why we have the leader-alive verification step, given that only the lead thread can bump successCount above count/2? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (ZOOKEEPER-1751) ClientCnxn#run could miss the second ping or connection get dropped before a ping
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1751: - Fix Version/s: 3.4.6 ClientCnxn#run could miss the second ping or connection get dropped before a ping - Key: ZOOKEEPER-1751 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1751 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.4.5 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Fix For: 3.4.6 Attachments: zookeeper-1751.patch We could throw a SessionTimeoutException even when timeToNextPing is negative, depending on when the following line is executed by the thread, because we check for timeout before sending a ping.
{code}
to = readTimeout - clientCnxnSocket.getIdleRecv();
{code}
In addition, we only ping twice no matter how long the session timeout value is. For example, if we set the session timeout to 60 minutes, we only try to ping twice in a 40-minute window. Therefore, the connection could be dropped by the OS after its idle timeout. This issue causes random connection-loss or session-expired errors on the client side, which is bad for applications like HBase. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
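The arithmetic behind this report can be sketched as below. The formulas are paraphrased from the description (the client schedules pings at roughly half the read timeout, and the quoted line computes the remaining time against idle-receive time); the method names are illustrative, not ClientCnxn's actual fields.

```java
// Illustrative arithmetic for the ping scheduling discussed above. With a
// large session timeout, the gap until the next ping is huge, and the
// remaining-time value 'to' can go negative before a ping is ever sent.
public class PingMathSketch {
    // Roughly how the client decides when to ping next (readTimeout/2).
    static int timeToNextPing(int readTimeoutMs, int idleSendMs) {
        return readTimeoutMs / 2 - idleSendMs;
    }

    // The quoted line: to = readTimeout - clientCnxnSocket.getIdleRecv();
    static int remainingBeforeTimeout(int readTimeoutMs, int idleRecvMs) {
        return readTimeoutMs - idleRecvMs;
    }

    public static void main(String[] args) {
        int readTimeout = 40 * 60 * 1000; // 40-minute window from the example
        // 20 minutes of silence before the first ping:
        System.out.println(timeToNextPing(readTimeout, 0));
        // Negative once the connection has been idle longer than readTimeout:
        System.out.println(remainingBeforeTimeout(readTimeout, 41 * 60 * 1000));
    }
}
```

A 20-minute gap between pings comfortably exceeds common OS/firewall idle timeouts, which is how the connection can be dropped before the client ever notices.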
[jira] [Commented] (ZOOKEEPER-1696) Fail to run zookeeper client on Weblogic application server
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770315#comment-13770315 ] Mahadev konar commented on ZOOKEEPER-1696: -- +1 for the patch. Given it ran through jenkins committing this to 3.4 and trunk. Fail to run zookeeper client on Weblogic application server --- Key: ZOOKEEPER-1696 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1696 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.4.5 Environment: Java version: jdk170_06 WebLogic Server Version: 10.3.6.0 Reporter: Dmitry Konstantinov Assignee: Jeffrey Zhong Priority: Critical Fix For: 3.4.6 Attachments: zookeeper-1696.patch The problem in details is described here: http://comments.gmane.org/gmane.comp.java.zookeeper.user/2897 The provided link also contains a reference to fix implementation. {noformat} Apr 24, 2013 1:03:28 PM MSK Warning org.apache.zookeeper.ClientCnxn devapp090 clust2 [ACTIVE] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (devapp090:2182) internal 1366794208810 BEA-00 WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.lang.IllegalArgumentException: No Configuration was registered that can handle the configuration named Client at com.bea.common.security.jdkutils.JAASConfiguration.getAppConfigurationEntry(JAASConfiguration.java:130) at org.apache.zookeeper.client.ZooKeeperSaslClient.init(ZooKeeperSaslClient.java:97) at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:943) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:993) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1657) Increased CPU usage by unnecessary SASL checks
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13761595#comment-13761595 ] Mahadev konar commented on ZOOKEEPER-1657: -- +1 for the patch. Looks good. Thanks Eugene/Flavio. Increased CPU usage by unnecessary SASL checks -- Key: ZOOKEEPER-1657 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1657 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.4.5 Reporter: Gunnar Wagenknecht Assignee: Philip K. Warren Labels: performance Fix For: 3.5.0, 3.4.6 Attachments: ZOOKEEPER-1657.patch, ZOOKEEPER-1657.patch, ZOOKEEPER-1657.patch, ZOOKEEPER-1657.patch, ZOOKEEPER-1657.patch, zookeeper-hotspot-gone.png, zookeeper-hotspot.png I did some profiling in one of our Java environments and found an interesting footprint in ZooKeeper. The SASL support seems to trigger many times on the client although it's not even in use. Is there a switch to disable SASL completely? The attached screenshot shows a 10-minute profiling session on one of our production Jetty servers. The Jetty server handles ~1k web requests per minute. The average response time per web request is a few milliseconds. The profiling was performed on a machine running for 24h. We noticed a significant CPU increase on our servers when deploying an update from ZooKeeper 3.3.2 to ZooKeeper 3.4.5. Thus, we started investigating. The screenshot shows that only 32% of CPU time is spent in Jetty. In contrast, 65% is spent in ZooKeeper. A few notes/thoughts: * {{ClientCnxn$SendThread.clientTunneledAuthenticationInProgress}} seems to be the culprit * {{javax.security.auth.login.Configuration.getConfiguration}} seems to be called very often? * There is quite a bit of reflection involved in {{java.security.AccessController.doPrivileged}} * No security manager is active in the JVM: I tend to place an if-check in the code before calling {{AccessController.doPrivileged}}. 
When no security manager is installed, the runnable can be called directly, which saves cycles.
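The if-check the reporter suggests can be sketched as below. This is an illustrative helper, not the patched ZooKeeper code; on recent JDKs both System.getSecurityManager and AccessController.doPrivileged are deprecated but still functional.

```java
import java.security.AccessController;
import java.security.PrivilegedAction;

// Sketch of the suggested fast path: skip AccessController.doPrivileged (and
// the access-control-context capture it implies) when no SecurityManager is
// installed, which is the common case for server-side deployments.
public class PrivilegedFastPath {
    static <T> T runPrivileged(PrivilegedAction<T> action) {
        if (System.getSecurityManager() == null) {
            return action.run();            // direct call, no privileged frame
        }
        return AccessController.doPrivileged(action);
    }

    public static void main(String[] args) {
        int r = runPrivileged(() -> 42);
        if (r != 42) throw new AssertionError();
    }
}
```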
[jira] [Commented] (ZOOKEEPER-767) Submitting Demo/Recipe Shared / Exclusive Lock Code
[ https://issues.apache.org/jira/browse/ZOOKEEPER-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13658530#comment-13658530 ] Mahadev konar commented on ZOOKEEPER-767: - Flavio, Agreed, I think it's definitely a better match for Curator. Submitting Demo/Recipe Shared / Exclusive Lock Code --- Key: ZOOKEEPER-767 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-767 Project: ZooKeeper Issue Type: Improvement Components: recipes Affects Versions: 3.3.0 Reporter: Sam Baskinger Assignee: Sam Baskinger Priority: Minor Fix For: 3.5.0 Attachments: ZOOKEEPER-767.patch, ZOOKEEPER-767.patch, ZOOKEEPER-767.patch, ZOOKEEPER-767.patch, ZOOKEEPER-767.patch, ZOOKEEPER-767.patch Time Spent: 8h Networked Insights would like to share back some code for shared/exclusive locking that we are using in our labs.
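For context, the decision logic of the usual ZooKeeper shared/exclusive lock recipe can be sketched without a live ensemble. This is not the attached patch: the node-name scheme ("read-NNNN"/"write-NNNN" sequential znodes under the lock path) is the standard recipe convention, used here purely for illustration.

```java
import java.util.List;

// Pure-logic sketch of the shared/exclusive lock recipe: each contender
// creates a sequential znode; a read lock is granted unless a write node has
// a smaller sequence, and a write lock only when its node is the smallest.
public class RwLockRecipe {
    static int seq(String node) {
        return Integer.parseInt(node.substring(node.lastIndexOf('-') + 1));
    }

    static boolean readLockGranted(List<String> children, int mySeq) {
        return children.stream()
                .noneMatch(c -> c.startsWith("write-") && seq(c) < mySeq);
    }

    static boolean writeLockGranted(List<String> children, int mySeq) {
        return children.stream().noneMatch(c -> seq(c) < mySeq);
    }

    public static void main(String[] args) {
        List<String> children = List.of("read-0001", "write-0002", "read-0003");
        if (!readLockGranted(children, 1)) throw new AssertionError();  // no earlier writer
        if (readLockGranted(children, 3)) throw new AssertionError();   // blocked by write-0002
        if (writeLockGranted(children, 2)) throw new AssertionError();  // read-0001 is ahead
    }
}
```

In the real recipe, a blocked contender watches the next-lower conflicting node rather than polling the child list.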
[jira] [Updated] (ZOOKEEPER-1686) Publish ZK 3.4.5 test jar
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1686: - Assignee: Mahadev konar Publish ZK 3.4.5 test jar - Key: ZOOKEEPER-1686 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1686 Project: ZooKeeper Issue Type: Bug Components: build, tests Affects Versions: 3.4.5 Reporter: Todd Lipcon Assignee: Mahadev konar ZooKeeper 3.4.2 used to publish a jar with the tests classifier for use by downstream project tests. It seems this didn't get published for 3.4.4 or 3.4.5 (see https://repository.apache.org/index.html#nexus-search;quick~org.apache.zookeeper). Would someone mind please publishing these artifacts? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592041#comment-13592041 ] Mahadev konar commented on ZOOKEEPER-1382: -- Michael, Would you be able to upload a patch for trunk as well? Zookeeper server holds onto dead/expired session ids in the watch data structures - Key: ZOOKEEPER-1382 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.5 Reporter: Neha Narkhede Assignee: Neha Narkhede Priority: Critical Fix For: 3.4.6 Attachments: ZOOKEEPER-1382_3.3.4.patch, ZOOKEEPER-1382-branch-3.4.patch I've observed that zookeeper server holds onto expired session ids in the watcher data structures. The result is the wchp command reports session ids that cannot be found through cons/dump and those expired session ids sit there maybe until the server is restarted. Here are snippets from the client and the server logs that lead to this state, for one particular session id 0x134485fd7bcb26f - There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 and I'm using ZkClient to connect to the cluster From the application log - application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] [main-SendThread(223.prod:12913)] [application Session establishment complete on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, negotiated timeout = 6000 application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] [main-SendThread(223.prod:12913)] [application] Client session timed out, have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing socket connection and attempting reconnect application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] [main-SendThread(226.prod:12913)] [application] Unable to reconnect to ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket connection On the leader zk, 225 - 
zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, timeout of 6000ms exceeded zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination for sessionid: 0x134485fd7bcb26f On the server, the client was initially connected to, 223 - zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO [CommitProcessor:1:NIOServerCnxn@1580] - Established session 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020 zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f Here are the log snippets from 226, which is the server, the client reconnected to, before getting session expired event - 2012-01-27 09:52:38,190 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367 2012-01-27 09:52:38,191 - INFO [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired 2012-01-27 09:52:38,191 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed socket connection for client /172.17.136.82:49367 which had sessionid 0x134485fd7bcb26f wchp output from 226, taken on 01/30 - nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *226.*wchp* | wc -l 3 wchp output from 223, taken on 01/30 - nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *223.*wchp* | wc -l 0 cons output from 223 and 226, taken on 01/30 - nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *226.*cons* | wc -l 0 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *223.*cons* | wc -l 0 So, what seems to have happened is that the 
client was able to re-register the watches on the new server (226) after it got disconnected from 223, in spite of having an expired session id. In NIOServerCnxn, I saw that after suspecting that a session is expired, a server removes the cnxn and its watches from its internal data structures. But before that it allows more requests to be processed even if the session is expired - // Now that the session is ready we can start receiving packets synchronized (this.factory) { sk.selector().wakeup(); enableRecv(); } } catch
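The ordering bug described above can be shown with a tiny model: if the server validates the session only after enabling receive, watches for an expired session leak into the watch tables (visible in wchp but not in cons). All names below are illustrative stand-ins, not the real server types.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal model: check session validity BEFORE accepting watch registrations,
// so an expired session can never add entries to the watch tables.
public class SessionRenewModel {
    final Set<Long> liveSessions = new HashSet<>();
    final Map<Long, Set<String>> watches = new HashMap<>();

    boolean renewAndRegister(long sessionId, String watchPath) {
        if (!liveSessions.contains(sessionId)) {
            return false;   // expired: close the cnxn, register nothing
        }
        watches.computeIfAbsent(sessionId, k -> new HashSet<>()).add(watchPath);
        return true;
    }

    public static void main(String[] args) {
        SessionRenewModel srv = new SessionRenewModel();
        long expired = 0x134485fd7bcb26fL;   // the session id from the logs above
        if (srv.renewAndRegister(expired, "/a")) throw new AssertionError();
        if (srv.watches.containsKey(expired)) throw new AssertionError(); // no leak
        srv.liveSessions.add(1L);
        if (!srv.renewAndRegister(1L, "/b")) throw new AssertionError();  // live path works
    }
}
```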
[jira] [Commented] (ZOOKEEPER-1551) Observer ignore txns that comes after snapshot and UPTODATE
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592043#comment-13592043 ] Mahadev konar commented on ZOOKEEPER-1551: -- [~fpj] would you be able to review the latest patch? Observer ignore txns that comes after snapshot and UPTODATE Key: ZOOKEEPER-1551 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1551 Project: ZooKeeper Issue Type: Bug Components: quorum, server Affects Versions: 3.4.3 Reporter: Thawan Kooburat Assignee: Thawan Kooburat Priority: Blocker Fix For: 3.5.0, 3.4.6 Attachments: ZOOKEEPER-1551.patch, ZOOKEEPER-1551.patch In Learner.java, txns which come after the learner has taken the snapshot (after the NEWLEADER packet) are stored in packetsNotCommitted. The follower has special logic to apply these txns at the end of the syncWithLeader() method. However, the observer will ignore these txns completely, causing data inconsistency.
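The idea of the fix can be sketched as follows. This is not the patch itself: the data tree is modeled as a list of applied zxids, and the point is only that the queued packetsNotCommitted must be replayed at the end of sync by observers too, just as followers do.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch: txns received after the snapshot (queued in packetsNotCommitted)
// are drained and applied at the end of sync; skipping this drain is exactly
// the observer-side data loss described in the issue.
public class ObserverSync {
    static List<Long> syncWithLeader(Queue<Long> packetsNotCommitted) {
        List<Long> applied = new ArrayList<>();
        // ... snapshot restore and UPTODATE handling elided ...
        Long zxid;
        while ((zxid = packetsNotCommitted.poll()) != null) {
            applied.add(zxid);   // apply each queued txn to the data tree
        }
        return applied;
    }

    public static void main(String[] args) {
        Queue<Long> pending = new ArrayDeque<>(List.of(101L, 102L));
        if (!syncWithLeader(pending).equals(List.of(101L, 102L)))
            throw new AssertionError();
        if (!pending.isEmpty()) throw new AssertionError();  // queue fully drained
    }
}
```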
[jira] [Updated] (ZOOKEEPER-1657) Increased CPU usage by unnecessary SASL checks
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1657: - Fix Version/s: 3.4.6 3.5.0 Increased CPU usage by unnecessary SASL checks -- Key: ZOOKEEPER-1657 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1657 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.4.5 Reporter: Gunnar Wagenknecht Labels: performance Fix For: 3.5.0, 3.4.6 Attachments: ZOOKEEPER-1657.patch, ZOOKEEPER-1657.patch, ZOOKEEPER-1657.patch, zookeeper-hotspot.png I did some profiling in one of our Java environments and found an interesting footprint in ZooKeeper. The SASL support seems to trigger many times on the client although it's not even in use. Is there a switch to disable SASL completely? The attached screenshot shows a 10-minute profiling session on one of our production Jetty servers. The Jetty server handles ~1k web requests per minute. The average response time per web request is a few milliseconds. The profiling was performed on a machine running for 24h. We noticed a significant CPU increase on our servers when deploying an update from ZooKeeper 3.3.2 to ZooKeeper 3.4.5. Thus, we started investigating. The screenshot shows that only 32% of CPU time is spent in Jetty. In contrast, 65% is spent in ZooKeeper. A few notes/thoughts: * {{ClientCnxn$SendThread.clientTunneledAuthenticationInProgress}} seems to be the culprit * {{javax.security.auth.login.Configuration.getConfiguration}} seems to be called very often? * There is quite a bit of reflection involved in {{java.security.AccessController.doPrivileged}} * No security manager is active in the JVM: I tend to place an if-check in the code before calling {{AccessController.doPrivileged}}. When no security manager is installed, the runnable can be called directly, which saves cycles.
[jira] [Commented] (ZOOKEEPER-1147) Add support for local sessions
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556316#comment-13556316 ] Mahadev konar commented on ZOOKEEPER-1147: -- bq. Yes, a session retains the same ID when it is upgraded from local session to global session. I think this is desirable. Can you elaborate why this may cause problem? Yes its desirable. Before I comment on what I think might be wrong, when does the server who has the local sessionid remove it from its data structures? Is it when it gets a response from in final request processor about the session creation? Until then the session is in a local session? Add support for local sessions -- Key: ZOOKEEPER-1147 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1147 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.3.3 Reporter: Vishal Kathuria Assignee: Thawan Kooburat Labels: api-change, scaling Fix For: 3.5.0 Attachments: ZOOKEEPER-1147.patch Original Estimate: 840h Remaining Estimate: 840h This improvement is in the bucket of making ZooKeeper work at a large scale. We are planning on having about a 1 million clients connect to a ZooKeeper ensemble through a set of 50-100 observers. Majority of these clients are read only - ie they do not do any updates or create ephemeral nodes. In ZooKeeper today, the client creates a session and the session creation is handled like any other update. In the above use case, the session create/drop workload can easily overwhelm an ensemble. The following is a proposal for a local session, to support a larger number of connections. 1. The idea is to introduce a new type of session - local session. A local session doesn't have a full functionality of a normal session. 2. Local sessions cannot create ephemeral nodes. 3. Once a local session is lost, you cannot re-establish it using the session-id/password. The session and its watches are gone for good. 4. 
When a local session connects, the session info is only maintained on the zookeeper server (in this case, an observer) that it is connected to. The leader is not aware of the creation of such a session and there is no state written to disk. 5. The pings and expiration is handled by the server that the session is connected to. With the above changes, we can make ZooKeeper scale to a much larger number of clients without making the core ensemble a bottleneck. In terms of API, there are two options that are being considered 1. Let the client specify at the connect time which kind of session do they want. 2. All sessions connect as local sessions and automatically get promoted to global sessions when they do an operation that requires a global session (e.g. creating an ephemeral node) Chubby took the approach of lazily promoting all sessions to global, but I don't think that would work in our case, where we want to keep sessions which never create ephemeral nodes as always local. Option 2 would make it more broadly usable but option 1 would be easier to implement. We are thinking of implementing option 1 as the first cut. There would be a client flag, IsLocalSession (much like the current readOnly flag) that would be used to determine whether to create a local session or a global session. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
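The upgrade path debated in the comments (option 2: a session keeps its id when promoted from local to global) can be sketched with a toy tracker. All types and method names here are illustrative, not the actual server classes; the timing question in the comment (when the local entry is removed) is modeled as "when the quorum commits the create-session".

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of local-to-global session promotion: the same session id moves from
// the observer's local tracker to the global (quorum-backed) tracker, so the
// client never re-handshakes.
public class LocalSessionUpgrade {
    final Set<Long> localSessions = new HashSet<>();
    final Set<Long> globalSessions = new HashSet<>();

    void connectLocal(long id) { localSessions.add(id); }

    /** Called when the quorum commits the create-session for this id. */
    void commitUpgrade(long id) {
        localSessions.remove(id);   // now tracked (and persisted) by the leader
        globalSessions.add(id);     // same id survives the promotion
    }

    public static void main(String[] args) {
        LocalSessionUpgrade tracker = new LocalSessionUpgrade();
        tracker.connectLocal(7L);
        tracker.commitUpgrade(7L);
        if (tracker.localSessions.contains(7L)) throw new AssertionError();
        if (!tracker.globalSessions.contains(7L)) throw new AssertionError();
    }
}
```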
[jira] [Comment Edited] (ZOOKEEPER-1147) Add support for local sessions
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556316#comment-13556316 ] Mahadev konar edited comment on ZOOKEEPER-1147 at 1/17/13 3:42 PM: --- bq. Yes, a session retains the same ID when it is upgraded from local session to global session. I think this is desirable. Can you elaborate why this may cause problem? Yes its desirable. Before I comment on what I think might be wrong, when does the server who has the local sessionid remove it from its data structures? Is it when it gets a create session in final request processor? Until then the session is a local session? was (Author: mahadev): bq. Yes, a session retains the same ID when it is upgraded from local session to global session. I think this is desirable. Can you elaborate why this may cause problem? Yes its desirable. Before I comment on what I think might be wrong, when does the server who has the local sessionid remove it from its data structures? Is it when it gets a response from in final request processor about the session creation? Until then the session is in a local session? Add support for local sessions -- Key: ZOOKEEPER-1147 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1147 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.3.3 Reporter: Vishal Kathuria Assignee: Thawan Kooburat Labels: api-change, scaling Fix For: 3.5.0 Attachments: ZOOKEEPER-1147.patch Original Estimate: 840h Remaining Estimate: 840h This improvement is in the bucket of making ZooKeeper work at a large scale. We are planning on having about a 1 million clients connect to a ZooKeeper ensemble through a set of 50-100 observers. Majority of these clients are read only - ie they do not do any updates or create ephemeral nodes. In ZooKeeper today, the client creates a session and the session creation is handled like any other update. 
In the above use case, the session create/drop workload can easily overwhelm an ensemble. The following is a proposal for a local session, to support a larger number of connections. 1. The idea is to introduce a new type of session - local session. A local session doesn't have a full functionality of a normal session. 2. Local sessions cannot create ephemeral nodes. 3. Once a local session is lost, you cannot re-establish it using the session-id/password. The session and its watches are gone for good. 4. When a local session connects, the session info is only maintained on the zookeeper server (in this case, an observer) that it is connected to. The leader is not aware of the creation of such a session and there is no state written to disk. 5. The pings and expiration is handled by the server that the session is connected to. With the above changes, we can make ZooKeeper scale to a much larger number of clients without making the core ensemble a bottleneck. In terms of API, there are two options that are being considered 1. Let the client specify at the connect time which kind of session do they want. 2. All sessions connect as local sessions and automatically get promoted to global sessions when they do an operation that requires a global session (e.g. creating an ephemeral node) Chubby took the approach of lazily promoting all sessions to global, but I don't think that would work in our case, where we want to keep sessions which never create ephemeral nodes as always local. Option 2 would make it more broadly usable but option 1 would be easier to implement. We are thinking of implementing option 1 as the first cut. There would be a client flag, IsLocalSession (much like the current readOnly flag) that would be used to determine whether to create a local session or a global session. -- This message is automatically generated by JIRA. 
[jira] [Commented] (ZOOKEEPER-1147) Add support for local sessions
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13557012#comment-13557012 ] Mahadev konar commented on ZOOKEEPER-1147: -- [~thawan] I think the above scenario is ok. The only issue I think we have is the sensitive local sessions. Since we have had too many issues with disconnects and session expiry I think this might cause more issues than we already have. Is there something we can do here? I can't seem to find a way around it without doing client-side changes. Add support for local sessions -- Key: ZOOKEEPER-1147 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1147 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.3.3 Reporter: Vishal Kathuria Assignee: Thawan Kooburat Labels: api-change, scaling Fix For: 3.5.0 Attachments: ZOOKEEPER-1147.patch Original Estimate: 840h Remaining Estimate: 840h This improvement is in the bucket of making ZooKeeper work at a large scale. We are planning on having about 1 million clients connect to a ZooKeeper ensemble through a set of 50-100 observers. The majority of these clients are read-only - i.e. they do not do any updates or create ephemeral nodes. In ZooKeeper today, the client creates a session and the session creation is handled like any other update. In the above use case, the session create/drop workload can easily overwhelm an ensemble. The following is a proposal for a local session, to support a larger number of connections. 1. The idea is to introduce a new type of session - local session. A local session doesn't have the full functionality of a normal session. 2. Local sessions cannot create ephemeral nodes. 3. Once a local session is lost, you cannot re-establish it using the session-id/password. The session and its watches are gone for good. 4. When a local session connects, the session info is only maintained on the zookeeper server (in this case, an observer) that it is connected to. 
The leader is not aware of the creation of such a session and there is no state written to disk. 5. The pings and expiration is handled by the server that the session is connected to. With the above changes, we can make ZooKeeper scale to a much larger number of clients without making the core ensemble a bottleneck. In terms of API, there are two options that are being considered 1. Let the client specify at the connect time which kind of session do they want. 2. All sessions connect as local sessions and automatically get promoted to global sessions when they do an operation that requires a global session (e.g. creating an ephemeral node) Chubby took the approach of lazily promoting all sessions to global, but I don't think that would work in our case, where we want to keep sessions which never create ephemeral nodes as always local. Option 2 would make it more broadly usable but option 1 would be easier to implement. We are thinking of implementing option 1 as the first cut. There would be a client flag, IsLocalSession (much like the current readOnly flag) that would be used to determine whether to create a local session or a global session. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1621: - Assignee: Mahadev konar ZooKeeper does not recover from crash when disk was full Key: ZOOKEEPER-1621 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3 Environment: Ubuntu 12.04, Amazon EC2 instance Reporter: David Arthur Assignee: Mahadev konar Fix For: 3.5.0 Attachments: zookeeper.log.gz The disk that ZooKeeper was using filled up. During a snapshot write, I got the following exception 2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - Severe unrecoverable error, exiting java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306) at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484) at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162) at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101) Then many subsequent exceptions like: 2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was partial. 
2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected exception, exiting abnormally java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:504) at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130) at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223) at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259) at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386) at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138) at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112) at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86) at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78) It seems to me that writing the transaction log should be fully atomic to avoid such 
situations. Is this not the case?
[jira] [Commented] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13557022#comment-13557022 ] Mahadev konar commented on ZOOKEEPER-1621: -- Looks like the header was incomplete. Unfortunately we do not handle a corrupt header, but we do handle corrupt txns later. I am surprised that this happened twice in a row for two users. I'll upload a patch and test case. ZooKeeper does not recover from crash when disk was full Key: ZOOKEEPER-1621 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3 Environment: Ubuntu 12.04, Amazon EC2 instance Reporter: David Arthur Fix For: 3.5.0 Attachments: zookeeper.log.gz The disk that ZooKeeper was using filled up. During a snapshot write, I got the following exception 2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - Severe unrecoverable error, exiting java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306) at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484) at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162) at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101) Then many subsequent exceptions like: 2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was partial. 
2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected exception, exiting abnormally java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:504) at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130) at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223) at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259) at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386) at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138) at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112) at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86) at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78) It seems to me that writing the transaction log should be fully atomic to avoid such 
situations. Is this not the case?
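The recovery idea in the comment above (a partially written header should be treated like a corrupt trailing txn, i.e. skipped, rather than aborting startup with EOFException) can be sketched as follows. The 4-byte magic read is a hypothetical stand-in for FileHeader.deserialize; this is not the committed fix.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch: a log file whose header was only partially written (e.g. disk full
// during creation) is reported as headerless so the loader can skip it
// instead of crashing the server on startup.
public class TolerantHeaderRead {
    static boolean hasCompleteHeader(byte[] fileBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileBytes))) {
            in.readInt();            // magic; EOFException if fewer than 4 bytes
            return true;
        } catch (EOFException partial) {
            return false;            // truncated header: treat as empty log
        } catch (IOException e) {
            throw new UncheckedIOException(e);   // not expected from a byte array
        }
    }

    public static void main(String[] args) {
        if (!hasCompleteHeader(new byte[] {1, 2, 3, 4})) throw new AssertionError();
        if (hasCompleteHeader(new byte[] {1, 2})) throw new AssertionError(); // partial
    }
}
```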
[jira] [Updated] (ZOOKEEPER-1624) PrepRequestProcessor abort multi-operation incorrectly
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1624: - Fix Version/s: 3.5.0 PrepRequestProcessor abort multi-operation incorrectly -- Key: ZOOKEEPER-1624 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1624 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Thawan Kooburat Assignee: Thawan Kooburat Priority: Critical Fix For: 3.5.0 Attachments: ZOOKEEPER-1624.patch We found this issue when trying to issue multiple instances of the following multi-op concurrently multi { 1. create sequential node /a- 2. create node /b } The expected result is that only the first multi-op request should succeed and the rest of the requests should fail because /b already exists. However, the observed result is that a subsequent multi-op failed because sequential node creation failed, which should not be possible. Below is the return code for each sub-op when issuing 3 instances of the above multi-op asynchronously 1. ZOK, ZOK 2. ZOK, ZNODEEXISTS, 3. ZNODEEXISTS, ZRUNTIMEINCONSISTENCY, When I added more debug logging, I found that PrepRequestProcessor rolls back outstandingChanges of the second multi-op incorrectly, causing sequential node name generation to be incorrect. Below is the sequential node name generated by PrepRequestProcessor 1. create /a-0001 2. create /a-0003 3. create /a-0001 The bug is in the getPendingChanges() method. It failed to copy the ChangeRecord for the parent node (/), so rollbackPendingChanges() cannot restore the right previous change record of the parent node when aborting the second multi-op. The impact of this bug is that sequential node creation on the same parent node may fail until the previous one is committed. I am not sure if there are other implications or not.
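The rollback bug described above can be sketched with a toy model. All names here are illustrative, not the real ZooKeeper classes: the actual logic lives in PrepRequestProcessor's outstandingChanges handling. The point is that if the snapshot taken at multi-op start omits the parent's ChangeRecord, aborting one multi-op also wipes out the parent cversion bump made by an earlier, still-pending multi-op, so the next sequential name repeats:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of pending-change tracking (hypothetical names, simplified).
public class RollbackSketch {
    // Pending cversion changes for the parent "/", newest last.
    static Deque<Integer> pendingParentChanges = new ArrayDeque<>();
    static int committedParentCversion = 0;

    static int currentCversion() {
        return pendingParentChanges.isEmpty()
                ? committedParentCversion : pendingParentChanges.peekLast();
    }

    // A create-sequential op records a new parent ChangeRecord and derives
    // the node name from the parent's next cversion.
    static String createSequential() {
        int next = currentCversion() + 1;
        pendingParentChanges.addLast(next);
        return String.format("/a-%04d", next);
    }

    // Buggy abort: drops ALL pending parent records, because the copy taken
    // at multi-op start missed the parent's ChangeRecord and so there is
    // nothing to restore the earlier pending state from.
    static void abortBuggy() {
        pendingParentChanges.clear();
    }

    public static void main(String[] args) {
        String n1 = createSequential();   // multi-op 1: /a-0001 (still pending)
        createSequential();               // multi-op 2: /a-0002, then aborts
        abortBuggy();                     // wipes multi-op 1's record too
        String n3 = createSequential();   // multi-op 3: repeats /a-0001
        System.out.println(n1 + " then " + n3);   // "/a-0001 then /a-0001"
    }
}
```

This does not reproduce the exact /a-0003 sequence from the report, but it shows the mechanism: a rollback that cannot restore the parent's previous change record makes sequential name generation go backwards.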
[jira] [Updated] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1621: - Fix Version/s: 3.4.6 ZooKeeper does not recover from crash when disk was full Key: ZOOKEEPER-1621 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3 Environment: Ubuntu 12.04, Amazon EC2 instance Reporter: David Arthur Fix For: 3.4.6 The disk that ZooKeeper was using filled up. During a snapshot write, I got the following exception 2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - Severe unrecoverable error, exiting java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306) at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484) at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162) at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101) Then many subsequent exceptions like: 2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was partial. 
2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected exception, exiting abnormally java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:504) at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130) at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223) at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259) at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386) at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138) at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112) at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86) at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78) It seems to me that writing the transaction log should be fully atomic to avoid such 
situations. Is this not the case?
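The EOFException in the startup trace above is mechanical: the crash left a partially written transaction log, and deserializing its header reads an int from a truncated stream. A minimal illustration of why a partial record surfaces as java.io.EOFException:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class TruncatedReadDemo {
    public static void main(String[] args) throws IOException {
        // A "log" cut off mid-record: only 2 of the 4 bytes of an int made
        // it to disk before the device filled up.
        byte[] partial = {0x00, 0x01};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(partial));
        try {
            in.readInt();   // needs 4 bytes, only 2 available
        } catch (EOFException e) {
            // Same exception class ZooKeeper hits in FileHeader.deserialize
            // when replaying a partial log file after the disk filled up.
            System.out.println("EOFException, as in the trace above");
        }
    }
}
```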
[jira] [Updated] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1621: - Priority: Major (was: Critical)
[jira] [Commented] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13555169#comment-13555169 ] Mahadev konar commented on ZOOKEEPER-1621: -- David, so these exceptions are thrown while ZooKeeper is running? I'm not sure why it's exiting so many times. Do you guys restart the ZK server if it dies?
[jira] [Commented] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13555192#comment-13555192 ] Mahadev konar commented on ZOOKEEPER-1621: -- David, I thought you said it does not recover when the disk was full, but it looks like the disk is still full? No?
[jira] [Resolved] (ZOOKEEPER-1612) Zookeeper unable to recover and start once datadir disk is full and disk space cleared
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar resolved ZOOKEEPER-1612. -- Resolution: Duplicate Duplicate of ZOOKEEPER-1621. Zookeeper unable to recover and start once datadir disk is full and disk space cleared -- Key: ZOOKEEPER-1612 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1612 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.4.3 Reporter: suja s Once zookeeper data dir disk becomes full, the process gets shut down. {noformat} 2012-12-14 13:22:26,959 [myid:2] - ERROR [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@276] - Severe unrecoverable error, exiting java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:282) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) at java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:56) at java.io.DataOutputStream.write(DataOutputStream.java:90) at java.io.FilterOutputStream.write(FilterOutputStream.java:80) at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:119) at org.apache.zookeeper.server.DataNode.serialize(DataNode.java:168) at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1115) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1130) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1130) at org.apache.zookeeper.server.DataTree.serialize(DataTree.java:1179) at org.apache.zookeeper.server.util.SerializeUtils.serializeSnapshot(SerializeUtils.java:138) at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:213) at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:230) at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:242) at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:274) at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:407) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:82) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:759) {noformat} Later disk space is cleared and zk started again. Startup of zk fails as it is not able to read snapshot properly. (Since load from disk failed it is not able to join peers in the quorum and get a snapshot diff) {noformat} 2012-12-14 16:20:31,489 [myid:2] - INFO [main:FileSnap@83] - Reading snapshot ../dataDir/version-2/snapshot.100042 2012-12-14 16:20:31,564 [myid:2] - ERROR [main:QuorumPeer@472] - Unable to load database on disk java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529) at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:504) at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:132) at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223) at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:436) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:428) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:152) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78) 2012-12-14 16:20:31,566 [myid:2] - ERROR [main:QuorumPeerMain@89] -
[jira] [Commented] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13555318#comment-13555318 ] Mahadev konar commented on ZOOKEEPER-1621: -- I'll mark 1612 as a dup. Thanks for pointing that out, Edward.
[jira] [Commented] (ZOOKEEPER-1622) session ids will be negative in the year 2022
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13555698#comment-13555698 ] Mahadev konar commented on ZOOKEEPER-1622: -- Nice catch Eric! I think we do document that the id should be between 0 and 255, but maybe we should error out if that is not the case. session ids will be negative in the year 2022 - Key: ZOOKEEPER-1622 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1622 Project: ZooKeeper Issue Type: Bug Reporter: Eric Newton Priority: Trivial Someone decided to use a large number for their myid file. This caused session ids to go negative, and our software (Apache Accumulo) did not handle this very well. While diagnosing the problem, I noticed this in SessionImpl: {noformat} public static long initializeNextSession(long id) { long nextSid = 0; nextSid = (System.currentTimeMillis() << 24) >> 8; nextSid = nextSid | (id << 56); return nextSid; } {noformat} When the 40th bit in System.currentTimeMillis() is a one, sign extension will fill the upper 8 bits of nextSid, and id will not make the session id unique. I recommend changing the arithmetic right shift to the logical shift: {noformat} public static long initializeNextSession(long id) { long nextSid = 0; nextSid = (System.currentTimeMillis() << 24) >>> 8; nextSid = nextSid | (id << 56); return nextSid; } {noformat} But, we have until the year 2022 before we have to worry about it.
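Eric's point is easy to verify. Once bit 39 of the timestamp is set (which, for wall-clock millis, happens again starting around 2022), the left shift by 24 puts it in the sign bit, and the arithmetic shift `>>` then drags ones into the top byte, so OR-ing in the server id no longer distinguishes servers and the session id goes negative. The logical shift `>>>` leaves the top byte zero for the id:

```java
public class SessionIdShiftDemo {
    public static void main(String[] args) {
        long ms = 1650000000000L;   // a timestamp in April 2022: bit 39 is set
        long id = 1;                // server id from the myid file

        long arithmetic = ((ms << 24) >> 8)  | (id << 56);   // buggy version
        long logical    = ((ms << 24) >>> 8) | (id << 56);   // proposed fix

        System.out.println(arithmetic < 0);   // true: sign extension filled the top byte
        System.out.println(logical >>> 56);   // 1: the server id survives in the top byte
    }
}
```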
[jira] [Commented] (ZOOKEEPER-1147) Add support for local sessions
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13554819#comment-13554819 ] Mahadev konar commented on ZOOKEEPER-1147: -- [~thawan] this helps. Thanks for the information. I still have a couple more questions: - Will a read-only client always get a session expiration if a disconnect happens even though it hasn't tried all the other servers? - Is the local session id the same as the global session id when it's created (I mean as the long value)? If it's the same, I think we have a problem with the shifting of clients between servers. bq. When a client reconnects to B, its sessionId won't exist in B's local session tracker. So B will send a validation packet. If the CreateSession issued by A is committed before the validation packet arrives, the client will be able to connect. Otherwise, the client will get session expired because the quorum doesn't know about this session yet. If the client also tries to connect back to A again, the session is already removed from the local session tracker. So A will need to send a validation packet to the leader. The outcome should be the same as with B, depending on the timing of the request. Add support for local sessions -- Key: ZOOKEEPER-1147 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1147 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.3.3 Reporter: Vishal Kathuria Assignee: Thawan Kooburat Labels: api-change, scaling Fix For: 3.5.0 Attachments: ZOOKEEPER-1147.patch Original Estimate: 840h Remaining Estimate: 840h This improvement is in the bucket of making ZooKeeper work at a large scale. We are planning on having about 1 million clients connect to a ZooKeeper ensemble through a set of 50-100 observers. The majority of these clients are read-only, i.e., they do not do any updates or create ephemeral nodes. In ZooKeeper today, the client creates a session and the session creation is handled like any other update.
In the above use case, the session create/drop workload can easily overwhelm an ensemble. The following is a proposal for a local session, to support a larger number of connections. 1. The idea is to introduce a new type of session - a local session. A local session doesn't have the full functionality of a normal session. 2. Local sessions cannot create ephemeral nodes. 3. Once a local session is lost, you cannot re-establish it using the session-id/password. The session and its watches are gone for good. 4. When a local session connects, the session info is only maintained on the ZooKeeper server (in this case, an observer) that it is connected to. The leader is not aware of the creation of such a session and there is no state written to disk. 5. Pings and expiration are handled by the server that the session is connected to. With the above changes, we can make ZooKeeper scale to a much larger number of clients without making the core ensemble a bottleneck. In terms of API, there are two options being considered: 1. Let the client specify at connect time which kind of session they want. 2. All sessions connect as local sessions and automatically get promoted to global sessions when they do an operation that requires a global session (e.g. creating an ephemeral node). Chubby took the approach of lazily promoting all sessions to global, but I don't think that would work in our case, where we want to keep sessions which never create ephemeral nodes always local. Option 2 would make it more broadly usable, but option 1 would be easier to implement. We are thinking of implementing option 1 as the first cut. There would be a client flag, IsLocalSession (much like the current readOnly flag), that would be used to determine whether to create a local session or a global session. -- This message is automatically generated by JIRA.
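Option 2 above (transparent promotion) can be sketched as a thin client-side state machine. All names here are hypothetical, not from the actual patch: the idea is simply that a session stays local until the first operation that requires quorum-visible state, then upgrades:

```java
// Hypothetical sketch of option 2: sessions start local and are promoted
// to global on the first operation that needs the leader to know them.
public class SessionSketch {
    enum Type { LOCAL, GLOBAL }
    private Type type = Type.LOCAL;

    Type type() { return type; }

    // Reads and pings are served entirely by the connected server.
    void read(String path) { /* no promotion needed */ }

    // Ephemeral creates require the session to exist in the quorum.
    void createEphemeral(String path) {
        if (type == Type.LOCAL) {
            promote();
        }
        // ... then issue the create through the quorum ...
    }

    private void promote() {
        // Would submit a createSession through the leader so the session
        // is written to the quorum log, then flip the local flag.
        type = Type.GLOBAL;
    }

    public static void main(String[] args) {
        SessionSketch s = new SessionSketch();
        s.read("/config");
        System.out.println(s.type());        // LOCAL
        s.createEphemeral("/locks/l-1");
        System.out.println(s.type());        // GLOBAL
    }
}
```

This also makes concrete why option 2 is harder: promotion is a write through the leader, with the same ordering and failure cases the comment thread above discusses for validation packets.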
[jira] [Updated] (ZOOKEEPER-1549) Data inconsistency when follower is receiving a DIFF with a dirty snapshot
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1549: - Fix Version/s: 3.4.6 Data inconsistency when follower is receiving a DIFF with a dirty snapshot -- Key: ZOOKEEPER-1549 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1549 Project: ZooKeeper Issue Type: Bug Components: quorum Affects Versions: 3.4.3 Reporter: Jacky007 Priority: Blocker Fix For: 3.4.6 Attachments: case.patch, ZOOKEEPER-1549-learner.patch The trunc code (from ZOOKEEPER-1154?) cannot work correctly if the snapshot is not correct. Here is the scenario (similar to 1154): Initial Condition 1. Let's say there are three nodes in the ensemble A, B, C, with A being the leader. 2. The current epoch is 7. 3. For simplicity of the example, let's say the zxid is a two-digit number, with the epoch being the first digit. 4. The zxid is 73. 5. All the nodes have seen the change 73 and have persistently logged it. Step 1 A request with zxid 74 is issued. The leader A writes it to the log, but there is a crash of the entire ensemble and B, C never write the change 74 to their log. Step 2 A, B restart, A is elected as the new leader, and A will load data and take a clean snapshot (change 74 is in it), then send a diff to B, but B died before syncing with A. A died later. Step 3 B, C restart; A is still down. B, C form the quorum and B is the new leader. Let's say B's minCommitLog is 71 and maxCommitLog is 73; the epoch is now 8, the zxid is 80. A request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81. Step 4 A starts up. It applies the change in the request with zxid 74 to its in-memory data tree. A contacts B to registerAsFollower and provides 74 as its ZxId. Since 71 <= 74 <= 81, B decides to send A the diff. Problem: The problem with the above sequence is that after truncating the log, A will load the snapshot again, which is not correct.
In the 3.3 branch, FileTxnSnapLog.restore does not call the listener (ZOOKEEPER-874), so the leader will send a full snapshot to the follower and this is not a problem. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
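The decision B makes in Step 4 can be sketched as a toy model (the class, method, and outcome names below are illustrative, not the actual LearnerHandler code, which also handles TRUNC-then-DIFF for a peer with extra uncommitted txns):

```java
// Simplified model of the leader's sync decision when a follower registers.
// SyncDecision, decide(), and the string outcomes are hypothetical names.
public class SyncDecision {
    public static String decide(long peerLastZxid, long minCommitLog, long maxCommitLog) {
        if (peerLastZxid < minCommitLog) {
            return "SNAP";   // peer is too far behind: send a full snapshot
        }
        if (peerLastZxid > maxCommitLog) {
            return "TRUNC";  // peer is ahead of the quorum: truncate its log
        }
        return "DIFF";       // in range: send only the missing txns
    }

    public static void main(String[] args) {
        // Step 4 of the scenario: A reports zxid 74, B's committed log spans 71..81.
        System.out.println(decide(74, 71, 81));  // prints DIFF
    }
}
```

With A's zxid inside B's committed-log range, the model chooses DIFF; the bug is that the truncation on A's side is followed by restoring from a snapshot that already contains the uncommitted change 74.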
[jira] [Updated] (ZOOKEEPER-1549) Data inconsistency when follower is receiving a DIFF with a dirty snapshot
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1549: - Assignee: Thawan Kooburat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1549) Data inconsistency when follower is receiving a DIFF with a dirty snapshot
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553574#comment-13553574 ] Mahadev konar commented on ZOOKEEPER-1549: -- Thanks [~thawan]! -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1603) StaticHostProviderTest testUpdateClientMigrateOrNot hangs
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13536229#comment-13536229 ] Mahadev konar commented on ZOOKEEPER-1603: -- Pat, Not sure why we had this. Seems like an oversight. StaticHostProviderTest testUpdateClientMigrateOrNot hangs - Key: ZOOKEEPER-1603 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1603 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.5.0 Reporter: Patrick Hunt Assignee: Alexander Shraer Priority: Blocker Fix For: 3.5.0 Attachments: ZOOKEEPER-1603-ver1.patch, ZOOKEEPER-1603-ver2.patch StaticHostProviderTest method testUpdateClientMigrateOrNot hangs forever. On my laptop getHostName for 10.10.10.* takes 5+ seconds per call. As a result this method effectively runs forever. Every time I run this test it hangs. Consistent. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1504) Multi-thread NIOServerCnxn
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13534246#comment-13534246 ] Mahadev konar commented on ZOOKEEPER-1504: -- Pat, Makes sense. We can do it in a separate jira. Multi-thread NIOServerCnxn -- Key: ZOOKEEPER-1504 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1504 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.3, 3.4.4, 3.5.0 Reporter: Jay Shrauner Assignee: Jay Shrauner Labels: performance, scaling Fix For: 3.5.0 Attachments: ZOOKEEPER-1504.patch, ZOOKEEPER-1504.patch, ZOOKEEPER-1504.patch, ZOOKEEPER-1504.patch, ZOOKEEPER-1504.patch
NIOServerCnxnFactory is single threaded, which doesn't scale well to large numbers of clients. This is particularly noticeable when thousands of clients connect. I propose multi-threading this code as follows:
- 1 acceptor thread, for accepting new connections
- 1-N selector threads
- 0-M I/O worker threads
The numbers of threads are configurable, with defaults scaling according to the number of cores. Communication with the selector threads is handled via LinkedBlockingQueues, and connections are permanently assigned to a particular selector thread so that all potentially blocking SelectionKey operations can be performed solely by that selector thread. An ExecutorService is used for the worker threads. On a 32-core machine running Linux 2.6.38, we achieved the best performance with 4 selector threads and 64 worker threads, for a 70% +/- 5% improvement in throughput.
This patch incorporates and supersedes the patches for https://issues.apache.org/jira/browse/ZOOKEEPER-517 https://issues.apache.org/jira/browse/ZOOKEEPER-1444 New classes introduced in this patch are:
- ExpiryQueue (from ZOOKEEPER-1444): factors out the logic from SessionTrackerImpl used to expire sessions so that the same logic can be used to expire connections
- RateLogger (from ZOOKEEPER-517): rate-limits error message logging; currently only used to throttle the rate of logging out-of-file-descriptors errors
- WorkerService (also in ZOOKEEPER-1505): an ExecutorService wrapper that makes worker threads daemon threads and names them in an easily debuggable manner. Supports assignable threads (as used by CommitProcessor) and non-assignable threads (as used here).
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
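The 1-acceptor / N-selector / M-worker layout described above can be modeled without real sockets. In this sketch, integer ids stand in for connections, and all class and method names are hypothetical (not the patch's actual classes); the point is the permanent connection-to-selector assignment via per-selector LinkedBlockingQueues and the handoff of I/O work to an ExecutorService:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the proposed NIOServerCnxnFactory threading; names are illustrative.
public class MultiThreadModel {
    public static int run(int connections, int selectors, int workers) {
        List<LinkedBlockingQueue<Integer>> queues = new ArrayList<>();
        List<Thread> selectorThreads = new ArrayList<>();
        ExecutorService workerPool = Executors.newFixedThreadPool(workers);
        AtomicInteger handled = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(connections);

        for (int i = 0; i < selectors; i++) {
            LinkedBlockingQueue<Integer> q = new LinkedBlockingQueue<>();
            queues.add(q);
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        int conn = q.take();
                        if (conn < 0) return;  // poison pill: selector shuts down
                        // Hand the (simulated) I/O off to a worker thread.
                        workerPool.submit(() -> {
                            handled.incrementAndGet();
                            done.countDown();
                        });
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            selectorThreads.add(t);
        }

        // Acceptor: each connection is permanently assigned to one selector,
        // so all SelectionKey operations for it stay on a single thread.
        for (int c = 0; c < connections; c++) {
            queues.get(c % selectors).offer(c);
        }

        try {
            done.await();
            for (LinkedBlockingQueue<Integer> q : queues) q.offer(-1);
            for (Thread t : selectorThreads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        workerPool.shutdown();
        return handled.get();
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 4, 8));  // prints 1000
    }
}
```

The modulo assignment is just one possible placement policy; the key property is that a connection never migrates between selectors.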
[jira] [Updated] (ZOOKEEPER-575) remove System.exit calls to make the server more container friendly
[ https://issues.apache.org/jira/browse/ZOOKEEPER-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-575: Attachment: ZOOKEEPER-575_4.patch Updated the patch for trunk. This would really be nice to get in and would make it cleaner to embed ZK. remove System.exit calls to make the server more container friendly --- Key: ZOOKEEPER-575 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-575 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.0 Reporter: Patrick Hunt Assignee: Andrew Finnell Fix For: 3.5.0 Attachments: ZOOKEEPER-575-2.patch, ZOOKEEPER-575-3.patch, ZOOKEEPER-575_4.patch, ZOOKEEPER-575.patch There are a handful of places left in the code that still use System.exit; we should remove these to make the server more container friendly. There are some legitimate places for the exits - in *Main.java for example should be fine - these are the command line main routines. Containers should be embedding code that runs just below this layer (or we should refactor so that it would). The tricky bit is ensuring the server shuts down in case of an unrecoverable error occurring; afaik these are the locations where we still have sys exit calls. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1335) Add support for --config to zkEnv.sh to specify a config directory different than what is expected
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533666#comment-13533666 ] Mahadev konar commented on ZOOKEEPER-1335: -- +1 for the patch. Looks good to me. Pat, it doesn't look like we have much documentation in Forrest for zkServer.sh, so I don't think we need any Forrest docs update. Add support for --config to zkEnv.sh to specify a config directory different than what is expected -- Key: ZOOKEEPER-1335 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1335 Project: ZooKeeper Issue Type: Improvement Reporter: Arpit Gupta Assignee: Arpit Gupta Fix For: 3.5.0 Attachments: ZOOKEEPER-1335.patch, ZOOKEEPER-1335.patch zkEnv.sh expects the ZOOCFGDIR env variable to be set. If not, it looks for the conf dir in the ZOOKEEPER_PREFIX dir or in /etc/zookeeper. It would be great if we can support a --config option where at run time you could specify a different config directory. We do the same thing in Hadoop. With this you should be able to do /usr/sbin/zkServer.sh --config /some/conf/dir start|stop -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1593) Add Debian style /etc/default/zookeeper support to init script
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533673#comment-13533673 ] Mahadev konar commented on ZOOKEEPER-1593: -- Michi/Dirkjan, Unfortunately these package files are mostly unused and we probably should be getting rid of them given BigTop is doing all the packaging work. Dirkjan, are you using the packaging in production? Do you think BigTop packaging might be of help to you? Add Debian style /etc/default/zookeeper support to init script -- Key: ZOOKEEPER-1593 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1593 Project: ZooKeeper Issue Type: Improvement Components: scripts Affects Versions: 3.4.5 Environment: Debian Linux 6.0 Reporter: Dirkjan Bussink Priority: Minor Attachments: zookeeper_debian_default.patch In our configuration we use a different data directory for ZooKeeper. The problem is that the current Debian init.d script has the default location hardcoded: ZOOPIDDIR=/var/lib/zookeeper/data ZOOPIDFILE=${ZOOPIDDIR}/zookeeper_server.pid By using the standard Debian practice of allowing for a /etc/default/zookeeper, we can redefine these variables to point to the correct location: ZOOPIDDIR=/var/lib/zookeeper/data ZOOPIDFILE=${ZOOPIDDIR}/zookeeper_server.pid [ -r /etc/default/zookeeper ] && . /etc/default/zookeeper This currently can't be done through /usr/libexec/zkEnv.sh, since that is loaded before ZOOPIDDIR and ZOOPIDFILE are set. Any change there would therefore undo the setup made in, for example, /etc/zookeeper/zookeeper-env.sh. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1488) Some links are not working in the Zookeeper Documentation
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533674#comment-13533674 ] Mahadev konar commented on ZOOKEEPER-1488: -- bq. By the way, I have just seen that the PDF generated in the in the docs section still has a 2008 copyright notice (Copyright © 2008 The Apache Software Foundation. All rights reserved). Should I open a ticket to update this? Or may I try to include in this patch? Thanks for pointing that out Edward. Please open a jira for that. Some links are not working in the Zookeeper Documentation - Key: ZOOKEEPER-1488 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1488 Project: ZooKeeper Issue Type: Bug Components: documentation Affects Versions: 3.4.3 Reporter: Kiran BC Assignee: Edward Ribeiro Priority: Minor Attachments: ZOOKEEPER-1488.patch, ZOOKEEPER-1488.patch There are some internal link errors in the Zookeeper documentation. The list is as follows: docs\zookeeperAdmin.html - tickTime and datadir docs\zookeeperOver.html - fg_zkComponents, fg_zkPerfReliability and fg_zkPerfRW docs\zookeeperStarted.html - Logging -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1552) Enable sync request processor in Observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533676#comment-13533676 ] Mahadev konar commented on ZOOKEEPER-1552: -- Thawan, This is a good idea. As for the patch, I think we have too many system properties spread around in the source code. It's best if we can use the ZooKeeper config file for this. What do others think? Enable sync request processor in Observer - Key: ZOOKEEPER-1552 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1552 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.3 Reporter: Thawan Kooburat Assignee: Thawan Kooburat Fix For: 3.5.0 Attachments: ZOOKEEPER-1552.patch, ZOOKEEPER-1552.patch Observer doesn't forward its txns to SyncRequestProcessor, so it never persists the txns onto disk or periodically creates snapshots. This increases the start-up time since it will get the entire snapshot if the observer has been running for a long time. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (ZOOKEEPER-1552) Enable sync request processor in Observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533676#comment-13533676 ] Mahadev konar edited comment on ZOOKEEPER-1552 at 12/17/12 6:33 AM: Thawan, This is a good idea. As for the patch, I think we have too many system properties spread around in the source code. Its best if we can use the ZooKeeper config file for this. What do others think? Other than that, the patch looks good. was (Author: mahadev): Thawan, This is a good idea. As for the patch, I think we have too many system properties spread around in the source code. Its best if we can use the ZooKeeper config file for this. What do others think? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1480) ClientCnxn(1161) can't get the current zk server add, so that - Session 0x for server null, unexpected error
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533678#comment-13533678 ] Mahadev konar commented on ZOOKEEPER-1480: -- Hey Leader, There are quite a few Chinese characters in the patch. Can you please remove those? Also, can you please create a patch against trunk? Thanks ClientCnxn(1161) can't get the current zk server add, so that - Session 0x for server null, unexpected error Key: ZOOKEEPER-1480 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1480 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.4.3 Reporter: Leader Ni Assignee: Leader Ni Labels: client, getCurrentZooKeeperAddr Fix For: 3.5.0 Attachments: getCurrentZooKeeperAddr_for_3.4.3.patch, getCurrentZooKeeperAddr_for_branch3.4.patch When ZooKeeper hits an unexpected error (not SessionExpiredException, SessionTimeoutException, or EndOfStreamException), ClientCnxn(1161) will log something in the format Session 0x for server null, unexpected error, closing socket connection and attempting reconnect. The log is at line 1161 in zookeeper-3.3.3. We found that ZooKeeper uses ((SocketChannel)sockKey.channel()).socket().getRemoteSocketAddress() to get the ZooKeeper addr. But sometimes it logs Session 0x for server null; if it logs null, the developer can't determine the current ZooKeeper addr that the client is connected or connecting to. I added a method in class SendThread: InetSocketAddress org.apache.zookeeper.ClientCnxn.SendThread.getCurrentZooKeeperAddr(). Here: /** * Returns the address to which the socket is connected. * * @return ip address of the remote side of the connection or null if not * connected */ @Override SocketAddress getRemoteSocketAddress() { // a lot could go wrong here, so rather than put in a bunch of code // to check for nulls all down the chain let's do it the simple // yet bulletproof way . -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
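The idea behind the report above is that the channel can no longer report its remote address once it is closed, so the client should cache the address it was connecting to and fall back to it when logging. A minimal sketch of that fallback, with illustrative names (this is not the actual ClientCnxn patch):

```java
import java.net.InetSocketAddress;
import java.net.SocketAddress;

// Hypothetical helper showing the fallback order for the "server null" log:
// live channel address first, then the cached connect target, then "null".
public class ConnTracker {
    private volatile InetSocketAddress currentAddr;       // address we are connecting to
    private volatile SocketAddress remoteFromChannel;     // null once the channel is closed

    public void onConnecting(InetSocketAddress addr) {
        currentAddr = addr;  // remember the target before the channel even exists
    }

    public String serverForLog() {
        SocketAddress remote = remoteFromChannel;
        if (remote != null) return remote.toString();     // channel still knows its peer
        if (currentAddr != null) return currentAddr.toString();  // cached target
        return "null";                                    // the old behavior
    }

    public static void main(String[] args) {
        ConnTracker t = new ConnTracker();
        System.out.println(t.serverForLog());  // prints null: nothing known yet
        t.onConnecting(new InetSocketAddress("10.0.0.5", 2181));
        System.out.println(t.serverForLog());  // cached target, even with no channel
    }
}
```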
[jira] [Commented] (ZOOKEEPER-1504) Multi-thread NIOServerCnxn
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533687#comment-13533687 ] Mahadev konar commented on ZOOKEEPER-1504: -- Thawan, I was looking at the patch and it looks like you always have one acceptor thread. Is one acceptor thread enough when we have 1000's of immediate connections to the ZK servers in case of bootstrap or network glitches? Did you never see an issue with this? Read through the patch as well. Looks good to me otherwise. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1569) support upsert: setData if the node exists, otherwise, create a new node
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533692#comment-13533692 ] Mahadev konar commented on ZOOKEEPER-1569: -- Jimmy, Can you please explain the semantics of such an operation? What would a return value be? When would this operation fail? When would it succeed? support upsert: setData if the node exists, otherwise, create a new node Key: ZOOKEEPER-1569 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1569 Project: ZooKeeper Issue Type: Improvement Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: zk-1569.patch, zk-1569_v1.1.patch, zk-1569_v2.patch Currently, ZooKeeper supports setData and create. If it can support upsert like in SQL, it will be great. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
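Mahadev's questions about semantics can be made concrete with a toy upsert over a plain map standing in for the data tree. The Outcome return value and all names here are assumptions for illustration, not a proposed API; a real client-side emulation would also have to loop, since setData can race with another client's create or delete:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of "upsert" semantics: setData if the node exists, else create.
// The map stands in for the data tree; Outcome is a hypothetical return value
// answering "did we create or update?".
public class Upsert {
    public enum Outcome { CREATED, UPDATED }

    public static Outcome upsert(Map<String, String> tree, String path, String data) {
        // Against a real server this would be a retry loop:
        // setData -> NoNodeException -> create -> NodeExistsException -> setData ...
        if (tree.containsKey(path)) {
            tree.put(path, data);
            return Outcome.UPDATED;
        }
        tree.put(path, data);
        return Outcome.CREATED;
    }

    public static void main(String[] args) {
        Map<String, String> tree = new HashMap<>();
        System.out.println(upsert(tree, "/a", "v1"));  // prints CREATED
        System.out.println(upsert(tree, "/a", "v2"));  // prints UPDATED
    }
}
```

The open questions in the comment map directly onto this model: should the caller see CREATED vs. UPDATED, which version number is returned, and what happens if a version check fails mid-loop.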
[jira] [Commented] (ZOOKEEPER-1578) org.apache.zookeeper.server.quorum.Zab1_0Test failed due to hard code with 33556 port
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533695#comment-13533695 ] Mahadev konar commented on ZOOKEEPER-1578: -- +1 the patch looks good. org.apache.zookeeper.server.quorum.Zab1_0Test failed due to hard code with 33556 port - Key: ZOOKEEPER-1578 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1578 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.4.3 Reporter: Li Ping Zhang Assignee: Li Ping Zhang Labels: patch Attachments: ZOOKEEPER-1578-branch-3.4.patch, ZOOKEEPER-1578-trunk.patch Original Estimate: 24h Remaining Estimate: 24h org.apache.zookeeper.server.quorum.Zab1_0Test failed with both the Sun JDK and OpenJDK. [junit] Running org.apache.zookeeper.server.quorum.Zab1_0Test [junit] Tests run: 8, Failures: 0, Errors: 1, Time elapsed: 18.334 sec [junit] Test org.apache.zookeeper.server.quorum.Zab1_0Test FAILED Zab1_0Test log: 2012-07-11 23:17:15,579 [myid:] - INFO [main:Leader@427] - Shutdown called java.lang.Exception: shutdown Leader!
reason: end of test at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:427) at org.apache.zookeeper.server.quorum.Zab1_0Test.testLastAcceptedEpoch(Zab1_0Test.java:211) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:48) 2012-07-11 23:17:15,584 [myid:] - ERROR [main:Leader@139] - Couldn't bind to port 33556 java.net.BindException: Address already in use at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:402) at java.net.ServerSocket.bind(ServerSocket.java:328) at java.net.ServerSocket.bind(ServerSocket.java:286) at org.apache.zookeeper.server.quorum.Leader.init(Leader.java:137) at org.apache.zookeeper.server.quorum.Zab1_0Test.createLeader(Zab1_0Test.java:810) at org.apache.zookeeper.server.quorum.Zab1_0Test.testLeaderInElectingFollowers(Zab1_0Test.java:224) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2012-07-11 23:17:20,202 [myid:] - ERROR [LearnerHandler-bdvm039.svl.ibm.com/9.30.122.48:40153:LearnerHandler@559] - Unexpected exception causing shutdown while sock still open java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:291) 2012-07-11 23:17:20,203 [myid:] - WARN [LearnerHandler-bdvm039.svl.ibm.com/9.30.122.48:40153:LearnerHandler@569] - *** GOODBYE bdvm039.svl.ibm.com/9.30.122.48:40153 2012-07-11 23:17:20,204 [myid:] - INFO [Thread-20:Leader@421] - Shutting down 2012-07-11 23:17:20,204 [myid:] - INFO [Thread-20:Leader@427] - Shutdown called java.lang.Exception: shutdown Leader! reason: lead ended This failure suggests that port 33556 is already in use, but checking with a command shows it is in fact not in use. The port is hard-coded in the unit test; we can improve it with a code patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
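A common way to remove a hard-coded port like 33556 from a test is to bind to port 0 and let the OS pick a free ephemeral port. A sketch of that technique, not the actual ZOOKEEPER-1578 patch:

```java
import java.io.IOException;
import java.net.ServerSocket;

// Bind to port 0 so the OS assigns a currently free ephemeral port;
// the test then uses that port instead of a fixed one.
public class FreePort {
    public static int pickFreePort() {
        try (ServerSocket s = new ServerSocket(0)) {
            s.setReuseAddress(true);        // let the test re-bind it right away
            return s.getLocalPort();        // the OS-assigned port number
        } catch (IOException e) {
            return -1;                      // could not bind at all (e.g. sandboxed env)
        }
    }

    public static void main(String[] args) {
        System.out.println(pickFreePort());
    }
}
```

Note the small race: the port is free when the probe socket closes, but another process could grab it before the test binds it, which setReuseAddress mitigates but does not eliminate.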
[jira] [Commented] (ZOOKEEPER-1574) mismatched CR/LF endings in text files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533708#comment-13533708 ] Mahadev konar commented on ZOOKEEPER-1574: -- Nikita/Raja, So we can just do a prop set and commit then? I tried this: find * | grep java$ | xargs svn propset -R svn:eol-style native and it's only changing the properties. Is this all we need to do on 3.4 and trunk? This is definitely better than committing the diff. mismatched CR/LF endings in text files -- Key: ZOOKEEPER-1574 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1574 Project: ZooKeeper Issue Type: Bug Reporter: Raja Aluri Assignee: Raja Aluri Attachments: ZOOKEEPER-1574.branch-3.4.patch, ZOOKEEPER-1574.trunk.patch Source code in the zookeeper repo has a bunch of files that have CRLF endings. With more development happening on Windows there is a higher chance of more CRLF files getting into the source tree. I would like to avoid that by creating a .gitattributes file which prevents sources from having CRLF entries in text files. But before adding the .gitattributes file we need to normalize the existing tree, so that when people sync after the .gitattributes change they won't end up with a bunch of modified files in their workspace. I am adding a couple of links here to give more of a primer on what exactly the issue is and how we are trying to fix it. [http://git-scm.com/docs/gitattributes#_checking_out_and_checking_in] [http://stackoverflow.com/questions/170961/whats-the-best-crlf-handling-strategy-with-git] I will submit a separate bug and patch for .gitattributes -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (ZOOKEEPER-1572) Add an async interface for multi request
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1572: - Fix Version/s: (was: 3.4.6) Add an async interface for multi request Key: ZOOKEEPER-1572 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1572 Project: ZooKeeper Issue Type: Improvement Components: java client Reporter: Sijie Guo Assignee: Sijie Guo Fix For: 3.5.0 Attachments: ZOOKEEPER-1572.diff, ZOOKEEPER-1572.diff Currently there is no async interface for multi request in ZooKeeper java client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1572) Add an async interface for multi request
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533710#comment-13533710 ] Mahadev konar commented on ZOOKEEPER-1572: -- Removing it from the 3.4 branch. We shouldn't commit new features in the 3.4 branch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1572) Add an async interface for multi request
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533712#comment-13533712 ] Mahadev konar commented on ZOOKEEPER-1572: -- Flavio/Sijie, I am taking a look at this. Might need a day or 2 (maximum until Tuesday) to review this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1557) jenkins jdk7 test failure in testBadSaslAuthNotifiesWatch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13471676#comment-13471676 ] Mahadev konar commented on ZOOKEEPER-1557: -- Thanks Eugene .. Interesting jenkins jdk7 test failure in testBadSaslAuthNotifiesWatch - Key: ZOOKEEPER-1557 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1557 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.5.0, 3.4.5 Reporter: Patrick Hunt Assignee: Eugene Koontz Fix For: 3.5.0, 3.4.6 Attachments: jstack.out, SaslAuthFailTest.log, ZOOKEEPER-1557.patch Failure of testBadSaslAuthNotifiesWatch on the jenkins jdk7 job: https://builds.apache.org/job/ZooKeeper-trunk-jdk7/407/ haven't seen this before. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (ZOOKEEPER-1557) jenkins jdk7 test failure in testBadSaslAuthNotifiesWatch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1557: - Fix Version/s: (was: 3.4.5) 3.4.6
[jira] [Comment Edited] (ZOOKEEPER-1557) jenkins jdk7 test failure in testBadSaslAuthNotifiesWatch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13471403#comment-13471403 ] Mahadev konar edited comment on ZOOKEEPER-1557 at 10/8/12 5:04 AM: --- Thanks Eugene for taking a look at it. Given your analysis above, it doesn't look like we fully understand what's causing the issue. Given that this is not SASL related and could be related to how our test framework runs, I think we can move this out to 3.4.6 and get 3.4.5 out the door with what we have now. What do you think? was (Author: mahadev): Thanks Eugene for taking a look at it. Given your any analysis above it doesnt look like we have a full knowledge of whats causing the issue. Given that this is not SASL related and could be related to how our test framework runs, I think we can move this out to 3.4.6 and get 3.4.5 out the door with what we have now. What do you think?
[jira] [Updated] (ZOOKEEPER-1477) Test failures with Java 7 on Mac OS X
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1477: - Priority: Major (was: Blocker) Downgrading to Major given the recent updates on this jira. Test failures with Java 7 on Mac OS X - Key: ZOOKEEPER-1477 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1477 Project: ZooKeeper Issue Type: Bug Components: server, tests Affects Versions: 3.4.3 Environment: Mac OS X Lion (10.7.4) Java version: java version 1.7.0_04 Java(TM) SE Runtime Environment (build 1.7.0_04-b21) Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode) Reporter: Diwaker Gupta Fix For: 3.4.6 Attachments: with-ZK-1550.txt I downloaded ZK 3.4.3 sources and ran {{ant test}}. Many of the tests failed, including ZooKeeperTest. A common symptom was spurious {{ConnectionLossException}}: {code} 2012-06-01 12:01:23,420 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@54] - TEST METHOD FAILED testDeleteRecursiveAsync org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for / at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1246) at org.apache.zookeeper.ZooKeeperTest.testDeleteRecursiveAsync(ZooKeeperTest.java:77) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ... (snipped) {code} As background, I was actually investigating some non-deterministic failures when using Netflix's Curator with Java 7 (see https://github.com/Netflix/curator/issues/79). After a while, I figured I should establish a clean ZK baseline first and realized it is actually a ZK issue, not a Curator issue. We are trying to migrate to Java 7 but this is a blocking issue for us right now. -- This message is automatically generated by JIRA. 
[jira] [Commented] (ZOOKEEPER-1477) Test failures with Java 7 on Mac OS X
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465077#comment-13465077 ] Mahadev konar commented on ZOOKEEPER-1477: -- Diwaker, Would you be able to run the tests along with Eugene's patch on ZOOKEEPER-1550? If not, please let me know and I can go ahead and run them.
[jira] [Commented] (ZOOKEEPER-1477) Test failures with Java 7 on Mac OS X
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465097#comment-13465097 ] Mahadev konar commented on ZOOKEEPER-1477: -- Thanks Diwaker. Could you please upload a summary of the failing tests, along with the logs?
[jira] [Commented] (ZOOKEEPER-1477) Test failures with Java 7 on Mac OS X
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465108#comment-13465108 ] Mahadev konar commented on ZOOKEEPER-1477: -- Diwaker, The usual run time on a Linux box is around 40 minutes or so.
[jira] [Commented] (ZOOKEEPER-1477) Test failures with Java 7 on Mac OS X
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465243#comment-13465243 ] Mahadev konar commented on ZOOKEEPER-1477: -- That's fine, Diwaker. I'll downgrade this jira to Major and mark it for the next release. We can ship 3.4.5 with the fix for ZOOKEEPER-1550. It would be good to upload the logs for the failing tests, but it's not urgent; we can do it later for 3.4.6. Thanks.
[jira] [Updated] (ZOOKEEPER-1550) ZooKeeperSaslClient does not finish anonymous login on OpenJDK
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1550: - Fix Version/s: 3.4.5 ZooKeeperSaslClient does not finish anonymous login on OpenJDK -- Key: ZOOKEEPER-1550 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1550 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.4.4 Reporter: Robert Macomber Fix For: 3.4.5 On OpenJDK, {{javax.security.auth.login.Configuration.getConfiguration}} does not throw an exception. {{ZooKeeperSaslClient.clientTunneledAuthenticationInProgress}} uses an exception from that method as a proxy for this client is not configured to use SASL and as a result no commands can be sent, since it is still waiting for auth to complete. [Link to mailing list discussion|http://comments.gmane.org/gmane.comp.java.zookeeper.user/2667] The relevant bit of logs from OpenJDK and Oracle versions of 'connect and do getChildren(/)': {code:title=OpenJDK} INFO [main] 2012-09-25 14:02:24,545 com.socrata.Main Waiting for connection... DEBUG [main] 2012-09-25 14:02:24,548 com.socrata.zookeeper.ZooKeeperProvider Waiting for connected-state... INFO [main-SendThread(mike.local:2181)] 2012-09-25 14:02:24,576 org.apache.zookeeper.ClientCnxn Opening socket connection to server mike.local/10.0.2.106:2181. 
Will not attempt to authenticate using SASL (unknown error) INFO [main-SendThread(mike.local:2181)] 2012-09-25 14:02:24,584 org.apache.zookeeper.ClientCnxn Socket connection established to mike.local/10.0.2.106:2181, initiating session DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:24,586 org.apache.zookeeper.ClientCnxn Session establishment request sent on mike.local/10.0.2.106:2181 INFO [main-SendThread(mike.local:2181)] 2012-09-25 14:02:24,600 org.apache.zookeeper.ClientCnxn Session establishment complete on server mike.local/10.0.2.106:2181, sessionid = 0x139ff2e85b60005, negotiated timeout = 4 DEBUG [main-EventThread] 2012-09-25 14:02:24,614 com.socrata.zookeeper.ZooKeeperProvider ConnectionStateChanged(Connected) DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:24,636 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:/ serverPath:/ finished:false header:: 0,12 replyHeader:: 0,0,0 request:: '/,F response:: v{} until SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:37,923 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:/ serverPath:/ finished:false header:: 0,12 replyHeader:: 0,0,0 request:: '/,F response:: v{} until SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:37,924 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:/ serverPath:/ finished:false header:: 0,12 replyHeader:: 0,0,0 request:: '/,F response:: v{} until SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:37,924 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:null serverPath:null finished:false header:: -2,11 replyHeader:: null request:: null response:: nulluntil SASL authentication completes. 
DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:51,260 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:/ serverPath:/ finished:false header:: 0,12 replyHeader:: 0,0,0 request:: '/,F response:: v{} until SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:51,260 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:null serverPath:null finished:false header:: -2,11 replyHeader:: null request:: null response:: nulluntil SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:51,261 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:/ serverPath:/ finished:false header:: 0,12 replyHeader:: 0,0,0 request:: '/,F response:: v{} until SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:51,261 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:null serverPath:null finished:false header:: -2,11 replyHeader:: null request:: null response:: nulluntil SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:51,261 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:null serverPath:null finished:false header:: -2,11 replyHeader:: null request:: null response:: nulluntil SASL authentication completes. DEBUG
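In other words, clientTunneledAuthenticationInProgress treats an exception from Configuration.getConfiguration() as its only signal that SASL is not configured; on OpenJDK the call returns a default Configuration instead of throwing, so the client concludes authentication is still pending and defers every packet. The following is a simplified stand-in for that control flow, not the actual ZooKeeperSaslClient source; the ConfigSource type and method names are invented for illustration.

```java
// Simplified stand-in for the control flow described above -- NOT the
// actual ZooKeeperSaslClient source. On Oracle's JDK, asking for a login
// Configuration when none is installed throws SecurityException; on
// OpenJDK it returns a default Configuration instead. Code that uses
// "did it throw?" as its only "no SASL configured" signal therefore
// concludes, on OpenJDK, that SASL auth is forever in progress, and
// ordinary requests stay queued ("deferring non-priming packet ...").
class SaslGuardSketch {
    interface ConfigSource {
        Object getConfiguration();  // may throw SecurityException
    }

    // Buggy shape: exception-as-signal.
    static boolean authInProgress(ConfigSource source) {
        try {
            source.getConfiguration();
            return true;   // config obtained -> assume auth still pending
        } catch (SecurityException e) {
            return false;  // only a throw means "SASL not configured"
        }
    }
}
```

A more robust guard would inspect the configuration for the client login section explicitly rather than relying on the exception, which is roughly the direction the attached patch takes.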
[jira] [Updated] (ZOOKEEPER-1550) ZooKeeperSaslClient does not finish anonymous login on OpenJDK
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1550: - Priority: Blocker (was: Major)
[jira] [Commented] (ZOOKEEPER-1550) ZooKeeperSaslClient does not finish anonymous login on OpenJDK
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13464339#comment-13464339 ] Mahadev konar commented on ZOOKEEPER-1550: -- Thanks Eugene. Robert, can you verify this patch as well? Thanks
[jira] [Commented] (ZOOKEEPER-1550) ZooKeeperSaslClient does not finish anonymous login on OpenJDK
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13464357#comment-13464357 ] Mahadev konar commented on ZOOKEEPER-1550: -- Awesome, Ill check this in and kick of the builds on jdk 7 and see if it all works. ZooKeeperSaslClient does not finish anonymous login on OpenJDK -- Key: ZOOKEEPER-1550 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1550 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.4.4 Reporter: Robert Macomber Assignee: Eugene Koontz Priority: Blocker Fix For: 3.4.5 Attachments: ZOOKEEPER-1550.patch On OpenJDK, {{javax.security.auth.login.Configuration.getConfiguration}} does not throw an exception. {{ZooKeeperSaslClient.clientTunneledAuthenticationInProgress}} uses an exception from that method as a proxy for this client is not configured to use SASL and as a result no commands can be sent, since it is still waiting for auth to complete. [Link to mailing list discussion|http://comments.gmane.org/gmane.comp.java.zookeeper.user/2667] The relevant bit of logs from OpenJDK and Oracle versions of 'connect and do getChildren(/)': {code:title=OpenJDK} INFO [main] 2012-09-25 14:02:24,545 com.socrata.Main Waiting for connection... DEBUG [main] 2012-09-25 14:02:24,548 com.socrata.zookeeper.ZooKeeperProvider Waiting for connected-state... INFO [main-SendThread(mike.local:2181)] 2012-09-25 14:02:24,576 org.apache.zookeeper.ClientCnxn Opening socket connection to server mike.local/10.0.2.106:2181. 
Will not attempt to authenticate using SASL (unknown error) INFO [main-SendThread(mike.local:2181)] 2012-09-25 14:02:24,584 org.apache.zookeeper.ClientCnxn Socket connection established to mike.local/10.0.2.106:2181, initiating session DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:24,586 org.apache.zookeeper.ClientCnxn Session establishment request sent on mike.local/10.0.2.106:2181 INFO [main-SendThread(mike.local:2181)] 2012-09-25 14:02:24,600 org.apache.zookeeper.ClientCnxn Session establishment complete on server mike.local/10.0.2.106:2181, sessionid = 0x139ff2e85b60005, negotiated timeout = 4 DEBUG [main-EventThread] 2012-09-25 14:02:24,614 com.socrata.zookeeper.ZooKeeperProvider ConnectionStateChanged(Connected) DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:24,636 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:/ serverPath:/ finished:false header:: 0,12 replyHeader:: 0,0,0 request:: '/,F response:: v{} until SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:37,923 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:/ serverPath:/ finished:false header:: 0,12 replyHeader:: 0,0,0 request:: '/,F response:: v{} until SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:37,924 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:/ serverPath:/ finished:false header:: 0,12 replyHeader:: 0,0,0 request:: '/,F response:: v{} until SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:37,924 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:null serverPath:null finished:false header:: -2,11 replyHeader:: null request:: null response:: nulluntil SASL authentication completes. 
DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:51,260 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:/ serverPath:/ finished:false header:: 0,12 replyHeader:: 0,0,0 request:: '/,F response:: v{} until SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:51,260 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:null serverPath:null finished:false header:: -2,11 replyHeader:: null request:: null response:: nulluntil SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:51,261 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:/ serverPath:/ finished:false header:: 0,12 replyHeader:: 0,0,0 request:: '/,F response:: v{} until SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:51,261 org.apache.zookeeper.ClientCnxnSocketNIO deferring non-priming packet: clientPath:null serverPath:null finished:false header:: -2,11 replyHeader:: null request:: null response:: nulluntil SASL authentication completes. DEBUG [main-SendThread(mike.local:2181)] 2012-09-25 14:02:51,261
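The exception-as-proxy check described in this issue can be sketched as follows. This is an illustrative Java snippet, not the actual ZooKeeperSaslClient code; {{SaslConfigProbe}} and {{saslConfigured}} are hypothetical names. The point is to inspect the JAAS entries explicitly instead of inferring "no SASL configured" from getConfiguration() throwing, since OpenJDK builds may return a Configuration instead of throwing:

```java
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

// Illustrative sketch only (hypothetical names, not ZooKeeper code).
class SaslConfigProbe {
    static boolean saslConfigured(String clientSection) {
        try {
            Configuration conf = Configuration.getConfiguration();
            AppConfigurationEntry[] entries =
                    conf.getAppConfigurationEntry(clientSection);
            // Check the entries explicitly instead of using an
            // exception as the "not configured" signal.
            return entries != null && entries.length > 0;
        } catch (SecurityException e) {
            // JDKs that throw when no login configuration can be located.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(saslConfigured("Client"));
    }
}
```

With no JAAS login configuration present, both JDK behaviors lead to the same answer: the throwing path is caught and returns false, and the non-throwing path finds no "Client" entry and also returns false.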
[jira] [Commented] (ZOOKEEPER-1550) ZooKeeperSaslClient does not finish anonymous login on OpenJDK
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13464370#comment-13464370 ] Mahadev konar commented on ZOOKEEPER-1550: -- Eugene, looks like the SASL test failed. Can you please take a look?
[jira] [Commented] (ZOOKEEPER-1550) ZooKeeperSaslClient does not finish anonymous login on OpenJDK
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13464387#comment-13464387 ] Mahadev konar commented on ZOOKEEPER-1550: -- Eugene, Still failing :)...
[jira] [Updated] (ZOOKEEPER-1477) Test failures with Java 7 on Mac OS X
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1477: - Priority: Blocker (was: Critical) Test failures with Java 7 on Mac OS X - Key: ZOOKEEPER-1477 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1477 Project: ZooKeeper Issue Type: Bug Components: server, tests Affects Versions: 3.4.3 Environment: Mac OS X Lion (10.7.4) Java version: java version 1.7.0_04 Java(TM) SE Runtime Environment (build 1.7.0_04-b21) Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode) Reporter: Diwaker Gupta Priority: Blocker Fix For: 3.4.5 I downloaded ZK 3.4.3 sources and ran {{ant test}}. Many of the tests failed, including ZooKeeperTest. A common symptom was spurious {{ConnectionLossException}}: {code} 2012-06-01 12:01:23,420 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@54] - TEST METHOD FAILED testDeleteRecursiveAsync org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for / at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1246) at org.apache.zookeeper.ZooKeeperTest.testDeleteRecursiveAsync(ZooKeeperTest.java:77) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ... (snipped) {code} As background, I was actually investigating some non-deterministic failures when using Netflix's Curator with Java 7 (see https://github.com/Netflix/curator/issues/79). After a while, I figured I should establish a clean ZK baseline first and realized it is actually a ZK issue, not a Curator issue. We are trying to migrate to Java 7 but this is a blocking issue for us right now. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1496) Ephemeral node not getting cleared even after client has exited
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456860#comment-13456860 ] Mahadev konar commented on ZOOKEEPER-1496: -- Rakesh, The patch looks good to me. I'll wait for Hudson to check this in. We are good to go for the 3.4 RC now! Thanks Rakesh! Ephemeral node not getting cleared even after client has exited --- Key: ZOOKEEPER-1496 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1496 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3 Reporter: suja s Assignee: Rakesh R Priority: Critical Fix For: 3.4.4, 3.5.0 Attachments: Logs.rar, ZOOKEEPER-1496.1.patch, ZOOKEEPER-1496.2.patch, ZOOKEEPER-1496.3.patch, ZOOKEEPER-1496.patch In one of the tests we performed, we came across a case where an ephemeral node was not getting cleared from ZooKeeper even though the client had exited. ZK version: 3.4.3 Ephemeral node still exists in ZooKeeper: HOST-xx-xx-xx-55:/home/Jun25_LR/install/zookeeper/bin # date Tue Jun 26 16:07:04 IST 2012 HOST-xx-xx-xx-55:/home/Jun25_LR/install/zookeeper/bin # ./zkCli.sh -server xx.xx.xx.55:2182 Connecting to xx.xx.xx.55:2182 Welcome to ZooKeeper! JLine support is enabled [zk: xx.xx.xx.55:2182(CONNECTING) 0] WATCHER:: WatchedEvent state:SyncConnected type:None path:null [zk: xx.xx.xx.55:2182(CONNECTED) 0] get /hadoop-ha/hacluster/ActiveStandbyElectorLock haclusternn2HOSt-xx-xx-xx-102 �� cZxid = 0x20075 ctime = Tue Jun 26 13:10:19 IST 2012 mZxid = 0x20075 mtime = Tue Jun 26 13:10:19 IST 2012 pZxid = 0x20075 cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x1382791d4e50004 dataLength = 42 numChildren = 0 [zk: xx.xx.xx.55:2182(CONNECTED) 1] Grepped logs on the ZK side for session 0x1382791d4e50004 - a close session and a later create, with the create arriving before the close session was processed. 
HOSt-xx-xx-xx-91:/home/Jun25_LR/install/zookeeper/logs # grep -E /hadoop-ha/hacluster/ActiveStandbyElectorLock|0x1382791d4e50004 *|grep 0x20074 2012-06-26 13:10:18,834 [myid:3] - DEBUG [ProcessThread(sid:3 cport:-1)::CommitProcessor@171] - Processing request:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a 2012-06-26 13:10:19,892 [myid:3] - DEBUG [ProcessThread(sid:3 cport:-1)::Leader@716] - Proposing:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a 2012-06-26 13:10:19,919 [myid:3] - DEBUG [LearnerHandler-/xx.xx.xx.102:13846:CommitProcessor@161] - Committing request:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a 2012-06-26 13:10:20,608 [myid:3] - DEBUG [CommitProcessor:3:FinalRequestProcessor@88] - Processing request:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a HOSt-xx-xx-xx-91:/home/Jun25_LR/install/zookeeper/logs # grep -E /hadoop-ha/hacluster/ActiveStandbyElectorLock|0x1382791d4e50004 *|grep 0x20075 2012-06-26 13:10:19,893 [myid:3] - DEBUG [ProcessThread(sid:3 cport:-1)::CommitProcessor@171] - Processing request:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a 2012-06-26 13:10:19,920 [myid:3] - DEBUG [ProcessThread(sid:3 cport:-1)::Leader@716] - Proposing:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a 2012-06-26 13:10:20,278 [myid:3] - DEBUG [LearnerHandler-/xx.xx.xx.102:13846:CommitProcessor@161] - Committing request:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a 2012-06-26 13:10:20,752 [myid:3] - DEBUG [CommitProcessor:3:FinalRequestProcessor@88] - Processing request:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a Close session and create requests coming almost parallely. Env: Hadoop setup. 
We were using Namenode HA with bookkeeper as shared storage and auto failover enabled. NN102 was active and NN55 was standby. The FailoverController at 102 got shut down due to a ZK connection error. The lock (the ActiveStandbyElectorLock ephemeral node) created by this failover controller is not cleared from ZK.
[jira] [Commented] (ZOOKEEPER-1105) c client zookeeper_close not send CLOSE_OP request to server
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456681#comment-13456681 ] Mahadev konar commented on ZOOKEEPER-1105: -- Nice catch Michi. I think I'll revert the patch for 3.4 and trunk and we can fix it later. I don't think this looks like a blocker for the 3.4 release. Michi, what do you think? c client zookeeper_close not send CLOSE_OP request to server Key: ZOOKEEPER-1105 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1105 Project: ZooKeeper Issue Type: Bug Components: c client Affects Versions: 3.3.2, 3.4.3 Reporter: jiang guangran Assignee: lincoln.lee Fix For: 3.4.4, 3.5.0 Attachments: zklog.txt, zktest.c, zktest.java, ZOOKEEPER-1105.patch In the zookeeper_close function, adaptor_finish is done before the CLOSE_OP request is sent to the server, so the CLOSE_OP request cannot be sent. The server zookeeper.log has many entries like: 2011-06-22 00:23:02,323 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x1305970d66d2224, likely client has closed socket 2011-06-22 00:23:02,324 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /10.250.8.123:60257 which had sessionid 0x1305970d66d2224 2011-06-22 00:23:02,325 - ERROR [CommitProcessor:1:NIOServerCnxn@445] - Unexpected Exception: java.nio.channels.CancelledKeyException at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55) at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59) at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418) at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367) at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73) The Java client does not have this problem.
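The ordering bug above can be illustrated with a toy model, written in Java for consistency with the other snippets in this thread even though the affected code is the C client; all names here are hypothetical stand-ins. A request queued after the transport is torn down never reaches the server, which then only observes the socket closing (the EndOfStreamException in the server log above):

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only (hypothetical names; the real code is the C client).
class CloseOrdering {
    private boolean transportUp = true;
    final List<String> serverSaw = new ArrayList<>();

    void send(String op) {
        // Delivered only while the transport is still up.
        if (transportUp) serverSaw.add(op);
    }

    // Stand-in for adaptor_finish: tears down the I/O machinery.
    void adaptorFinish() { transportUp = false; }

    // Buggy order: tear down first, then try to send the close request.
    void closeBuggy() { adaptorFinish(); send("CLOSE_OP"); }

    // Fixed order: send the close request, then tear down.
    void closeFixed() { send("CLOSE_OP"); adaptorFinish(); }

    public static void main(String[] args) {
        CloseOrdering buggy = new CloseOrdering();
        buggy.closeBuggy();
        CloseOrdering fixed = new CloseOrdering();
        fixed.closeFixed();
        System.out.println(buggy.serverSaw.contains("CLOSE_OP")
                + " " + fixed.serverSaw.contains("CLOSE_OP"));
    }
}
```

In the buggy ordering the server never sees CLOSE_OP; in the fixed ordering it does, so the session can be closed cleanly instead of expiring later.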
[jira] [Commented] (ZOOKEEPER-1105) c client zookeeper_close not send CLOSE_OP request to server
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456682#comment-13456682 ] Mahadev konar commented on ZOOKEEPER-1105: -- Michi, looks like you reverted the patch for trunk. Can you do that for the 3.4 branch as well? If not, let me know. I can do it.
[jira] [Commented] (ZOOKEEPER-1105) c client zookeeper_close not send CLOSE_OP request to server
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456684#comment-13456684 ] Mahadev konar commented on ZOOKEEPER-1105: -- Thanks Michi!
[jira] [Updated] (ZOOKEEPER-1448) Node+Quota creation in transaction log can crash leader startup
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1448: - I looked at the patch. The patch looks good except for the point that Pat mentioned above: it moves the test toward log4j rather than using slf4j. For now I am moving this out to 3.4.5 so we can get it done right. Botond, if you have some time, would you please update the patch to use slf4j in the testcase? Node+Quota creation in transaction log can crash leader startup --- Key: ZOOKEEPER-1448 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1448 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.3.5 Reporter: Botond Hejj Assignee: Botond Hejj Priority: Critical Fix For: 3.5.0, 3.3.7, 3.4.5 Attachments: ZOOKEEPER-1448_branch3.3.patch, ZOOKEEPER-1448.patch, ZOOKEEPER-1448.patch, ZOOKEEPER-1448.patch, ZOOKEEPER-1448.patch Hi, I've found a bug in zookeeper related to quota creation which can shut down the zookeeper leader on startup. Steps to reproduce: 1. create /quota_bug 2. setquota -n 1 /quota_bug 3. stop the whole ensemble (the previous operations should be in the transaction log) 4. start all the servers 5. the elected leader will shut down with an exception (Missing stat node for count /zookeeper/quota/quota_bug/zookeeper_stats) I've debugged a bit of what is happening and I found the following problem: On startup each server loads the last snapshot and replays the last transaction log. While doing this it fills up the pTrie variable of the DataTree with the paths of the nodes which have quota. After the leader is elected, the leader server loads the snapshot and last transaction log again, but it doesn't clean up the pTrie variable. This means it still contains the /quota_bug path.
Now when the create /quota_bug is processed from the transaction log, the DataTree already thinks that the quota nodes (/zookeeper/quota/quota_bug/zookeeper_limits and /zookeeper/quota/quota_bug/zookeeper_stats) are created, but those node creations actually come later in the transaction log. This leads to the missing stat node exception. I think clearing the pTrie should solve this problem.
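The proposed fix (clearing the pTrie before the elected leader re-loads state) can be sketched with a toy stand-in; MiniDataTree and its members are hypothetical names for illustration, not the real DataTree API:

```java
import java.util.HashSet;
import java.util.Set;

// Toy stand-in (hypothetical names, not the real DataTree API) for the
// proposed fix: the quota-path index must be reset before the elected
// leader re-loads the snapshot and replays the transaction log, otherwise
// replayed quota creations look like pre-existing quota nodes.
class MiniDataTree {
    private final Set<String> quotaPaths = new HashSet<>(); // stand-in for pTrie

    void addQuotaPath(String path) { quotaPaths.add(path); }
    boolean hasQuota(String path) { return quotaPaths.contains(path); }

    // Called before a fresh load, mirroring "clearing the pTrie".
    void clear() { quotaPaths.clear(); }

    public static void main(String[] args) {
        MiniDataTree tree = new MiniDataTree();
        tree.addQuotaPath("/quota_bug"); // filled during the first replay
        tree.clear();                    // reset before the leader's re-load
        System.out.println(tree.hasQuota("/quota_bug"));
    }
}
```

After the reset, the second replay sees no stale /quota_bug entry, so the replayed create is handled as a fresh node rather than one whose stat node is assumed to already exist.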
[jira] [Updated] (ZOOKEEPER-1448) Node+Quota creation in transaction log can crash leader startup
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1448: - Fix Version/s: (was: 3.4.4) 3.4.5
[jira] [Updated] (ZOOKEEPER-1548) Cluster fails election loop in new and interesting way
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1548: - Fix Version/s: 3.4.5 3.5.0 Cluster fails election loop in new and interesting way -- Key: ZOOKEEPER-1548 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1548 Project: ZooKeeper Issue Type: Bug Components: leaderElection Affects Versions: 3.4.3 Reporter: Alan Horn Fix For: 3.5.0, 3.4.5 Attachments: 1-follower, 2-follower, 3-leader Hi, We have a five node cluster, recently upgraded from 3.3.5 to 3.4.3. It was running fine for a few weeks after the upgrade, then the following sequence of events occurred: 1. All servers stopped responding to 'ruok' at the same time 2. Our local supervisor process restarted all of them at the same time (yes, this is bad, we didn't expect it to fail this way :) 3. The cluster would not serve requests after this. It appeared to be unable to complete an election. We tried various things at this point, none of which worked: * Moved around the restart order of the nodes (e.g. 4 thru 0, instead of 0 thru 4) * Reduced the number of running nodes from 5 to 3 to simplify the quorum, by only starting up 0, 1, 2 in one test, and 0, 2, 4 in the other * Removed the *Epoch files from the version-2/ snapshot directory * Put the same version2/snapshot.x file on each server in the cluster * Added the (same on all nodes) last txlog onto each cluster * Kept only the last snapshot plus txlog unique on each server * Changed leaderServes=no to leaderServes=yes * Removed all files and started up with empty data as a control. This worked, but of course isn't terribly useful :) Finally, I brought the data up on a single node running in standalone mode and this worked (yay!) So at this point we brought the single node back into service and have kept the other four available to debug why the election is failing. We downgraded the four nodes to 3.3.5, and then they completed the election and started serving as expected.
We did a rolling upgrade to 3.4.3, and everything was fine until we restarted the leader, whereupon we encountered the same re-election loop as before. We're a bit out of ideas at this point, so I was hoping someone from this list might have some useful input. Output from two followers and a leader during this condition is attached. Cheers, Al
[jira] [Commented] (ZOOKEEPER-1496) Ephemeral node not getting cleared even after client has exited
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456758#comment-13456758 ] Mahadev konar commented on ZOOKEEPER-1496: -- Rakesh, I looked at the patch and it looks good, except for this one: {code} -set = sessionSets.remove(nextExpirationTime); +SessionSet set = sessionSets.get(nextExpirationTime); {code} I think the remove still needs to happen else the session sets will keep growing in the hashset. Also, {code} if (s != null) { -sessionSets.get(s.tickTime).sessions.remove(s); +SessionSet sessionSet = sessionSets.get(s.tickTime); +sessionSet.sessions.remove(s); +// Cleanup sessionSets, if no session exists +if (sessionSet.sessions.size() == 0) { +sessionSets.remove(s.tickTime); +} } } {code} I see that you are removing the sessionSet once the session cleans up but I think we need to still do the remove the session set when iterating for expiry. Does that make sense? Ephemeral node not getting cleared even after client has exited --- Key: ZOOKEEPER-1496 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1496 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3 Reporter: suja s Assignee: Rakesh R Priority: Critical Fix For: 3.4.4, 3.5.0 Attachments: Logs.rar, ZOOKEEPER-1496.1.patch, ZOOKEEPER-1496.2.patch, ZOOKEEPER-1496.patch In one of the tests we performed, came across a case where the ephemeral node was not getting cleared from zookeeper though the client exited. Zk version: 3.4.3 Ephemeral node still exists in Zookeeper: HOST-xx-xx-xx-55:/home/Jun25_LR/install/zookeeper/bin # date Tue Jun 26 16:07:04 IST 2012 HOST-xx-xx-xx-55:/home/Jun25_LR/install/zookeeper/bin # ./zkCli.sh -server xx.xx.xx.55:2182 Connecting to xx.xx.xx.55:2182 Welcome to ZooKeeper! 
JLine support is enabled [zk: xx.xx.xx.55:2182(CONNECTING) 0] WATCHER:: WatchedEvent state:SyncConnected type:None path:null [zk: xx.xx.xx.55:2182(CONNECTED) 0] get /hadoop-ha/hacluster/ActiveStandbyElectorLock haclusternn2HOSt-xx-xx-xx-102 �� cZxid = 0x20075 ctime = Tue Jun 26 13:10:19 IST 2012 mZxid = 0x20075 mtime = Tue Jun 26 13:10:19 IST 2012 pZxid = 0x20075 cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x1382791d4e50004 dataLength = 42 numChildren = 0 [zk: xx.xx.xx.55:2182(CONNECTED) 1] Grepped logs at ZK side for session 0x1382791d4e50004 - close session and later create coming before closesession processed. HOSt-xx-xx-xx-91:/home/Jun25_LR/install/zookeeper/logs # grep -E /hadoop-ha/hacluster/ActiveStandbyElectorLock|0x1382791d4e50004 *|grep 0x20074 2012-06-26 13:10:18,834 [myid:3] - DEBUG [ProcessThread(sid:3 cport:-1)::CommitProcessor@171] - Processing request:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a 2012-06-26 13:10:19,892 [myid:3] - DEBUG [ProcessThread(sid:3 cport:-1)::Leader@716] - Proposing:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a 2012-06-26 13:10:19,919 [myid:3] - DEBUG [LearnerHandler-/xx.xx.xx.102:13846:CommitProcessor@161] - Committing request:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a 2012-06-26 13:10:20,608 [myid:3] - DEBUG [CommitProcessor:3:FinalRequestProcessor@88] - Processing request:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a HOSt-xx-xx-xx-91:/home/Jun25_LR/install/zookeeper/logs # grep -E /hadoop-ha/hacluster/ActiveStandbyElectorLock|0x1382791d4e50004 *|grep 0x20075 2012-06-26 13:10:19,893 [myid:3] - DEBUG [ProcessThread(sid:3 cport:-1)::CommitProcessor@171] - Processing request:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a 2012-06-26 13:10:19,920 [myid:3] - DEBUG 
[ProcessThread(sid:3 cport:-1)::Leader@716] - Proposing:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a 2012-06-26 13:10:20,278 [myid:3] - DEBUG [LearnerHandler-/xx.xx.xx.102:13846:CommitProcessor@161] - Committing request:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a 2012-06-26 13:10:20,752 [myid:3] - DEBUG [CommitProcessor:3:FinalRequestProcessor@88] - Processing request:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a Close session and create requests coming almost parallely.
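The session-set cleanup that the review comment asks for can be sketched as follows. This is a minimal standalone Java sketch with hypothetical names (SessionBuckets, Session); the real ZooKeeper SessionTrackerImpl differs, but the two invariants under review are the same: the expiry path must remove() the bucket it is about to process, and the close path must prune a bucket once its last session is gone, otherwise the map of session sets grows without bound.

```java
import java.util.HashMap;
import java.util.HashSet;

// Hypothetical, simplified model of the sessionSets bookkeeping discussed above.
public class SessionBuckets {
    static class Session {
        final long tickTime;
        final long id;
        Session(long tickTime, long id) { this.tickTime = tickTime; this.id = id; }
    }

    final HashMap<Long, HashSet<Session>> sessionSets = new HashMap<>();

    void add(Session s) {
        sessionSets.computeIfAbsent(s.tickTime, k -> new HashSet<>()).add(s);
    }

    // Expiry path: remove() the bucket rather than get(), so buckets for
    // past expiration times do not accumulate in the map.
    HashSet<Session> expire(long nextExpirationTime) {
        HashSet<Session> set = sessionSets.remove(nextExpirationTime);
        return set == null ? new HashSet<>() : set;
    }

    // Close path: drop the session and prune the bucket once it is empty,
    // which is the extra cleanup the patch under review adds.
    void close(Session s) {
        HashSet<Session> set = sessionSets.get(s.tickTime);
        if (set != null) {
            set.remove(s);
            if (set.isEmpty()) {
                sessionSets.remove(s.tickTime);
            }
        }
    }

    public static void main(String[] args) {
        SessionBuckets b = new SessionBuckets();
        Session s1 = new Session(100L, 1L);
        Session s2 = new Session(100L, 2L);
        b.add(s1);
        b.add(s2);
        b.close(s1);
        if (!b.sessionSets.containsKey(100L)) throw new AssertionError("bucket pruned too early");
        b.close(s2);
        if (b.sessionSets.containsKey(100L)) throw new AssertionError("empty bucket not pruned");
        b.add(new Session(200L, 3L));
        b.expire(200L);
        if (!b.sessionSets.isEmpty()) throw new AssertionError("expired bucket not removed");
        System.out.println("ok");
    }
}
```

Either path alone is insufficient: skipping the remove() on expiry leaks buckets even with the close-time pruning, which is the point of the review comment.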
[jira] [Updated] (ZOOKEEPER-1361) Leader.lead iterates over 'learners' set without proper synchronisation
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1361: - Attachment: ZOOKEEPER-1361-3.4.patch Thanks Camille/Ross/Henry, I am committing Ross's patch that is a straightforward port from trunk to the 3.4 branch. Attaching a cleaned up version of Ross's patch (removing CHANGES.txt changes). Leader.lead iterates over 'learners' set without proper synchronisation --- Key: ZOOKEEPER-1361 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1361 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.4.2 Reporter: Henry Robinson Assignee: Henry Robinson Fix For: 3.4.4, 3.5.0 Attachments: zk-memory-leak-fix.patch, ZOOKEEPER-1361-3.4.patch, ZOOKEEPER-1361-3.4.patch, ZOOKEEPER-1361-no-whitespace.patch, ZOOKEEPER-1361.patch This block: {code} HashSet<Long> followerSet = new HashSet<Long>(); for(LearnerHandler f : learners) followerSet.add(f.getSid()); {code} is executed without holding the lock on learners, so if there were ever a condition where a new learner was added during the initial sync phase, I'm pretty sure we'd see a concurrent modification exception. Certainly other parts of the code are very careful to lock on learners when iterating. It would be nice to use a {{ConcurrentHashMap}} to hold the learners instead, but I can't convince myself that this wouldn't introduce some correctness bugs. For example the following: Learners contains A, B, C, D. Thread 1 iterates over learners, and gets as far as B. Thread 2 removes A, and adds E. Thread 1 continues iterating and sees a learner view of A, B, C, D, E. This may be a bug if Thread 1 is counting the number of synced followers for a quorum count, since at no point was A, B, C, D, E a correct view of the quorum. 
In practice, I think this is actually ok, because I don't think ZK makes any strong ordering guarantees on learners joining or leaving (so we don't need a strong serialisability guarantee on learners) but I don't think I'll make that change for this patch. Instead I want to clean up the locking protocols on the follower / learner sets - to avoid another easy deadlock like the one we saw in ZOOKEEPER-1294 - and to do less with the lock held; i.e. to copy and then iterate over the copy rather than iterate over a locked set.
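The "copy under the lock, then iterate over the copy" protocol Henry describes can be sketched like this. It is a minimal standalone Java sketch, not the actual Leader code: LearnerSnapshot and the stub LearnerHandler (carrying only a sid) are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of iterating a shared learner set safely: hold the monitor only
// long enough to copy, then iterate lock-free over the copy, so a learner
// joining mid-iteration cannot cause a ConcurrentModificationException
// or a deadlock from doing work while the lock is held.
public class LearnerSnapshot {
    static class LearnerHandler {
        final long sid;
        LearnerHandler(long sid) { this.sid = sid; }
        long getSid() { return sid; }
    }

    final Set<LearnerHandler> learners = new HashSet<>();

    Set<Long> followerSids() {
        List<LearnerHandler> copy;
        synchronized (learners) {           // lock held only for the copy
            copy = new ArrayList<>(learners);
        }
        Set<Long> followerSet = new HashSet<>();
        for (LearnerHandler f : copy) {     // iterate over the stable copy
            followerSet.add(f.getSid());
        }
        return followerSet;
    }

    public static void main(String[] args) {
        LearnerSnapshot ls = new LearnerSnapshot();
        ls.learners.add(new LearnerHandler(1L));
        ls.learners.add(new LearnerHandler(2L));
        if (!ls.followerSids().equals(new HashSet<>(Arrays.asList(1L, 2L)))) {
            throw new AssertionError("unexpected sid set");
        }
        System.out.println("ok");
    }
}
```

Note the copy gives each caller a consistent snapshot of the set at one instant, which also sidesteps the weakly consistent iteration a ConcurrentHashMap would give (the A, B, C, D, E view described above).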
[jira] [Updated] (ZOOKEEPER-1469) Adding Cross-Realm support for secure Zookeeper client authentication
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1469: - Fix Version/s: (was: 3.4.4) Moving it out of 3.4 release. Adding Cross-Realm support for secure Zookeeper client authentication - Key: ZOOKEEPER-1469 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1469 Project: ZooKeeper Issue Type: Improvement Components: documentation Affects Versions: 3.4.3 Reporter: Himanshu Vashishtha Assignee: Eugene Koontz Fix For: 3.5.0 Attachments: SaslServerCallBackHandlerException.patch There is a use case where one needs to support cross-realm authentication for a zookeeper cluster. One use case is HBase Replication: HBase supports replicating data to multiple slave clusters, where the latter might be running in different realms. With current zookeeper security, the region servers of the master HBase cluster are not able to query the zookeeper quorum members of the slave cluster. This jira is about adding such cross-realm support.
[jira] [Updated] (ZOOKEEPER-1478) Small bug in QuorumTest.testFollowersStartAfterLeader( )
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1478: - Fix Version/s: (was: 3.4.4) Moving it out to 3.5 since the bugfix isn't a really critical one. Small bug in QuorumTest.testFollowersStartAfterLeader( ) Key: ZOOKEEPER-1478 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1478 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.4.3 Reporter: Alexander Shraer Assignee: Alexander Shraer Priority: Minor Fix For: 3.5.0 Attachments: ZOOKEEPER-1478.patch, ZOOKEEPER-1478.patch, ZOOKEEPER-1478.patch, ZOOKEEPER-1478.patch The following code appears in QuorumTest.testFollowersStartAfterLeader( ): for (int i = 0; i < 30; i++) { try { zk.create("/test", "test".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT); break; } catch(KeeperException.ConnectionLossException e) { Thread.sleep(1000); } // test fails if we still can't connect to the quorum after 30 seconds. Assert.fail("client could not connect to reestablished quorum: giving up after 30+ seconds."); } From the comment it looks like the intention was to try to reconnect 30 times and only then trigger the Assert, but that's not what this does. After we fail to connect once and Thread.sleep is executed, Assert.fail will be executed without retrying create.
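The retry structure the reporter says was intended can be sketched as follows. This is a standalone Java sketch, not the actual test: attemptCreate is a hypothetical stand-in for the zk.create call (it throws for a few attempts, like a quorum still re-forming), and the Assert.fail happens only after all attempts are exhausted.

```java
// Sketch of "retry up to N times, sleep between attempts, fail only after
// the last attempt" - the intended control flow for the test above.
public class RetryLoop {
    static int failuresRemaining = 3;

    // Hypothetical stand-in for zk.create(...): throws until a few
    // attempts have been made.
    static void attemptCreate() throws Exception {
        if (failuresRemaining-- > 0) {
            throw new Exception("connection loss");
        }
    }

    static boolean createWithRetry(int maxAttempts, long sleepMs) throws InterruptedException {
        for (int i = 0; i < maxAttempts; i++) {
            try {
                attemptCreate();
                return true;               // success: stop retrying
            } catch (Exception e) {
                Thread.sleep(sleepMs);     // wait, then loop to the next attempt
            }
        }
        return false;  // only now would the caller Assert.fail
    }

    public static void main(String[] args) throws InterruptedException {
        if (!createWithRetry(30, 1)) {
            throw new AssertionError("could not connect after 30 attempts");
        }
        System.out.println("ok");
    }
}
```

The key difference from the buggy version is that the failure branch falls through to the next loop iteration; the original placed the unconditional fail inside the loop body, so any single connection loss failed the test.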
[jira] [Updated] (ZOOKEEPER-1478) Small bug in QuorumTest.testFollowersStartAfterLeader( )
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1478: - Attachment: ZOOKEEPER-1478.patch Re uploading the patch for hudson. Small bug in QuorumTest.testFollowersStartAfterLeader( ) Key: ZOOKEEPER-1478 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1478 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.4.3 Reporter: Alexander Shraer Assignee: Alexander Shraer Priority: Minor Fix For: 3.5.0 Attachments: ZOOKEEPER-1478.patch, ZOOKEEPER-1478.patch, ZOOKEEPER-1478.patch, ZOOKEEPER-1478.patch, ZOOKEEPER-1478.patch
[jira] [Updated] (ZOOKEEPER-1494) C client: socket leak after receive timeout in zookeeper_interest()
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1494: - Attachment: ZOOKEEPER-1494.patch C client: socket leak after receive timeout in zookeeper_interest() --- Key: ZOOKEEPER-1494 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1494 Project: ZooKeeper Issue Type: Bug Components: c client Affects Versions: 3.4.2, 3.3.5 Reporter: Michi Mutsuzaki Assignee: Michi Mutsuzaki Fix For: 3.4.4, 3.5.0 Attachments: ZOOKEEPER-1494-3.4.patch, ZOOKEEPER-1494.patch, ZOOKEEPER-1494.patch In zookeeper_interest(), we set zk->fd to -1 without closing it when a timeout happens. Instead we should let the handle_socket_error_msg() function take care of closing the socket properly. --Michi
[jira] [Updated] (ZOOKEEPER-1538) Improve space handling in zkServer.sh and zkEnv.sh
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1538: - Fix Version/s: (was: 3.4.4) Moving it out of the 3.4 branch. The patch looks good. I'll go ahead and commit this to trunk. Improve space handling in zkServer.sh and zkEnv.sh -- Key: ZOOKEEPER-1538 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1538 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.4.3 Reporter: Andrew Ferguson Assignee: Andrew Ferguson Priority: Trivial Fix For: 3.5.0 Attachments: ZOOKEEPER-1538.patch Running `bin/zkServer.sh start` from a freshly-built copy of trunk fails if the source code is checked out to a directory with spaces in the name. I'll include a small fix for this problem. thanks!
[jira] [Updated] (ZOOKEEPER-1462) Read-only server does not initialize database properly
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1462: - Fix Version/s: (was: 3.4.4) 3.4.5 Moving it out since we do not have a patch. Read-only server does not initialize database properly -- Key: ZOOKEEPER-1462 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1462 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3 Reporter: Thawan Kooburat Assignee: Thawan Kooburat Priority: Critical Fix For: 3.5.0, 3.4.5 Attachments: ZOOKEEPER-1462.patch Brief Description: When a participant or observer gets partitioned and restarts as a read-only server, the ZkDb doesn't get reinitialized. This causes the RO server to drop any incoming request with zxid > 0. Error message: Refusing session request for client /xx.xx.xx.xx:39875 as it has seen zxid 0x2e00405fd9 our last zxid is 0x0 client must try another server Steps to reproduce: Start an RO-enabled observer connecting to an ensemble. Kill the ensemble and wait until the observer restarts in RO mode. The zxid of this observer should be 0. Description: Before a server transitions into LOOKING state, its database gets closed as part of the shutdown sequence. The databases of the leader, follower and observer get initialized as a side effect of participating in the leader election protocol (e.g. an observer will call registerWithLeader() and call getLastLoggedZxid(), which initializes the db if not already). However, an RO server does not participate in this protocol, so its DB doesn't get initialized properly.
[jira] [Updated] (ZOOKEEPER-1387) Wrong epoch file created
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1387: - Fix Version/s: (was: 3.4.4) 3.4.5 Moving it out since it's not a blocker. Wrong epoch file created Key: ZOOKEEPER-1387 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1387 Project: ZooKeeper Issue Type: Bug Components: quorum Affects Versions: 3.4.2 Reporter: Benjamin Busjaeger Assignee: Benjamin Reed Priority: Minor Fix For: 3.5.0, 3.4.5 Attachments: ZOOKEEPER-1387.patch It looks like line 443 in QuorumPeer [1] may need to change from: writeLongToFile(CURRENT_EPOCH_FILENAME, acceptedEpoch); to writeLongToFile(ACCEPTED_EPOCH_FILENAME, acceptedEpoch); I only noticed this reading the code, so I may be wrong and I don't know yet if/how this affects the runtime. [1] https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L443
[jira] [Commented] (ZOOKEEPER-1328) Misplaced assertion for the test case 'FLELostMessageTest' and not identifying misfunctions
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13446038#comment-13446038 ] Mahadev konar commented on ZOOKEEPER-1328: -- Thanks for fixing that Rakesh. I'll run it through hudson again and will commit as soon as it +1's. Misplaced assertion for the test case 'FLELostMessageTest' and not identifying misfunctions --- Key: ZOOKEEPER-1328 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1328 Project: ZooKeeper Issue Type: Test Components: leaderElection Affects Versions: 3.4.0 Reporter: Rakesh R Assignee: Rakesh R Fix For: 3.5.0 Attachments: ZOOKEEPER-1328.1.patch, ZOOKEEPER-1328.2.patch, ZOOKEEPER-1328.patch The assertion for testLostMessage is kept inside the thread's run() method. Due to this, an assertion failure will not be reflected in the main testcase. I have observed the test case still passing in case of an assert failure or misfunction. Instead, the assertion can be moved to the test case - testLostMessage. {noformat} class LEThread extends Thread { public void run() { peer.setCurrentVote(v); LOG.info("Finished election: " + i + ", " + v.getId()); Assert.assertTrue("State is not leading.", peer.getPeerState() == ServerState.LEADING); } } {noformat}
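The failure mode described here is general: a JUnit assertion thrown inside a worker thread's run() kills only that thread, so the test runner never sees it. A common fix direction, sketched below with illustrative names (not the actual FLELostMessageTest code), is to record the failure in the worker and re-throw it on the main thread after join().

```java
// Sketch: capture a worker-thread failure and surface it on the main
// thread, so the test actually fails when the condition does not hold.
public class ThreadAssertDemo {
    static volatile Throwable workerFailure;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                // stand-in for peer.getPeerState() == ServerState.LEADING
                boolean leading = true;
                if (!leading) {
                    throw new AssertionError("State is not leading.");
                }
            } catch (Throwable t) {
                workerFailure = t;   // record instead of dying silently
            }
        });
        worker.start();
        worker.join();
        if (workerFailure != null) { // main thread re-raises the failure
            throw new AssertionError(workerFailure);
        }
        System.out.println("ok");
    }
}
```

Moving the assertion itself into the test method, as the ticket proposes, achieves the same thing when the checked state is still observable after join().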
[jira] [Commented] (ZOOKEEPER-1328) Misplaced assertion for the test case 'FLELostMessageTest' and not identifying misfunctions
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445060#comment-13445060 ] Mahadev konar commented on ZOOKEEPER-1328: -- Thanks for the patience and quick response Rakesh. Really appreciate that. The patch looks good to me. Ill let hudson run through it and will go ahead and commit once it +1's. Misplaced assertion for the test case 'FLELostMessageTest' and not identifying misfunctions --- Key: ZOOKEEPER-1328 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1328 Project: ZooKeeper Issue Type: Test Components: leaderElection Affects Versions: 3.4.0 Reporter: Rakesh R Assignee: Rakesh R Fix For: 3.5.0 Attachments: ZOOKEEPER-1328.1.patch, ZOOKEEPER-1328.patch
[jira] [Commented] (ZOOKEEPER-1536) c client : memory leak in winport.c
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445172#comment-13445172 ] Mahadev konar commented on ZOOKEEPER-1536: -- Michi, is this committed to the 3.4 branch as well? c client : memory leak in winport.c --- Key: ZOOKEEPER-1536 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1536 Project: ZooKeeper Issue Type: Bug Components: c client Affects Versions: 3.4.3 Environment: windows7 Reporter: brooklin Assignee: brooklin Fix For: 3.4.4 Attachments: winport.c.patch At line 99 in winport.c, the Windows API InitializeCriticalSection is used but DeleteCriticalSection is never called.
[jira] [Commented] (ZOOKEEPER-1497) Allow server-side SASL login with JAAS configuration to be programmatically set (rather than only by reading JAAS configuration file)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445180#comment-13445180 ] Mahadev konar commented on ZOOKEEPER-1497: -- Pat, was this committed to the 3.4 branch? I don't see it. Maybe I missed it? Allow server-side SASL login with JAAS configuration to be programmatically set (rather than only by reading JAAS configuration file) - Key: ZOOKEEPER-1497 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1497 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.3, 3.5.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Labels: security Fix For: 3.4.4, 3.5.0 Attachments: ZOOKEEPER-1497-v1.patch, ZOOKEEPER-1497-v2.patch, ZOOKEEPER-1497-v3.patch, ZOOKEEPER-1497-v4.patch, ZOOKEEPER-1497-v5.patch Currently the CnxnFactory checks for java.security.auth.login.config to decide whether or not to enable SASL. * zookeeper/server/NIOServerCnxnFactory.java * zookeeper/server/NettyServerCnxnFactory.java ** configure() checks for java.security.auth.login.config *** If present, start the new Login(Server, SaslServerCallbackHandler(conf)) But since the SaslServerCallbackHandler does the right thing just by checking whether getAppConfigurationEntry() is empty, we can allow SASL with a programmatically set JAAS configuration by just checking whether or not a configuration entry is present, instead of checking java.security.auth.login.config. (Something quite similar was done for the SaslClient in ZOOKEEPER-1373)
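The check the ticket proposes can be sketched with the standard JAAS API. This is an illustrative sketch (SaslCheck and shouldEnableSasl are hypothetical names, and "Server" is the JAAS section name ZooKeeper conventionally uses): instead of testing whether the java.security.auth.login.config system property is set, ask the installed javax.security.auth.login.Configuration whether the section actually exists, which also covers configurations installed programmatically via Configuration.setConfiguration.

```java
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

// Sketch: enable SASL when a JAAS configuration entry is actually present,
// regardless of how the Configuration was installed (file or programmatic).
public class SaslCheck {
    static boolean shouldEnableSasl(String section) {
        try {
            AppConfigurationEntry[] entries =
                Configuration.getConfiguration().getAppConfigurationEntry(section);
            return entries != null && entries.length > 0;
        } catch (SecurityException e) {
            return false;  // no login configuration could be located at all
        }
    }

    public static void main(String[] args) {
        // With no JAAS configuration installed, the section is absent
        // and SASL would stay disabled.
        System.out.println(shouldEnableSasl("Server"));
    }
}
```

The property-based check misses programmatic configurations entirely, which is exactly the gap this improvement closes.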
[jira] [Commented] (ZOOKEEPER-1497) Allow server-side SASL login with JAAS configuration to be programmatically set (rather than only by reading JAAS configuration file)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445394#comment-13445394 ] Mahadev konar commented on ZOOKEEPER-1497: -- Nevermind, I see it now. Mistake on my side! Allow server-side SASL login with JAAS configuration to be programmatically set (rather than only by reading JAAS configuration file) - Key: ZOOKEEPER-1497 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1497 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.3, 3.5.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Labels: security Fix For: 3.4.4, 3.5.0 Attachments: ZOOKEEPER-1497-v1.patch, ZOOKEEPER-1497-v2.patch, ZOOKEEPER-1497-v3.patch, ZOOKEEPER-1497-v4.patch, ZOOKEEPER-1497-v5.patch
[jira] [Updated] (ZOOKEEPER-1359) ZkCli create command data and acl parts should be optional.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1359: - Fix Version/s: (was: 3.4.4) 3.4.5 Moving it out since it's not a blocker. ZkCli create command data and acl parts should be optional. --- Key: ZOOKEEPER-1359 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1359 Project: ZooKeeper Issue Type: Bug Components: java client Reporter: kavita sharma Assignee: kavita sharma Priority: Trivial Labels: new Fix For: 3.5.0, 3.4.5 In zkCli, if we create a node without data, the node still gets created, but the commandMap shows {noformat} commandMap.put("create", "[-s] [-e] path data acl"); {noformat} which means the data and acl parts are not marked as optional. We need to change these parts to be optional.
[jira] [Updated] (ZOOKEEPER-1378) Provide option to turn off sending of diffs
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1378: - Fix Version/s: (was: 3.4.4) 3.5.0 Moving it out to 3.5. I think we should mark it as won't fix but I'll keep it open for now. Provide option to turn off sending of diffs --- Key: ZOOKEEPER-1378 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1378 Project: ZooKeeper Issue Type: Task Reporter: Zhihong Ted Yu Fix For: 3.5.0 From Patrick: we need to have an option to turn off sending of diffs. There are a couple of really strong reasons I can think of to do this: 1) 3.3.x is broken in a similar way; there is an upgrade problem we can't solve short of having people first upgrade to a fixed 3.3 (3.3.5 say) and then upgrading to 3.4.x. If we could turn off diff sending this would address the problem. 2) safety valve. Say we find another new problem with diff sending in 3.4/3.5. Having an option to turn it off would be useful for people as a workaround until a fix is found and released.
[jira] [Updated] (ZOOKEEPER-1462) Read-only server does not initialize database properly
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1462: - Fix Version/s: (was: 3.4.3) 3.4.4 Thawan, would you be able to add a unit test for this? Read-only server does not initialize database properly -- Key: ZOOKEEPER-1462 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1462 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3 Reporter: Thawan Kooburat Assignee: Thawan Kooburat Priority: Critical Fix For: 3.4.4, 3.5.0 Attachments: ZOOKEEPER-1462.patch
[jira] [Updated] (ZOOKEEPER-1496) Ephemeral node not getting cleared even after client has exited
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1496: - Priority: Critical (was: Major) Ephemeral node not getting cleared even after client has exited --- Key: ZOOKEEPER-1496 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1496 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3 Reporter: suja s Assignee: Rakesh R Priority: Critical Fix For: 3.4.4, 3.5.0 Attachments: Logs.rar, ZOOKEEPER-1496.1.patch, ZOOKEEPER-1496.2.patch, ZOOKEEPER-1496.patch Env: Hadoop setup. We were using Namenode HA with bookkeeper as shared storage and auto failover enabled. NN102 was active and NN55 was standby. FailoverController at 102 got shut down due to ZK connection error. The lock-ActiveStandbyElectorLock created (ephemeral node) by this failovercontroller is not cleared from ZK.
[jira] [Updated] (ZOOKEEPER-1496) Ephemeral node not getting cleared even after client has exited
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1496: - Component/s: server Ephemeral node not getting cleared even after client has exited --- Key: ZOOKEEPER-1496 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1496 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3 Reporter: suja s Assignee: Rakesh R Priority: Critical Fix For: 3.4.4, 3.5.0 Attachments: Logs.rar, ZOOKEEPER-1496.1.patch, ZOOKEEPER-1496.2.patch, ZOOKEEPER-1496.patch In one of the tests we performed, came across a case where the ephemeral node was not getting cleared from zookeeper though the client exited. Zk version: 3.4.3 Ephemeral node still exists in Zookeeper: HOST-xx-xx-xx-55:/home/Jun25_LR/install/zookeeper/bin # date Tue Jun 26 16:07:04 IST 2012 HOST-xx-xx-xx-55:/home/Jun25_LR/install/zookeeper/bin # ./zkCli.sh -server xx.xx.xx.55:2182 Connecting to xx.xx.xx.55:2182 Welcome to ZooKeeper! JLine support is enabled [zk: xx.xx.xx.55:2182(CONNECTING) 0] WATCHER:: WatchedEvent state:SyncConnected type:None path:null [zk: xx.xx.xx.55:2182(CONNECTED) 0] get /hadoop-ha/hacluster/ActiveStandbyElectorLock haclusternn2HOSt-xx-xx-xx-102 �� cZxid = 0x20075 ctime = Tue Jun 26 13:10:19 IST 2012 mZxid = 0x20075 mtime = Tue Jun 26 13:10:19 IST 2012 pZxid = 0x20075 cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x1382791d4e50004 dataLength = 42 numChildren = 0 [zk: xx.xx.xx.55:2182(CONNECTED) 1] Grepped logs at ZK side for session 0x1382791d4e50004 - close session and later create coming before closesession processed. 
> {noformat}
> HOSt-xx-xx-xx-91:/home/Jun25_LR/install/zookeeper/logs # grep -E "/hadoop-ha/hacluster/ActiveStandbyElectorLock|0x1382791d4e50004" * | grep 0x20074
> 2012-06-26 13:10:18,834 [myid:3] - DEBUG [ProcessThread(sid:3 cport:-1)::CommitProcessor@171] - Processing request:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a
> 2012-06-26 13:10:19,892 [myid:3] - DEBUG [ProcessThread(sid:3 cport:-1)::Leader@716] - Proposing:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a
> 2012-06-26 13:10:19,919 [myid:3] - DEBUG [LearnerHandler-/xx.xx.xx.102:13846:CommitProcessor@161] - Committing request:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a
> 2012-06-26 13:10:20,608 [myid:3] - DEBUG [CommitProcessor:3:FinalRequestProcessor@88] - Processing request:: sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a
> HOSt-xx-xx-xx-91:/home/Jun25_LR/install/zookeeper/logs # grep -E "/hadoop-ha/hacluster/ActiveStandbyElectorLock|0x1382791d4e50004" * | grep 0x20075
> 2012-06-26 13:10:19,893 [myid:3] - DEBUG [ProcessThread(sid:3 cport:-1)::CommitProcessor@171] - Processing request:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a
> 2012-06-26 13:10:19,920 [myid:3] - DEBUG [ProcessThread(sid:3 cport:-1)::Leader@716] - Proposing:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a
> 2012-06-26 13:10:20,278 [myid:3] - DEBUG [LearnerHandler-/xx.xx.xx.102:13846:CommitProcessor@161] - Committing request:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a
> 2012-06-26 13:10:20,752 [myid:3] - DEBUG [CommitProcessor:3:FinalRequestProcessor@88] - Processing request:: sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a
> {noformat}
> The close-session and create requests arrive almost in parallel.
> Env: Hadoop setup.
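The anomaly shown by the two greps above can be detected mechanically: once a session's closeSession transaction has been seen at some zxid, no later zxid should carry another request for that same session. A minimal Python sketch of that check (the helper name and the abbreviated log lines are illustrative; only the `sessionid:`/`type:`/`zxid:` fields are taken from the logs above):

```python
import re

# Matches the request fields in the ZooKeeper DEBUG lines quoted above, e.g.
# "... sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 ..."
REQ = re.compile(r"sessionid:(0x[0-9a-f]+) type:(\w+) .*?zxid:(0x[0-9a-f]+)")

def find_post_close_requests(log_lines):
    """Return (session, type, zxid) for requests whose zxid is greater than
    the zxid of that session's closeSession transaction."""
    close_zxid = {}   # session id -> zxid of its closeSession txn
    anomalies = []
    for line in log_lines:
        m = REQ.search(line)
        if not m:
            continue
        session, rtype, zxid = m.group(1), m.group(2), int(m.group(3), 16)
        if rtype == "closeSession":
            close_zxid.setdefault(session, zxid)
        elif session in close_zxid and zxid > close_zxid[session]:
            anomalies.append((session, rtype, zxid))
    return anomalies

# Abbreviated versions of the two log lines quoted above.
log = [
    "2012-06-26 13:10:18,834 ... sessionid:0x1382791d4e50004 type:closeSession cxid:0x0 zxid:0x20074 txntype:-11 reqpath:n/a",
    "2012-06-26 13:10:19,893 ... sessionid:0x1382791d4e50004 type:create cxid:0x2 zxid:0x20075 txntype:1 reqpath:n/a",
]
# Reports the create at zxid 0x20075 that follows the closeSession at 0x20074.
print(find_post_close_requests(log))
```

This is exactly the pattern visible in the grep output: the create at zxid 0x20075 is ordered after the closeSession at zxid 0x20074 for the same session.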
> We were using NameNode HA with BookKeeper as shared storage and automatic failover enabled. NN102 was active and NN55 was standby. The FailoverController on 102 was shut down due to a ZK connection error, but the ActiveStandbyElectorLock ephemeral node created by this FailoverController was not cleared from ZK.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
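The race described in this report can be sketched as follows. This is a simplified model, not ZooKeeper's server code (class and method names are hypothetical): if a create that raced past a closeSession is applied without re-checking session liveness, the ephemeral node outlives its owner, which is the symptom seen here; rejecting requests from already-closed sessions avoids it.

```python
# Simplified model of the race in this issue (hypothetical API, not
# ZooKeeper's actual implementation): a closeSession and a create from the
# same session are in flight almost in parallel, and the create is ordered
# after the closeSession.

class Server:
    def __init__(self):
        self.sessions = set()
        self.ephemerals = {}  # path -> owning session id

    def open_session(self, sid):
        self.sessions.add(sid)

    def close_session(self, sid):
        self.sessions.discard(sid)
        # Closing a session deletes every ephemeral node it owns.
        self.ephemerals = {p: s for p, s in self.ephemerals.items() if s != sid}

    def create_ephemeral(self, sid, path):
        # Re-check session liveness instead of blindly applying the request
        # in commit order; otherwise the node would survive its owner.
        if sid not in self.sessions:
            raise ValueError("session %#x is closed" % sid)
        self.ephemerals[path] = sid

srv = Server()
srv.open_session(0x1382791d4e50004)
srv.close_session(0x1382791d4e50004)           # cf. zxid 0x20074 above
try:
    srv.create_ephemeral(0x1382791d4e50004,    # cf. zxid 0x20075, raced past the close
                         "/hadoop-ha/hacluster/ActiveStandbyElectorLock")
except ValueError as e:
    print("rejected:", e)
```

Without the liveness check in `create_ephemeral`, the lock node would remain in `srv.ephemerals` with no live owner, which is the stale ActiveStandbyElectorLock observed in the zkCli output.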
[jira] [Updated] (ZOOKEEPER-1496) Ephemeral node not getting cleared even after client has exited
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1496:
- Fix Version/s: 3.4.4

This looks like a critical bugfix.
[jira] [Updated] (ZOOKEEPER-1496) Ephemeral node not getting cleared even after client has exited
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-1496:
- Fix Version/s: 3.5.0