[jira] [Commented] (ZOOKEEPER-1435) cap space usage of default log4j rolling policy
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241499#comment-13241499 ]

Mahadev konar commented on ZOOKEEPER-1435:
------------------------------------------

+1 for the patch. Looks good to me!

> cap space usage of default log4j rolling policy
> -----------------------------------------------
>
>                 Key: ZOOKEEPER-1435
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1435
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: scripts
>    Affects Versions: 3.4.3, 3.3.5, 3.5.0
>            Reporter: Patrick Hunt
>            Assignee: Patrick Hunt
>             Fix For: 3.5.0
>
>         Attachments: ZOOKEEPER-1435.patch
>
> Our current log4j log rolling policy (for ROLLINGFILE) doesn't cap the max logging space used. This can be a problem in production systems. See the similar improvement recently made in Hadoop: HADOOP-8149.
> For ROLLINGFILE only, I believe we should change the default threshold to INFO and cap the max space to something reasonable, say 5g (max file size of 256mb, max file count of 20). These will be the defaults in log4j.properties, which you would also be able to override from the command line.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
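For reference, the kind of cap being proposed can be sketched as a log4j.properties fragment. The ROLLINGFILE appender name, the INFO threshold, and the 256MB/20-file numbers come from the description above; everything else (file name, layout pattern) is illustrative, not the committed patch:

```properties
# Sketch of a capped rolling policy for the ROLLINGFILE appender,
# per the sizes suggested in the issue (256MB per file, 20 files ~= 5GB total).
log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.ROLLINGFILE.Threshold=INFO
log4j.appender.ROLLINGFILE.File=zookeeper.log
# Roll each log file at 256MB...
log4j.appender.ROLLINGFILE.MaxFileSize=256MB
# ...and keep at most 20 rolled files, bounding total disk usage.
log4j.appender.ROLLINGFILE.MaxBackupIndex=20
log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} - %-5p [%t:%C{1}@%L] - %m%n
```

`MaxFileSize` and `MaxBackupIndex` are the standard log4j 1.x RollingFileAppender knobs; overriding them from the command line would be done with `-D` system properties referenced from the properties file.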
[jira] [Commented] (ZOOKEEPER-1433) improve ZxidRolloverTest (test seems flakey)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241504#comment-13241504 ]

Mahadev konar commented on ZOOKEEPER-1433:
------------------------------------------

+1 looks good to me... Thanks for fixing this Pat!

> improve ZxidRolloverTest (test seems flakey)
> --------------------------------------------
>
>                 Key: ZOOKEEPER-1433
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1433
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: tests
>    Affects Versions: 3.3.5
>            Reporter: Wing Yew Poon
>            Assignee: Patrick Hunt
>             Fix For: 3.3.6, 3.4.4, 3.5.0
>
>         Attachments: ZOOKEEPER-1433.patch, ZOOKEEPER-1433_test.out
>
> In our jenkins job to run the ZooKeeper unit tests, org.apache.zookeeper.server.ZxidRolloverTest sometimes fails. E.g.,
> {noformat}
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /foo0
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:843)
>     at org.apache.zookeeper.server.ZxidRolloverTest.checkNodes(ZxidRolloverTest.java:154)
>     at org.apache.zookeeper.server.ZxidRolloverTest.testRolloverThenRestart(ZxidRolloverTest.java:211)
> {noformat}
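A test that intermittently fails with ConnectionLoss on an idempotent read like exists() can often be hardened by retrying the call a bounded number of times. The generic sketch below shows that pattern; the class name, backoff, and use of RuntimeException (standing in for ConnectionLossException) are illustrative, not the committed ZOOKEEPER-1433 fix, which may take a different approach:

```java
import java.util.function.Supplier;

public class RetryDemo {
    // Retry a flaky, idempotent operation a bounded number of times
    // with a simple linear backoff between attempts.
    static <T> T withRetries(Supplier<T> op, int maxRetries) {
        RuntimeException last = null;
        for (int i = 0; i <= maxRetries; i++) {
            try {
                return op.get();
            } catch (RuntimeException e) { // in a real ZK test: catch ConnectionLossException only
                last = e;
                try {
                    Thread.sleep(100L * (i + 1)); // back off before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        throw last; // exhausted all retries
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Simulated flaky call: fails twice, then succeeds.
        int v = withRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("connection loss");
            return 42;
        }, 5);
        System.out.println("succeeded after " + calls[0] + " attempts: " + v);
    }
}
```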
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229941#comment-13229941 ]

Mahadev konar commented on ZOOKEEPER-1277:
------------------------------------------

+1 on the patches. Looked through all 3. Good to go! Thanks Pat!

> servers stop serving when lower 32bits of zxid roll over
> --------------------------------------------------------
>
>                 Key: ZOOKEEPER-1277
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.3
>            Reporter: Patrick Hunt
>            Assignee: Patrick Hunt
>            Priority: Critical
>             Fix For: 3.3.5, 3.4.4, 3.5.0
>
>         Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br34.patch, ZOOKEEPER-1277_br34.patch, ZOOKEEPER-1277_trunk.patch, ZOOKEEPER-1277_trunk.patch
>
> When the lower 32 bits of a zxid roll over (a zxid is a 64-bit number; the upper 32 bits are considered the epoch number), the epoch number (upper 32 bits) is incremented and the lower 32 bits start at 0 again. This should work fine; however, in the current 3.3 branch the followers see this as a NEWLEADER message, which it's not, and effectively stop serving clients. Attached clients seem to eventually time out given that heartbeats (or any operation) are no longer processed. The follower doesn't recover from this. I've tested this out on the 3.3 branch and confirmed the problem, however I haven't tried it on 3.4/3.5. It may not happen on the newer branches due to ZOOKEEPER-335, however there is certainly an issue with updating the acceptedEpoch files contained in the datadir. (I'll enter a separate jira for that.)
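The epoch/counter packing described above comes down to a few bit operations. The server code has helpers along these lines (ZxidUtils), but the class and method names in this sketch are illustrative rather than the project's actual API:

```java
public class ZxidDemo {
    // A zxid is 64 bits: the upper 32 are the epoch, the lower 32 a counter.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static long getEpoch(long zxid) {
        return zxid >> 32;
    }

    static long getCounter(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        // Last possible zxid of epoch 1: counter is all ones.
        long z = makeZxid(1L, 0xFFFFFFFFL);
        // On rollover the epoch increments and the counter restarts at 0.
        long next = makeZxid(getEpoch(z) + 1, 0L);
        System.out.println(Long.toHexString(next)); // prints "200000000"
    }
}
```

The bug here was not in the arithmetic itself but in how followers interpreted the resulting epoch bump, mistaking it for a NEWLEADER message.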
[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229708#comment-13229708 ]

Mahadev konar commented on ZOOKEEPER-1277:
------------------------------------------

Ahh... That makes more sense! Updated comments would be good. Thanks!

> servers stop serving when lower 32bits of zxid roll over
> --------------------------------------------------------
>
>                 Key: ZOOKEEPER-1277
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.3
>            Reporter: Patrick Hunt
>            Assignee: Patrick Hunt
>            Priority: Critical
>             Fix For: 3.3.6
>
>         Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch
[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201136#comment-13201136 ]

Mahadev konar commented on ZOOKEEPER-1373:
------------------------------------------

Javadoc warning is due to ZOOKEEPER-1386.

> Hardcoded SASL login context name clashes with Hadoop security configuration override
> --------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1373
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: java client
>    Affects Versions: 3.4.2
>            Reporter: Thomas Weise
>            Assignee: Eugene Koontz
>             Fix For: 3.4.3, 3.5.0
>
>         Attachments: ZOOKEEPER-1373-TW_3_4.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch
>
> I'm trying to configure a process with Hadoop security (Hive metastore server) to talk to ZooKeeper 3.4.2 with Kerberos authentication. In this scenario Hadoop controls the SASL configuration (org.apache.hadoop.security.UserGroupInformation.HadoopConfiguration), instead of setting up the ZooKeeper "Client" loginContext via jaas.conf and the system property {{-Djava.security.auth.login.config}}.
> Using the Hadoop configuration would work, except that the ZooKeeper client code expects the loginContextName to be "Client" while Hadoop security will use "hadoop-keytab-kerberos". I verified that by changing the name in the debugger the SASL authentication succeeds, while otherwise the login configuration cannot be resolved and the connection to ZooKeeper is unauthenticated.
> To integrate with Hadoop, the following in ZooKeeperSaslClient would need to change to make the name configurable: {{login = new Login("Client", new ClientCallbackHandler(null));}}
[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201139#comment-13201139 ]

Mahadev konar commented on ZOOKEEPER-1373:
------------------------------------------

@Eugene, the patch looks good, but we should work on cleaning up the security stuff a little. One thing would be to make ClientCnxn a little modular and not pass it around everywhere (like we do in ZKSaslClient). Anyway, that's for later. I'll go ahead and commit this for now.

> Hardcoded SASL login context name clashes with Hadoop security configuration override
> --------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1373
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: java client
>    Affects Versions: 3.4.2
>            Reporter: Thomas Weise
>            Assignee: Eugene Koontz
>             Fix For: 3.4.3, 3.5.0
>
>         Attachments: ZOOKEEPER-1373-TW_3_4.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch
[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201461#comment-13201461 ]

Mahadev konar commented on ZOOKEEPER-1373:
------------------------------------------

@Thomas, yes. The rc is up. Can you try it out: http://people.apache.org/~mahadev/zookeeper-3.4.3-candidate-0/

> Hardcoded SASL login context name clashes with Hadoop security configuration override
> --------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1373
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: java client
>    Affects Versions: 3.4.2
>            Reporter: Thomas Weise
>            Assignee: Eugene Koontz
>             Fix For: 3.4.3, 3.5.0
>
>         Attachments: ZOOKEEPER-1373-TW_3_4.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch
[jira] [Commented] (ZOOKEEPER-1322) Cleanup/fix logging in Quorum code.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201072#comment-13201072 ]

Mahadev konar commented on ZOOKEEPER-1322:
------------------------------------------

Pat, went through the patch. Looks harmless to me. Kicking off hudson to run through the patch again.

> Cleanup/fix logging in Quorum code.
> -----------------------------------
>
>                 Key: ZOOKEEPER-1322
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1322
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Patrick Hunt
>            Assignee: Patrick Hunt
>             Fix For: 3.4.3, 3.5.0
>
>         Attachments: ZOOKEEPER-1322_br34.patch, ZOOKEEPER-1322_trunk.patch
>
> While triaging ZOOKEEPER-1319 I updated the code with the attached patch in order to help debug what was going on with that issue. I think it would be useful to include these changes in the project itself. ff to include in 3.4.1 or push to 3.5.0. You should verify this with TRACE logging turned on in addition to INFO (default).
[jira] [Commented] (ZOOKEEPER-1353) C client test suite fails consistently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201110#comment-13201110 ]

Mahadev konar commented on ZOOKEEPER-1353:
------------------------------------------

Thanks for pointing this out (and also for the patch) Clint.

> C client test suite fails consistently
> --------------------------------------
>
>                 Key: ZOOKEEPER-1353
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1353
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: c client, tests
>    Affects Versions: 3.3.4
>         Environment: Ubuntu precise (dev release), amd64
>            Reporter: Clint Byrum
>            Assignee: Clint Byrum
>            Priority: Minor
>              Labels: patch, test
>             Fix For: 3.3.5, 3.4.3, 3.5.0
>
>         Attachments: fix-broken-c-client-unittest.patch, fix-broken-c-client-unittest.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When the C client test suite, zktest-mt, is run, it fails with this:
> tests/TestZookeeperInit.cc:233: Assertion: equality assertion failed [Expected: 2, Actual: 22]
> This was also reported in 3.3.1 here: http://www.mail-archive.com/zookeeper-dev@hadoop.apache.org/msg08914.html
> The C client tests are making some assumptions that are not valid. getaddrinfo may have, at one time, returned ENOENT instead of EINVAL for the host given in the test. The assertion should simply be that EINVAL or ENOENT is given, so that builds on platforms which return ENOENT for this are not broken.
[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197056#comment-13197056 ]

Mahadev konar commented on ZOOKEEPER-1367:
------------------------------------------

Great. Go ahead and upload. I'll commit it to the 3.3 branch.

> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1367
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.2
>         Environment: Debian Squeeze, 64-bit
>            Reporter: Jeremy Stribling
>            Assignee: Benjamin Reed
>            Priority: Blocker
>             Fix For: 3.3.5, 3.4.3, 3.5.0
>
>         Attachments: 1367-3.3.patch, ZOOKEEPER-1367-3.4.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz
>
> In one of our tests, we have a cluster of three ZooKeeper servers. We kill all three, and then restart just two of them. Sometimes we notice that on one of the restarted servers, ephemeral nodes from previous sessions do not get deleted, while on the other server they do. We are effectively running 3.4.2, though technically we are running 3.4.1 with the patch manually applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for ZOOKEEPER-1163.
> I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, zkid 84), I saw only one znode in a particular path:
> {quote}
> [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
> [nominee11]
> [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
> 90.0.0.222:
> cZxid = 0x40027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x40027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x40027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> {quote}
> However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), I saw three znodes under that same path:
> {quote}
> [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
> nominee06 nominee10 nominee11
> [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
> 90.0.0.222:
> cZxid = 0x40027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x40027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x40027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
> 90.0.0.221:
> cZxid = 0x3014c
> ctime = Thu Jan 19 07:53:42 UTC 2012
> mZxid = 0x3014c
> mtime = Thu Jan 19 07:53:42 UTC 2012
> pZxid = 0x3014c
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc22
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
> 90.0.0.223:
> cZxid = 0x20cab
> ctime = Thu Jan 19 08:00:30 UTC 2012
> mZxid = 0x20cab
> mtime = Thu Jan 19 08:00:30 UTC 2012
> pZxid = 0x20cab
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x5434f5074e040002
> dataLength = 16
> numChildren = 0
> {quote}
> These never went away for the lifetime of the server, for any clients connected directly to that server. Note that this cluster is configured to have all three servers still, the third one being down (90.0.0.223, zkid 162). I captured the data/snapshot directories for the two live servers.
> When I start single-node servers using each directory, I can briefly see that the inconsistent data is present in those logs, though the ephemeral nodes seem to get (correctly) cleaned up pretty soon after I start the server. I will upload a tar containing the debug logs and data directories from the failure. I think we can reproduce it regularly if you need more info.
[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196739#comment-13196739 ]

Mahadev konar commented on ZOOKEEPER-1367:
------------------------------------------

Thanks for confirming Jeremy. I'll check this in now. The patch looks good to me, though I think we need to clean up our classes so that we have cleaner separation on what ZKS should be exposing and what ZKDatabase should be exposing.

> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1367
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.2
>         Environment: Debian Squeeze, 64-bit
>            Reporter: Jeremy Stribling
>            Assignee: Benjamin Reed
>            Priority: Blocker
>             Fix For: 3.4.3
>
>         Attachments: 1367-3.3.patch, ZOOKEEPER-1367-3.4.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz
[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196748#comment-13196748 ]

Mahadev konar commented on ZOOKEEPER-1373:
------------------------------------------

I just hate the way review board updates the comments. Looking at the patch now.

> Hardcoded SASL login context name clashes with Hadoop security configuration override
> --------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1373
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: java client
>    Affects Versions: 3.4.2
>            Reporter: Thomas Weise
>            Assignee: Eugene Koontz
>             Fix For: 3.4.3, 3.5.0
>
>         Attachments: ZOOKEEPER-1373-TW_3_4.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch
[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196756#comment-13196756 ]

Mahadev konar commented on ZOOKEEPER-1373:
------------------------------------------

Took a look at the patch. It looks good overall; I like the new test cases. Some minor nits: I think the ClientCnxn code needs to move out a little (ClientCnxn is getting too huge). Can we do a helper class for security? Something like ZooKeeperSecureUtil where all this code can reside (creating a zk sasl client?). Also it's a little painful to see all the config property names spread around. This is probably another jira where we move all the properties into a single place so that we don't have to go hunting around for our config properties.

> Hardcoded SASL login context name clashes with Hadoop security configuration override
> --------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1373
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: java client
>    Affects Versions: 3.4.2
>            Reporter: Thomas Weise
>            Assignee: Eugene Koontz
>             Fix For: 3.4.3, 3.5.0
>
>         Attachments: ZOOKEEPER-1373-TW_3_4.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch
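The clash in this issue is between the JAAS section name the ZooKeeper client looks up ("Client") and the one Hadoop security registers ("hadoop-keytab-kerberos"). For reference, the stock non-Hadoop setup the description mentions looks roughly like the fragment below; the keytab path and principal are placeholders, not values from the issue:

```
/* Hypothetical jaas.conf, selected via
   -Djava.security.auth.login.config=/path/to/jaas.conf.
   The section name "Client" is what ZooKeeperSaslClient resolves by default;
   under Hadoop security the registered section is "hadoop-keytab-kerberos",
   which is why the hardcoded name fails to resolve. */
Client {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="/path/to/zkclient.keytab"
    principal="zkclient@EXAMPLE.COM";
};
```

Making the context name configurable lets the client look up whichever section the surrounding framework actually installed.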
[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195026#comment-13195026 ]

Mahadev konar commented on ZOOKEEPER-1366:
------------------------------------------

Pat/Ben, I think the issue here is the API with Clock. The static API is what ruins mocking. In Hadoop we make sure we pass around the same clock object when creating all the subsequent objects (the constructs in MR next gen are more DI compliant). We could try doing that here, but again I think it's a bit of an effort (it would be manual work). But as Henry/Camille mentioned, we could do that in another jira. I think that's the right solution, instead of creating another layer which hides the longs (as Pat suggested).

> Zookeeper should be tolerant of clock adjustments
> -------------------------------------------------
>
>                 Key: ZOOKEEPER-1366
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 3.5.0
>
>         Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch
>
> If you want to wreak havoc on a ZK based system just do {{date -s +1hour}} and watch the mayhem as all sessions expire at once. This shouldn't happen. Zookeeper could easily handle elapsed times as elapsed times rather than as differences between absolute times. The absolute times are subject to adjustment when the clock is set, while a timer is not subject to this problem. In Java, System.currentTimeMillis() gives you absolute time while System.nanoTime() gives you time based on a timer from an arbitrary epoch. I have done this and have been running tests now for some tens of minutes with no failures. I will set up a test machine to redo the build again on Ubuntu and post a patch here for discussion.
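The change Ted describes, deriving intervals from a monotonic timer instead of wall-clock differences, can be sketched as follows. The class and method names are hypothetical, not the patch's actual code:

```java
public class ElapsedDemo {
    // Elapsed time measured against System.nanoTime(), which is monotonic:
    // it comes from a timer with an arbitrary epoch, so setting the system
    // clock (e.g. `date -s +1hour`) does not affect the result.
    static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000L;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(50);
        // Roughly 50ms, regardless of any wall-clock adjustment in between.
        System.out.println("elapsed ~" + elapsedMillis(start) + " ms");
    }
}
```

Contrast with `System.currentTimeMillis() - startMillis`, which jumps by a full hour under the `date -s +1hour` scenario in the description, expiring every session at once.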
[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13195373#comment-13195373 ] Mahadev konar commented on ZOOKEEPER-1367: -- From https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/928//testReport/ {code} org.apache.zookeeper.server.quorum.LearnerTest.syncTest Failing for the past 1 build (Since #928 ) Took 74 ms. Stacktrace java.lang.NullPointerException at org.apache.zookeeper.server.quorum.LearnerZooKeeperServer.createSessionTracker(LearnerZooKeeperServer.java:73) at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:355) at org.apache.zookeeper.server.quorum.LearnerTest.syncTest(LearnerTest.java:114) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {code} Data inconsistencies and unexpired ephemeral nodes after cluster restart Key: ZOOKEEPER-1367 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.2 Environment: Debian Squeeze, 64-bit Reporter: Jeremy Stribling Priority: Blocker Fix For: 3.4.3 Attachments: ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz In one of our tests, we have a cluster of three ZooKeeper servers. We kill all three, and then restart just two of them. Sometimes we notice that on one of the restarted servers, ephemeral nodes from previous sessions do not get deleted, while on the other server they do. We are effectively running 3.4.2, though technically we are running 3.4.1 with the patch manually applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for ZOOKEEPER-1163. 
I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, zkid 84), I saw only one znode in a particular path:
{quote}
[zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
[nominee11]
[zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
90.0.0.222:
cZxid = 0x40027
ctime = Thu Jan 19 08:18:24 UTC 2012
mZxid = 0x40027
mtime = Thu Jan 19 08:18:24 UTC 2012
pZxid = 0x40027
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0xa234f4f3bc220001
dataLength = 16
numChildren = 0
{quote}
However, when I connected zkCli.sh to the second server (90.0.0.222, zkid 251), I saw three znodes under that same path:
{quote}
[zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
nominee06 nominee10 nominee11
[zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
90.0.0.222:
cZxid = 0x40027
ctime = Thu Jan 19 08:18:24 UTC 2012
mZxid = 0x40027
mtime = Thu Jan 19 08:18:24 UTC 2012
pZxid = 0x40027
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0xa234f4f3bc220001
dataLength = 16
numChildren = 0
[zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
90.0.0.221:
cZxid = 0x3014c
ctime = Thu Jan 19 07:53:42 UTC 2012
mZxid = 0x3014c
mtime = Thu Jan 19 07:53:42 UTC 2012
pZxid = 0x3014c
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0xa234f4f3bc22
dataLength = 16
numChildren = 0
[zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
90.0.0.223:
cZxid = 0x20cab
ctime = Thu Jan 19 08:00:30 UTC 2012
mZxid = 0x20cab
mtime = Thu Jan 19 08:00:30 UTC 2012
pZxid = 0x20cab
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x5434f5074e040002
dataLength = 16
numChildren = 0
{quote}
These znodes never went away for the lifetime of the server, for any clients connected directly to that server. Note that this cluster is still configured to have all three servers, the third one being down (90.0.0.223, zkid 162). I captured the data/snapshot directories for the two live servers.
When I start single-node servers using each directory, I can briefly see that the inconsistent data is present in those logs, though the ephemeral nodes seem to get (correctly) cleaned up soon after I start the server. I will upload a tarball containing the debug logs and data directories from the failure. I think we can reproduce it regularly if you need more info.
[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195375#comment-13195375 ]

Mahadev konar commented on ZOOKEEPER-1367:
--
@Ben/Jeremy, I'll kick off a 3.4.3 release with this patch and ZOOKEEPER-1373.
[jira] [Commented] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193576#comment-13193576 ]

Mahadev konar commented on ZOOKEEPER-1355:
--
Ben, I was taking a look at it. Mind waiting till tomorrow?

Add zk.updateServerList(newServerList)
--------------------------------------
Key: ZOOKEEPER-1355
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
Project: ZooKeeper
Issue Type: New Feature
Components: java client
Reporter: Alexander Shraer
Assignee: Alexander Shraer
Fix For: 3.5.0
Attachments: ZOOKEEPER-1355-ver2.patch, ZOOKEEPER-1355-ver4.patch, ZOOKEEPER-1355-ver5.patch, ZOOKEEPER=1355-ver3.patch, ZOOOKEEPER-1355-test.patch, ZOOOKEEPER-1355-ver1.patch, ZOOOKEEPER-1355.patch, loadbalancing-more-details.pdf, loadbalancing.pdf

When the set of servers changes, we would like to update the server list stored by clients without restarting the clients. Moreover, assuming that the number of clients per server is the same (in expectation) in the old configuration (as guaranteed by the current list shuffling, for example), we would like to re-balance client connections across the new set of servers in a way that (a) the number of clients per server is the same for all servers (in expectation) and (b) there is no excessive/unnecessary client migration. It is simple to achieve (a) without (b) - just re-shuffle the new list of servers at every client. But this would create unnecessary migration, which we'd like to avoid. We propose a simple probabilistic migration scheme that achieves (a) and (b) - each client locally decides whether and where to migrate when the list of servers changes. The attached document describes the scheme and shows an evaluation of it in ZooKeeper. We also implemented re-balancing through a consistent-hashing scheme and show a comparison. We derived the probabilistic migration rules from a simple formula that we can also provide, if someone's interested in the proof.
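The probabilistic migration idea in the description can be illustrated for the simplest special case, where servers are only added and none are removed. This is a hedged sketch, not the actual ZOOKEEPER-1355 patch; the method name pickServer and the exact rule are illustrative assumptions (the attached loadbalancing.pdf covers the general case):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Sketch of probabilistic re-balancing when servers are only ADDED.
// Each client keeps its current server with probability oldCount/newCount,
// otherwise it migrates to a uniformly chosen newly added server. In
// expectation every server ends up with the same number of clients, and no
// client migrates between two servers that existed before the change.
public class RebalanceSketch {
    static String pickServer(String current, List<String> oldServers,
                             List<String> addedServers, Random rnd) {
        int oldCount = oldServers.size();
        int newCount = oldCount + addedServers.size();
        // Stay put with probability oldCount/newCount.
        if (rnd.nextDouble() < (double) oldCount / newCount) {
            return current;
        }
        // Otherwise migrate to one of the added servers, chosen uniformly.
        return addedServers.get(rnd.nextInt(addedServers.size()));
    }

    public static void main(String[] args) {
        List<String> oldServers = Arrays.asList("s1", "s2", "s3");
        List<String> added = Arrays.asList("s4");
        Random rnd = new Random(42);
        int moved = 0, trials = 100_000;
        for (int i = 0; i < trials; i++) {
            if (!pickServer("s1", oldServers, added, rnd).equals("s1")) moved++;
        }
        // Going from 3 to 4 servers, roughly a quarter of clients migrate,
        // which is exactly the fraction needed to equalize expected load.
        System.out.println(moved);
    }
}
```

Re-shuffling the whole list at every client would instead migrate roughly three quarters of clients in this scenario, which is the "excessive migration" the description wants to avoid.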
[jira] [Commented] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193583#comment-13193583 ]

Mahadev konar commented on ZOOKEEPER-1355:
--
Ben/Alex, this adds two public APIs to the ZooKeeper handle (Java). Is this intended? What is the intent of getCurrentHost? Also, I looked at the pdf (which scares me a little - I hate looking at all the math symbols :)). Can you please explain in layman's terms what the process is for the client to select the server to connect to? What if the server list is incorrect - what happens then?
[jira] [Commented] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193586#comment-13193586 ]

Mahadev konar commented on ZOOKEEPER-1355:
--
One more thing: what about the C client? Will we be seeing similar changes to the C client? I'd very much like to keep both of them in sync if possible. We are already a little different given the security patches.
[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191384#comment-13191384 ]

Mahadev konar commented on ZOOKEEPER-1366:
--
@Ted, seems like a good change; only one issue I see here. I'd like this to go into trunk and not into 3.4, unless it's really a bug. I think 3.4 will take some time to stabilize, and I would really like to avoid big changes in 3.4. Thoughts?

Zookeeper should be tolerant of clock adjustments
-------------------------------------------------
Key: ZOOKEEPER-1366
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
Project: ZooKeeper
Issue Type: Bug
Reporter: Ted Dunning
Assignee: Ted Dunning
Fix For: 3.4.3
Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch

If you want to wreak havoc on a ZK-based system, just run {{date -s +1hour}} and watch the mayhem as all sessions expire at once. This shouldn't happen. ZooKeeper could easily handle elapsed times as elapsed times rather than as differences between absolute times. Absolute times are subject to adjustment when the clock is set, while a timer is not subject to this problem. In Java, System.currentTimeMillis() gives you absolute time, while System.nanoTime() gives you time based on a timer from an arbitrary epoch. I have done this and have been running tests now for some tens of minutes with no failures. I will set up a test machine to redo the build again on Ubuntu and post a patch here for discussion.
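The currentTimeMillis-vs-nanoTime distinction Ted describes can be shown in a few lines. This is an illustrative sketch, not code from the attached patches; elapsedMillis is a hypothetical helper:

```java
// Measuring an interval with System.nanoTime(), which is immune to
// wall-clock adjustments, instead of differencing System.currentTimeMillis()
// readings, which jump when an operator runs `date -s +1hour`.
public class ElapsedTime {
    // Elapsed milliseconds between two nanoTime() readings.
    static long elapsedMillis(long startNanos, long endNanos) {
        return (endNanos - startNanos) / 1_000_000L;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(50); // simulate work; a clock change here is harmless
        long elapsed = elapsedMillis(start, System.nanoTime());
        // With nanoTime the elapsed time stays around 50 ms even if the wall
        // clock moves; a currentTimeMillis difference could jump by an hour,
        // which is what expires all sessions at once.
        System.out.println(elapsed);
    }
}
```

Note that nanoTime() readings are only meaningful as differences within one JVM; they are measured from an arbitrary epoch, exactly as the description says.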
[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191858#comment-13191858 ]

Mahadev konar commented on ZOOKEEPER-1373:
--
This is a bug. We should fix it to make the login context name configurable.

Hardcoded SASL login context name clashes with Hadoop security configuration override
--------------------------------------------------------------------------------------
Key: ZOOKEEPER-1373
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
Project: ZooKeeper
Issue Type: Bug
Components: java client
Affects Versions: 3.4.2
Reporter: Thomas Weise
Fix For: 3.4.3

I'm trying to configure a process with Hadoop security (the Hive metastore server) to talk to ZooKeeper 3.4.2 with Kerberos authentication. In this scenario Hadoop controls the SASL configuration (org.apache.hadoop.security.UserGroupInformation.HadoopConfiguration) instead of setting up the ZooKeeper "Client" login context via jaas.conf and the system property {{-Djava.security.auth.login.config}}. Using the Hadoop configuration would work, except that the ZooKeeper client code expects the loginContextName to be "Client", while Hadoop security will use "hadoop-keytab-kerberos". I verified that by changing the name in the debugger the SASL authentication succeeds, while otherwise the login configuration cannot be resolved and the connection to ZooKeeper is unauthenticated. To integrate with Hadoop, the following in ZooKeeperSaslClient would need to change to make the name configurable: {{login = new Login("Client", new ClientCallbackHandler(null));}}
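One plausible shape for the configurability Thomas asks for is to read the JAAS context name from a system property, falling back to "Client". This is a sketch under that assumption - the property name zookeeper.sasl.clientconfig and the class below are illustrative, not necessarily what the committed fix uses:

```java
// Sketch: resolve the SASL login context name from a system property so a
// Hadoop-controlled process can override the hardcoded "Client" default.
// The property name "zookeeper.sasl.clientconfig" is an assumption here.
public class SaslContextName {
    static String loginContextName() {
        return System.getProperty("zookeeper.sasl.clientconfig", "Client");
    }

    public static void main(String[] args) {
        // Default: falls back to "Client" when the property is unset,
        // preserving the existing jaas.conf behavior.
        System.out.println(loginContextName());
        // A process under Hadoop security could instead launch with
        // -Dzookeeper.sasl.clientconfig=hadoop-keytab-kerberos
        System.setProperty("zookeeper.sasl.clientconfig", "hadoop-keytab-kerberos");
        System.out.println(loginContextName());
    }
}
```

The call site in ZooKeeperSaslClient would then become something like {{login = new Login(loginContextName(), new ClientCallbackHandler(null));}} rather than hardcoding the name.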
[jira] [Commented] (ZOOKEEPER-1302) patch to create rpm/deb on 3.3 branch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188021#comment-13188021 ]

Mahadev konar commented on ZOOKEEPER-1302:
--
Thanks Giri. It might be useful for folks on the 3.3 branch, but as Pat mentioned, given that the patch is big, we'll have to skip it for 3.3.

patch to create rpm/deb on 3.3 branch
-------------------------------------
Key: ZOOKEEPER-1302
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1302
Project: ZooKeeper
Issue Type: Improvement
Components: build
Affects Versions: 3.3.3
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
Attachments: ZOOKEEPER-999-3.3-with-setupscript-3.patch, zk-1302-1.patch, zk-1302.patch

Backport the ZOOKEEPER-999 patch to the 3.3 branch and add zookeeper-setup-conf.sh to enable ZK quorum setup.
[jira] [Commented] (ZOOKEEPER-1333) NPE in FileTxnSnapLog when restarting a cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174256#comment-13174256 ]

Mahadev konar commented on ZOOKEEPER-1333:
--
@Camille, agreed. I think the patch as it stands is good to go. The only concern I have is that the code in processTransaction is pretty convoluted. We should work on making it cleaner. I'll add some comments for now when committing.

NPE in FileTxnSnapLog when restarting a cluster
-----------------------------------------------
Key: ZOOKEEPER-1333
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1333
Project: ZooKeeper
Issue Type: Bug
Components: server
Affects Versions: 3.4.0
Reporter: Andrew McNair
Assignee: Patrick Hunt
Priority: Blocker
Fix For: 3.4.2
Attachments: ZOOKEEPER-1333.patch, ZOOKEEPER-1333.patch, test_case.diff, test_case.diff

I think an NPE was introduced by the fix for https://issues.apache.org/jira/browse/ZOOKEEPER-1269. Looking at DataTree.processTxn(TxnHeader header, Record txn), it seems likely that if rc.err != Code.OK then rc.path will be null. I'm currently working on a minimal test case for the bug; I'll attach it to this issue when it's ready.
{noformat}
java.lang.NullPointerException
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:203)
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:150)
        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:418)
        at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:410)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
{noformat}
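The failure mode here - dereferencing rc.path when rc.err != Code.OK - can be sketched with a minimal null guard. The classes below are simplified stand-ins for ZooKeeper's ProcessTxnResult, not the actual patch:

```java
// Sketch of the NPE scenario: when a transaction result carries an error
// code, its path field may be null, so consumers must check the error code
// before dereferencing the path.
public class TxnResultGuard {
    static final int OK = 0;

    // Simplified stand-in for org.apache.zookeeper.server.DataTree.ProcessTxnResult.
    static class ProcessTxnResult {
        int err;
        String path; // may be null when err != OK
    }

    // Returns the parent path only for successful results; guards the
    // rc.path.lastIndexOf('/') call that would otherwise throw an NPE.
    static String parentPath(ProcessTxnResult rc) {
        if (rc.err != OK || rc.path == null) {
            return null; // error case: nothing to dereference
        }
        int idx = rc.path.lastIndexOf('/');
        return idx <= 0 ? "/" : rc.path.substring(0, idx);
    }

    public static void main(String[] args) {
        ProcessTxnResult bad = new ProcessTxnResult();
        bad.err = -110; // e.g. NODEEXISTS: path is left null
        System.out.println(parentPath(bad)); // null, not an NPE

        ProcessTxnResult good = new ProcessTxnResult();
        good.err = OK;
        good.path = "/election/zkrsm/nominee11";
        System.out.println(parentPath(good)); // /election/zkrsm
    }
}
```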
[jira] [Commented] (ZOOKEEPER-1319) Missing data after restarting+expanding a cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13166420#comment-13166420 ]

Mahadev konar commented on ZOOKEEPER-1319:
--
I am going ahead and checking in Pat's patch. I have opened ZOOKEEPER-1324 to track the duplicate NEWLEADER packets. Just being paranoid here and making minimal changes for the RC.

Missing data after restarting+expanding a cluster
-------------------------------------------------
Key: ZOOKEEPER-1319
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1319
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.4.0
Environment: Linux (Debian Squeeze)
Reporter: Jeremy Stribling
Assignee: Patrick Hunt
Priority: Blocker
Labels: cluster, data
Fix For: 3.5.0, 3.4.1
Attachments: ZOOKEEPER-1319.patch, ZOOKEEPER-1319.patch, ZOOKEEPER-1319_trunk.patch, ZOOKEEPER-1319_trunk2.patch, logs.tgz

I've been trying to update to ZK 3.4.0 and have had some issues where some data become inaccessible after adding a node to a cluster. My use case is a bit strange (as explained before on this list) in that I try to grow the cluster dynamically by having an external program automatically restart ZooKeeper servers in a controlled way whenever the list of participating ZK servers needs to change. This used to work just fine in 3.3.3 (and before), so this represents a regression. The scenario I see is this:
1) Start up a 1-server ZK cluster (the server has ZK ID 0).
2) A client connects to the server and makes a bunch of znodes, in particular a znode called /membership.
3) Shut down the cluster.
4) Bring up a 2-server ZK cluster, including the original server 0 with its existing data, and a new server with ZK ID 1.
5) Node 0 has the highest zxid and is elected leader.
6) A client connecting to server 1 tries to get /membership and gets back a -101 error code (no such znode).
7) The same client then tries to create /membership and gets back a -110 error code (znode already exists).
8) Clients connecting to server 0 can successfully get /membership.
I will attach a tarball with debug logs for both servers, annotating where steps #1 and #4 happen. You can see that the election involves a proposal for zxid 110 from server 0, but immediately following the election server 1 has these lines:
{noformat}
2011-12-05 17:18:48,308 9299 [QuorumPeer[myid=1]/127.0.0.1:2901] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x10001 expected 0x1
2011-12-05 17:18:48,313 9304 [SyncThread:1] INFO org.apache.zookeeper.server.persistence.FileTxnLog - Creating new log file: log.10001
{noformat}
Perhaps that's not relevant, but it struck me as odd. At the end of server 1's log you can see a repeated cycle of getData-create-getData as the client tries to make sense of the inconsistent responses. The other piece of information is that if I try to use the on-disk directories for either of the servers to start a new one-node ZK cluster, all the data are accessible. I haven't tried writing a program outside of my application to reproduce this, but I can do it very easily with some of my app's tests if anyone needs more information.
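For context on zxid values like the ones in the log lines above: ZooKeeper composes a zxid from a leader epoch in the high 32 bits and a per-epoch counter in the low 32 bits, which is why a new leader election bumps zxids to a new range. A small sketch of that packing (ZxidDemo is an illustrative class name, not ZooKeeper's ZxidUtils):

```java
// Compose and decompose a zxid: leader epoch in the high 32 bits,
// per-epoch transaction counter in the low 32 bits.
public class ZxidDemo {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }
    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xffffffffL; }

    public static void main(String[] args) {
        long zxid = makeZxid(1, 1); // first transaction of epoch 1
        System.out.println(Long.toHexString(zxid)); // 100000001
        System.out.println(epochOf(zxid));          // 1
        System.out.println(counterOf(zxid));        // 1
    }
}
```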
[jira] [Commented] (ZOOKEEPER-1319) Missing data after restarting+expanding a cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165465#comment-13165465 ]

Mahadev konar commented on ZOOKEEPER-1319:
--
I am more inclined towards what Flavio mentioned above. To reduce the number of changes, I think it's best we don't remove the duplicate NEWLEADER. Ben, any thoughts?
[jira] [Commented] (ZOOKEEPER-442) need a way to remove watches that are no longer of interest
[ https://issues.apache.org/jira/browse/ZOOKEEPER-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164886#comment-13164886 ]

Mahadev konar commented on ZOOKEEPER-442:
--
@Ben, can you please take a look at this patch?

need a way to remove watches that are no longer of interest
------------------------------------------------------------
Key: ZOOKEEPER-442
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-442
Project: ZooKeeper
Issue Type: Improvement
Reporter: Benjamin Reed
Assignee: Daniel Gómez Ferro
Priority: Critical
Fix For: 3.5.0
Attachments: ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch

Currently the only way a watch is cleared is to trigger it. We need a way to enumerate the outstanding watch objects, find which events the objects are watching for, and remove interest in an event.
[jira] [Commented] (ZOOKEEPER-1319) Missing data after restarting+expanding a cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164965#comment-13164965 ]

Mahadev konar commented on ZOOKEEPER-1319:
--
Flavio/Pat, Ben and I had a long discussion on this. Here is the gist: there are two NEWLEADER packets, one added when the Leader has just become a leader and one added in startForwarding, as Flavio mentioned above. We need to skip adding the first one (the one in Leader.lead()) to the queue of packets to send to the follower. Flavio is right above that if we skip the adding of NEWLEADER in startForwarding we are good. We need to send the NEWLEADER packet in LearnerHandler (line 390) because that marks the end of all syncing-up transactions from the Leader to the follower. Ben has an updated patch and will update the jira soon tonight.
[jira] [Commented] (ZOOKEEPER-866) Adding no disk persistence option in zookeeper.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162856#comment-13162856 ] Mahadev konar commented on ZOOKEEPER-866: - @peter, I didn't. What I found was that the throughput when writing to disk was as good as the throughput with no persistence, so I didn't bother getting this in. Adding no disk persistence option in zookeeper. --- Key: ZOOKEEPER-866 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-866 Project: ZooKeeper Issue Type: New Feature Reporter: Mahadev konar Assignee: Mahadev konar Fix For: 3.5.0 Attachments: ZOOKEEPER-nodisk.patch It's been seen that some folks would like to use zookeeper for very fine grained locking. Also, in their use case they are fine with losing all old zookeeper state if they reboot zookeeper or zookeeper goes down. The use case is more of a runtime locking wherein forgetting the state of locks is acceptable in case of a zookeeper reboot. Not logging to disk allows high throughput and low latency on writes to zookeeper. This would be a configuration option to set (of course the default would be logging to disk). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
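The durable-vs-volatile throughput comparison Mahadev mentions can be approximated with a toy benchmark: append small records to a log file with and without an fsync per record. This is only a sketch of the measurement idea, not the test that was actually run, and absolute numbers depend heavily on the filesystem and disk.

```python
import os
import tempfile
import time

def write_records(path, n, fsync):
    """Append n small records to path, optionally fsyncing after each one,
    and return the elapsed wall-clock seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    start = time.monotonic()
    try:
        for i in range(n):
            os.write(fd, ("txn-%d\n" % i).encode())
            if fsync:
                os.fsync(fd)  # force the record to stable storage, like a txn log
    finally:
        os.close(fd)
    return time.monotonic() - start

with tempfile.TemporaryDirectory() as tmp:
    durable = write_records(os.path.join(tmp, "log-durable"), 200, fsync=True)
    volatile = write_records(os.path.join(tmp, "log-volatile"), 200, fsync=False)
    print("with fsync: %.4fs  without fsync: %.4fs" % (durable, volatile))
```

On most setups the fsync-per-record run is markedly slower; Mahadev's observation was that with group commit the gap in practice was small enough not to justify the feature.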
[jira] [Commented] (ZOOKEEPER-1312) Add a getChildrenWithStat operation
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13160494#comment-13160494 ] Mahadev konar commented on ZOOKEEPER-1312: -- Agree. Would be very useful! Add a getChildrenWithStat operation - Key: ZOOKEEPER-1312 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1312 Project: ZooKeeper Issue Type: New Feature Reporter: Daniel Lord It would be extremely useful to be able to have a getChildrenWithStat method. This method would behave exactly the same as getChildren but in addition to returning the list of all child znode names it would also return a Stat for each child. I'm sure there are quite a few use cases for this but it could save a lot of extra reads for my application. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
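The savings Daniel describes, one round trip instead of 1 + N, can be sketched with a toy in-memory client. `FakeZk`, `get_children_with_stat`, and the integer "stats" are all illustrative stand-ins, not the real ZooKeeper API:

```python
class FakeZk:
    """Toy in-memory stand-in for a ZooKeeper client that counts round trips.
    Stats are plain integers here; the real call would return Stat structures."""
    def __init__(self, tree):
        self.tree = tree          # path -> stat (an integer version, for brevity)
        self.round_trips = 0

    def _children(self, path):
        prefix = path.rstrip("/") + "/"
        return {p[len(prefix):]: s for p, s in self.tree.items()
                if p.startswith(prefix) and "/" not in p[len(prefix):]}

    def get_children(self, path):
        self.round_trips += 1
        return sorted(self._children(path))

    def exists(self, path):
        self.round_trips += 1
        return self.tree.get(path)

    def get_children_with_stat(self, path):
        """The proposed combined op: names and stats in one round trip."""
        self.round_trips += 1
        return self._children(path)

zk = FakeZk({"/app/a": 1, "/app/b": 2, "/app/c": 3})
# Today: list the children, then one exists() per child to fetch each Stat.
stats = {c: zk.exists("/app/" + c) for c in zk.get_children("/app")}
print(zk.round_trips)  # 4 round trips for 3 children

zk2 = FakeZk(zk.tree)
stats2 = zk2.get_children_with_stat("/app")
print(zk2.round_trips)  # 1 round trip
```

The per-child reads in the first pattern are also racy (a child can be deleted between the list and the exists call), which is another argument for a single combined operation.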
[jira] [Commented] (BOOKKEEPER-31) Need a project logo
[ https://issues.apache.org/jira/browse/BOOKKEEPER-31?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13155481#comment-13155481 ] Mahadev konar commented on BOOKKEEPER-31: - @Ben, Nice one. I like it. Flavio, are you trying to scare people with black background ppts? :) Need a project logo --- Key: BOOKKEEPER-31 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-31 Project: Bookkeeper Issue Type: Improvement Reporter: Benjamin Reed Assignee: Benjamin Reed Attachments: bk_1.jpg, bk_2.jpg, bk_3.jpg, bk_4.jpg, bookeper_black_sm.png, bookeper_white_sm.png We need a logo for the project, something that looks good in the big and the small and is easily recognizable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1301) backport patches related to the zk startup script from 3.4 to 3.3 release
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151671#comment-13151671 ] Mahadev konar commented on ZOOKEEPER-1301: -- Looking at the patch, I think we should do this: 3) looks fine to me (Giri, can you just add an echo statement as Roman mentioned) 1) Giri already fixed. 2) let's revert 4) let's revert backport patches related to the zk startup script from 3.4 to 3.3 release -- Key: ZOOKEEPER-1301 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1301 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.3.4 Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Attachments: zookeeper-1301-1.patch, zookeeper-1301.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1301) backport patches related to the zk startup script from 3.4 to 3.3 release
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151699#comment-13151699 ] Mahadev konar commented on ZOOKEEPER-1301: -- looks good. +1 on the patch. backport patches related to the zk startup script from 3.4 to 3.3 release -- Key: ZOOKEEPER-1301 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1301 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.3.4 Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Attachments: zookeeper-1301-1.patch, zookeeper-1301-2.patch, zookeeper-1301.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1239) add logging/stats to identify fsync stalls
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150658#comment-13150658 ] Mahadev konar commented on ZOOKEEPER-1239: -- Camille, Can you please commit this to 3.4 branch as well? thanks! add logging/stats to identify fsync stalls -- Key: ZOOKEEPER-1239 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1239 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1239_br33.patch, ZOOKEEPER-1239_br34.patch We don't have any logging to identify fsync stalls. It's a somewhat common occurrence (after gc/swap issues) when trying to diagnose pipeline stalls - where outstanding requests start piling up and operational latency increases. We should have some sort of logging around this. e.g. if the fsync time exceeds some limit then log a warning, something like that. It would also be useful to publish stat information related to this. min/avg/max latency for fsync. This should also be exposed through JMX. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
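What the ticket asks for can be sketched as a timed fsync wrapper that keeps min/avg/max latency and warns when a single fsync stalls. The threshold name and default below are placeholders, not what the actual patch uses:

```python
import logging
import os
import tempfile
import time

FSYNC_WARN_THRESHOLD_MS = 1000  # placeholder knob; the real name/default is up to the patch

class FsyncStats:
    """Track min/avg/max fsync latency and warn on stalls, as the ticket suggests."""
    def __init__(self):
        self.count = 0
        self.total_ms = 0.0
        self.min_ms = float("inf")
        self.max_ms = 0.0

    def timed_fsync(self, fd):
        start = time.monotonic()
        os.fsync(fd)
        elapsed_ms = (time.monotonic() - start) * 1000.0
        self.count += 1
        self.total_ms += elapsed_ms
        self.min_ms = min(self.min_ms, elapsed_ms)
        self.max_ms = max(self.max_ms, elapsed_ms)
        if elapsed_ms > FSYNC_WARN_THRESHOLD_MS:
            # the stall warning the ticket proposes
            logging.warning("fsync took %.0f ms, above the %d ms threshold",
                            elapsed_ms, FSYNC_WARN_THRESHOLD_MS)
        return elapsed_ms

    @property
    def avg_ms(self):
        return self.total_ms / self.count if self.count else 0.0

stats = FsyncStats()
with tempfile.NamedTemporaryFile() as f:
    f.write(b"snapshot data")
    f.flush()
    stats.timed_fsync(f.fileno())
    stats.timed_fsync(f.fileno())
print("fsync min/avg/max: %.3f/%.3f/%.3f ms" % (stats.min_ms, stats.avg_ms, stats.max_ms))
```

The min/avg/max fields are exactly the counters one would surface through the stat commands and JMX.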
[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149837#comment-13149837 ] Mahadev konar commented on ZOOKEEPER-1208: -- Sorry, I meant ZOOKEEPER-1239. Ephemeral node not removed after the client session is long gone Key: ZOOKEEPER-1208 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1208 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.3.3 Reporter: kishore gopalakrishna Assignee: Patrick Hunt Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br34.patch, ZOOKEEPER-1208_trunk.patch Copying from email thread. We found our ZK server in a state where an ephemeral node still exists after a client session is long gone. I used the cons command on each ZK host to list all connections and couldn't find the ephemeralOwner id. We are using ZK 3.3.3. Has anyone seen this problem? I got the following information from the logs. The node that still exists is /kafka-tracking/consumers/UserPerformanceEvent-host/owners/UserPerformanceEvent/529-7 I saw that the ephemeral owner is 86167322861045079 which is session id 0x13220b93e610550. After searching the transaction log of one of the ZK servers, I found that the session expired: 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 closeSession null On digging further into the logs I found that there were multiple sessions created in quick succession and every session tried to create the same node. 
But i verified that the sessions were closed and opened in order 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x0 zxid 0x601bd36b5 createSession 6000 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 closeSession null 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x0 zxid 0x601bd36f8 createSession 6000 9/22/11 12:17:59 PM PDT session 0x13220b93e610551 cxid 0x74 zxid 0x601bd373a closeSession null 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x0 zxid 0x601bd373e createSession 6000 9/22/11 12:18:01 PM PDT session 0x13220b93e610552 cxid 0x6c zxid 0x601bd37a0 closeSession null 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x0 zxid 0x601bd37e9 createSession 6000 9/22/11 12:18:03 PM PDT session 0x13220b93e610553 cxid 0x74 zxid 0x601bd382b closeSession null 9/22/11 12:18:04 PM PDT session 0x13220b93e610554 cxid 0x0 zxid 0x601bd383c createSession 6000 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x6a zxid 0x601bd388f closeSession null 9/22/11 12:18:06 PM PDT session 0x13220b93e610555 cxid 0x0 zxid 0x601bd3895 createSession 6000 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x6a zxid 0x601bd38cd closeSession null 9/22/11 12:18:10 PM PDT session 0x13220b93e610556 cxid 0x0 zxid 0x601bd38d1 createSession 6000 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x0 zxid 0x601bd38f2 createSession 6000 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x51 zxid 0x601bd396a closeSession null Here is the log output for the sessions that tried creating the same node 9/22/11 12:17:54 PM PDT session 0x13220b93e61054f cxid 0x42 zxid 0x601bd366b create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x42 zxid 0x601bd36ce create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x42 zxid 0x601bd3711 create 
'/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x42 zxid 0x601bd3777 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x42 zxid 0x601bd3802 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x44 zxid 0x601bd385d create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x44 zxid 0x601bd38b0 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x52 zxid 0x601bd396b create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 Let me know if you need additional information. -- This message is automatically generated by JIRA. If you
[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13148335#comment-13148335 ] Mahadev konar commented on ZOOKEEPER-1208: -- Sorry for being out of action (blame hadoop world :)). Looks like you found it Pat. About the testcase, I am not sure about the session id being 0. How is it tracking that the same session is being closed and a create on the same session is being sent? Ephemeral node not removed after the client session is long gone Key: ZOOKEEPER-1208 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1208 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.3.3 Reporter: kishore gopalakrishna Assignee: Patrick Hunt Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br33.patch Copying from email thread. We found our ZK server in a state where an ephemeral node still exists after a client session is long gone. I used the cons command on each ZK host to list all connections and couldn't find the ephemeralOwner id. We are using ZK 3.3.3. Has anyone seen this problem? I got the following information from the logs. The node that still exists is /kafka-tracking/consumers/UserPerformanceEvent-host/owners/UserPerformanceEvent/529-7 I saw that the ephemeral owner is 86167322861045079 which is session id 0x13220b93e610550. After searching the transaction log of one of the ZK servers, I found that the session expired: 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 closeSession null On digging further into the logs I found that there were multiple sessions created in quick succession and every session tried to create the same node. 
But i verified that the sessions were closed and opened in order 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x0 zxid 0x601bd36b5 createSession 6000 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 closeSession null 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x0 zxid 0x601bd36f8 createSession 6000 9/22/11 12:17:59 PM PDT session 0x13220b93e610551 cxid 0x74 zxid 0x601bd373a closeSession null 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x0 zxid 0x601bd373e createSession 6000 9/22/11 12:18:01 PM PDT session 0x13220b93e610552 cxid 0x6c zxid 0x601bd37a0 closeSession null 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x0 zxid 0x601bd37e9 createSession 6000 9/22/11 12:18:03 PM PDT session 0x13220b93e610553 cxid 0x74 zxid 0x601bd382b closeSession null 9/22/11 12:18:04 PM PDT session 0x13220b93e610554 cxid 0x0 zxid 0x601bd383c createSession 6000 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x6a zxid 0x601bd388f closeSession null 9/22/11 12:18:06 PM PDT session 0x13220b93e610555 cxid 0x0 zxid 0x601bd3895 createSession 6000 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x6a zxid 0x601bd38cd closeSession null 9/22/11 12:18:10 PM PDT session 0x13220b93e610556 cxid 0x0 zxid 0x601bd38d1 createSession 6000 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x0 zxid 0x601bd38f2 createSession 6000 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x51 zxid 0x601bd396a closeSession null Here is the log output for the sessions that tried creating the same node 9/22/11 12:17:54 PM PDT session 0x13220b93e61054f cxid 0x42 zxid 0x601bd366b create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x42 zxid 0x601bd36ce create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x42 zxid 0x601bd3711 create 
'/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x42 zxid 0x601bd3777 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x42 zxid 0x601bd3802 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x44 zxid 0x601bd385d create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x44 zxid 0x601bd38b0 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x52 zxid 0x601bd396b create
[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13148634#comment-13148634 ] Mahadev konar commented on ZOOKEEPER-1208: -- You are right. I was worried about the returned sid. Go ahead and upload patches for 3.4 and trunk. Ephemeral node not removed after the client session is long gone Key: ZOOKEEPER-1208 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1208 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.3.3 Reporter: kishore gopalakrishna Assignee: Patrick Hunt Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br33.patch Copying from email thread. We found our ZK server in a state where an ephemeral node still exists after a client session is long gone. I used the cons command on each ZK host to list all connections and couldn't find the ephemeralOwner id. We are using ZK 3.3.3. Has anyone seen this problem? I got the following information from the logs. The node that still exists is /kafka-tracking/consumers/UserPerformanceEvent-host/owners/UserPerformanceEvent/529-7 I saw that the ephemeral owner is 86167322861045079 which is session id 0x13220b93e610550. After searching the transaction log of one of the ZK servers, I found that the session expired: 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 closeSession null On digging further into the logs I found that there were multiple sessions created in quick succession and every session tried to create the same node. 
But i verified that the sessions were closed and opened in order 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x0 zxid 0x601bd36b5 createSession 6000 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 closeSession null 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x0 zxid 0x601bd36f8 createSession 6000 9/22/11 12:17:59 PM PDT session 0x13220b93e610551 cxid 0x74 zxid 0x601bd373a closeSession null 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x0 zxid 0x601bd373e createSession 6000 9/22/11 12:18:01 PM PDT session 0x13220b93e610552 cxid 0x6c zxid 0x601bd37a0 closeSession null 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x0 zxid 0x601bd37e9 createSession 6000 9/22/11 12:18:03 PM PDT session 0x13220b93e610553 cxid 0x74 zxid 0x601bd382b closeSession null 9/22/11 12:18:04 PM PDT session 0x13220b93e610554 cxid 0x0 zxid 0x601bd383c createSession 6000 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x6a zxid 0x601bd388f closeSession null 9/22/11 12:18:06 PM PDT session 0x13220b93e610555 cxid 0x0 zxid 0x601bd3895 createSession 6000 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x6a zxid 0x601bd38cd closeSession null 9/22/11 12:18:10 PM PDT session 0x13220b93e610556 cxid 0x0 zxid 0x601bd38d1 createSession 6000 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x0 zxid 0x601bd38f2 createSession 6000 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x51 zxid 0x601bd396a closeSession null Here is the log output for the sessions that tried creating the same node 9/22/11 12:17:54 PM PDT session 0x13220b93e61054f cxid 0x42 zxid 0x601bd366b create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x42 zxid 0x601bd36ce create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x42 zxid 0x601bd3711 create 
'/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x42 zxid 0x601bd3777 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x42 zxid 0x601bd3802 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x44 zxid 0x601bd385d create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x44 zxid 0x601bd38b0 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x52 zxid 0x601bd396b create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7 Let me know if you need additional information. -- This message is automatically generated by
[jira] [Commented] (ZOOKEEPER-1215) C client persisted cache
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13146788#comment-13146788 ] Mahadev konar commented on ZOOKEEPER-1215: -- Marc, Sorry I've been a little busy with 3.4. I'll definitely comment on the jira after reading/thinking through this. thanks C client persisted cache Key: ZOOKEEPER-1215 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1215 Project: ZooKeeper Issue Type: New Feature Components: c client Reporter: Marc Celani Assignee: Marc Celani Motivation: 1. Reduce the impact of client restarts on zookeeper by implementing a persisted cache, and only fetching deltas on restart 2. Reduce unnecessary calls to zookeeper. 3. Improve performance of gets by caching on the client 4. Allow for larger caches than in memory caches. Behavior Change: Zookeeper clients will now have the option to specify a folder path where they can cache zookeeper gets. If they do choose to cache results, the zookeeper library will check the persisted cache before actually sending a request to zookeeper. Watches will automatically be placed on all gets in order to invalidate the cache. Alternatively, we can add a cache flag to the get API - thoughts? On reconnect or restart, zookeeper clients will check the version number of each entry in their persisted cache, and will invalidate any old entries. In checking version numbers, zookeeper clients will also place a watch on those files. In regards to watches, client watch handlers will not fire until the invalidation step is completed, which may slow down client watch handling. Since setting up watches on all files is necessary on initialization, initialization will likely slow down as well. API Change: The zookeeper library will expose a new init interface that specifies a folder path to the cache. A new get API will specify whether or not to use cache, and whether or not stale data is safe to return if the connection is down. 
Design: The zookeeper handler structure will now include a cache_root_path (possibly null) string to cache all gets, as well as a bool for whether or not it is okay to serve stale data. Old API calls will default to a null path (which signifies no cache), and signify that it is not okay to serve stale data. The cache will be located at cache_root_path. All files will be placed at cache_root_path/file_path. The cache will be an incomplete copy of everything that is in zookeeper, but everything in the cache will have the same relative path from the cache_root_path that it has as a path in zookeeper. Each file in the cache will include the Stat structure and the file contents. zoo_get will check the zookeeper handler to determine whether or not it has a cache. If it does, it will first go to the path to the persisted cache and append the get path. If the file exists and it is not invalidated, the zookeeper client will read it and return its value. If the file does not exist or is invalidated, the zookeeper library will perform the same get as is currently designed. After getting the results, the library will place the value in the persisted cache for subsequent reads. zoo_set will automatically invalidate the path in the cache. If caching is requested, then on each zoo_get that goes through to zookeeper, a watch will be placed on the path. A cache watch handler will handle all watch events by invalidating the cache, and placing another watch on it. Client watch handlers will handle the watch event after the cache watch handler. The cache watch handler will not call zoo_get, because it is assumed that the client watch handlers will call zoo_get if they need the fresh data as soon as it is invalidated (which is why the cache watch handler must be executed first). All updates to the cache will be done on a separate thread, but will be queued in order to maintain consistency in the cache. 
In addition, all client watch handlers will not be fired until the cache watch handler completes its invalidation write in order to ensure that client calls to zoo_get in the watch event handler are done after the invalidation step. This means that a client watch handler could be waiting on SEVERAL writes before it can be fired off, since all writes are queued. When a new connection is made, if a zookeeper handler has a cache, then that cache will be scanned in order to find all leaf nodes. Calls will be made to zookeeper to check if all of these nodes still exist, and if they do, what their version number is. Any inconsistencies in version will result in the cache invalidating the out of date files. Any files that no longer exist will be deleted from the
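The lifecycle the design describes, serve from cache, invalidate on watch before client handlers run, re-validate versions on reconnect, can be sketched against a dict standing in for the server. The class and method names here are illustrative, and version numbers are plain integers rather than full Stat structures:

```python
class CachingClient:
    """Sketch of the proposed persisted-cache flow. The 'server' is a plain
    dict of path -> (value, version); a real cache would live on disk under
    cache_root_path and set actual ZooKeeper watches."""
    def __init__(self, server):
        self.server = server
        self.cache = {}                         # path -> (value, version)

    def get(self, path):
        if path in self.cache:
            return self.cache[path][0]          # cache hit: no server round trip
        value, version = self.server[path]
        self.cache[path] = (value, version)     # real code would also set a watch
        return value

    def on_watch_event(self, path):
        # cache watch handler runs first: invalidate before client handlers fire
        self.cache.pop(path, None)

    def revalidate(self):
        # on reconnect: drop entries whose version changed or that were deleted
        for path, (_, version) in list(self.cache.items()):
            current = self.server.get(path)
            if current is None or current[1] != version:
                del self.cache[path]

server = {"/cfg": ("v1", 1)}
client = CachingClient(server)
print(client.get("/cfg"))        # "v1", fetched and cached
server["/cfg"] = ("v2", 2)
print(client.get("/cfg"))        # still "v1": stale until the watch fires
client.on_watch_event("/cfg")
print(client.get("/cfg"))        # "v2" after invalidation
```

The middle step shows exactly why ordering matters in the design: a client handler that ran before the invalidation would still read the stale "v1".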
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144469#comment-13144469 ] Mahadev konar commented on ZOOKEEPER-1264: -- Camille, Are you debugging the test failure in 3.4 or waiting for others to take a look? FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264-branch34.patch, ZOOKEEPER-1264-merge.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, ZOOKEEPER-1264unittest.patch, ZOOKEEPER-1264unittest.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip The FollowerResyncConcurrencyTest test is failing intermittently. saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144556#comment-13144556 ] Mahadev konar commented on ZOOKEEPER-1270: -- Alex, Can you please upload a patch that applies to trunk and 3.4 branch here? I'd like to get this done tonight. testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving. - Key: ZOOKEEPER-1270 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Patrick Hunt Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1270-and-1194.patch, ZOOKEEPER-1270.patch, ZOOKEEPER-1270.patch, ZOOKEEPER-1270_br34.patch, ZOOKEEPER-1270tests.patch, ZOOKEEPER-1270tests2.patch, testEarlyLeaderAbandonment.txt.gz, testEarlyLeaderAbandonment2.txt.gz, testEarlyLeaderAbandonment3.txt.gz, testEarlyLeaderAbandonment4.txt.gz Looks pretty serious - quorum is formed but no clients can attach. Will attach logs momentarily. This test was introduced in the following commit (all three jira commit at once): ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their logs. ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader ZOOKEEPER-1082. modify leader election to correctly take into account current -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144557#comment-13144557 ] Mahadev konar commented on ZOOKEEPER-1270: -- Alex, Please make sure that you grant code changes to Apache. You just have to click on the box that says Grant license to Apache when attaching the patch. Do reattach the patch with the grant. thanks testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving. - Key: ZOOKEEPER-1270 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Patrick Hunt Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1270-and-1194.patch, ZOOKEEPER-1270.patch, ZOOKEEPER-1270.patch, ZOOKEEPER-1270_br34.patch, ZOOKEEPER-1270tests.patch, ZOOKEEPER-1270tests2.patch, testEarlyLeaderAbandonment.txt.gz, testEarlyLeaderAbandonment2.txt.gz, testEarlyLeaderAbandonment3.txt.gz, testEarlyLeaderAbandonment4.txt.gz Looks pretty serious - quorum is formed but no clients can attach. Will attach logs momentarily. This test was introduced in the following commit (all three jira commit at once): ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their logs. ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader ZOOKEEPER-1082. modify leader election to correctly take into account current -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144595#comment-13144595 ] Mahadev konar commented on ZOOKEEPER-1270: -- +1 on Alex's suggestion. Let's stick to minimal changes for now :). testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving. - Key: ZOOKEEPER-1270 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Patrick Hunt Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1270-and-1194-branch34.patch, ZOOKEEPER-1270-and-1194.patch, ZOOKEEPER-1270-and-1194.patch, ZOOKEEPER-1270.patch, ZOOKEEPER-1270.patch, ZOOKEEPER-1270_br34.patch, ZOOKEEPER-1270tests.patch, ZOOKEEPER-1270tests2.patch, testEarlyLeaderAbandonment.txt.gz, testEarlyLeaderAbandonment2.txt.gz, testEarlyLeaderAbandonment3.txt.gz, testEarlyLeaderAbandonment4.txt.gz Looks pretty serious - quorum is formed but no clients can attach. Will attach logs momentarily. This test was introduced in the following commit (all three jira commit at once): ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their logs. ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader ZOOKEEPER-1082. modify leader election to correctly take into account current -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142709#comment-13142709 ] Mahadev konar commented on ZOOKEEPER-1270: -- Looks like the ZooKeeperServer does not start running within the Quorum Peers. There is something really wrong which prevents the Followers/leaders from starting the ZooKeeperServers. I suspect it has something to do with the NEWLEADER transaction (could be wrong). Need to look deeper. Another pair of eyes would help! testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving. - Key: ZOOKEEPER-1270 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Patrick Hunt Priority: Blocker Fix For: 3.4.0, 3.5.0 Attachments: testEarlyLeaderAbandonment.txt.gz, testEarlyLeaderAbandonment2.txt.gz Looks pretty serious - quorum is formed but no clients can attach. Will attach logs momentarily. This test was introduced in the following commit (all three jira commit at once): ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their logs. ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader ZOOKEEPER-1082. modify leader election to correctly take into account current
[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13140984#comment-13140984 ] Mahadev konar commented on ZOOKEEPER-1246: -- Looks good to me. Camille, do you want to check this in? Dead code in PrepRequestProcessor catch Exception block --- Key: ZOOKEEPER-1246 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246 Project: ZooKeeper Issue Type: Sub-task Reporter: Thomas Koch Assignee: Camille Fournier Priority: Blocker Fix For: 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, ZOOKEEPER-1246_trunk.patch, ZOOKEEPER-1246_trunk.patch This is a regression introduced by ZOOKEEPER-965 (multi transactions). The catch(Exception e) block in PrepRequestProcessor.pRequest contains an if block with condition request.getHdr() != null. This condition will always evaluate to false since the changes in ZOOKEEPER-965. This is caused by a change in sequence: Before ZK-965, the txnHeader was set _before_ the deserialization of the request. Afterwards the deserialization happens before request.setHdr is set. So the following RequestProcessors won't see the request as a failed one but as a Read request, since it doesn't have a hdr set. Notes: - it is very bad practice to catch Exception. The block should rather catch IOException - The check whether the TxnHeader is set in the request is used at several places to see whether the request is a read or write request. It isn't obvious for a newbie what it means whether a request has a hdr set or not. - at the beginning of pRequest the hdr and txn of request are set to null. However there is no chance that these fields could ever not be null at this point. The code however suggests that this could be the case. There should rather be an assertion that confirms that these fields are indeed null.
The practice of doing things just in case, even if there is no chance that this case could happen, is a very stinky code smell and means that the code isn't understandable or trustworthy. - The multi transaction switch case block in pRequest is very hard to read, because it misuses the request.{hdr|txn} fields as local variables.
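The ordering bug described above can be made concrete with a small, hypothetical model (the class and method names below are illustrative, not ZooKeeper's actual code): once deserialization throws before the header is assigned, the `getHdr() != null` branch in the broad catch block can never run.

```java
// Hypothetical, simplified model of the sequence change behind ZOOKEEPER-1246:
// after ZOOKEEPER-965, deserialization runs before setHdr(), so a failed
// request reaches the catch block with hdr still null and is treated as a read.
public class DeadBranchDemo {
    static class Request {
        Object hdr; // stands in for TxnHeader
        Object getHdr() { return hdr; }
        void setHdr(Object h) { hdr = h; }
    }

    /** Returns true if the error-handling branch would run for a request
     *  whose deserialization fails before the header is assigned. */
    static boolean errorBranchReached() {
        Request request = new Request();
        try {
            deserialize(request);          // throws before setHdr(...) is called
            request.setHdr(new Object());  // never reached
        } catch (Exception e) {            // the broad catch criticized above
            return request.getHdr() != null; // always false -> dead code
        }
        return false;
    }

    static void deserialize(Request r) throws java.io.IOException {
        throw new java.io.IOException("malformed request");
    }
}
```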
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141369#comment-13141369 ] Mahadev konar commented on ZOOKEEPER-1264: -- @Ben, sorry to be pestering, I'd like to get 3.4 rc1 out today. Please be back today :). FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip The FollowerResyncConcurrencyTest test is failing intermittently. Saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:<11741> but was:<14001> at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat}
[jira] [Commented] (ZOOKEEPER-1257) Rename MultiTransactionRecord to MultiRequest
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141572#comment-13141572 ] Mahadev konar commented on ZOOKEEPER-1257: -- Looked through the code, the rename does not change any compatibility story. We can change it anytime we want. Not really a blocker for 3.4. Rename MultiTransactionRecord to MultiRequest - Key: ZOOKEEPER-1257 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1257 Project: ZooKeeper Issue Type: Sub-task Reporter: Thomas Koch Assignee: Thomas Koch Priority: Critical Understanding the code behind multi operations doesn't get any easier when the code violates naming consistency. All other Request classes are called xxxRequest; only for multi is it xxxTransactionRecord! Also Transaction is wrong, because there is the concept of transactions that are transmitted between quorum peers or committed to disk. MultiTransactionRecord however is a _Request_ from a client.
[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141654#comment-13141654 ] Mahadev konar commented on ZOOKEEPER-1269: -- Camille, Should this go into 3.4 or just trunk? Multi deserialization issues Key: ZOOKEEPER-1269 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.0 Reporter: Camille Fournier Assignee: Camille Fournier Attachments: ZOOKEEPER-1269.patch From the mailing list: FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure during deserialization. The problem is explained there in a code comment. The code block however is only executed for a CREATE txn, not for a multi txn containing a CREATE. Even if the mentioned code block were also executed for multi transactions, it would need adaptation for multi transactions. What if, after the first failed transaction in a multi txn during deserialization, there were subsequent transactions in the same multi that would also have failed? We don't know, since the first failed transaction hides the information about the remaining transactions.
[jira] [Commented] (ZOOKEEPER-1100) Killed (or missing) SendThread will cause hanging threads
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141719#comment-13141719 ] Mahadev konar commented on ZOOKEEPER-1100: -- Camille, I don't think we have a dependency on Mockito yet. I am adding one in ZOOKEEPER-1271. Killed (or missing) SendThread will cause hanging threads - Key: ZOOKEEPER-1100 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1100 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.3.3 Environment: http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E Reporter: Gunnar Wagenknecht Assignee: Rakesh R Fix For: 3.5.0 Attachments: ZOOKEEPER-1100.patch, ZOOKEEPER-1100.patch After investigating an issue with [hanging threads|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E] I noticed that any java.lang.Error might silently kill the SendThread. Without a SendThread any thread that wants to send something will hang forever. Currently nobody will recognize a SendThread that died. I think at least a state should be flipped (or a flag should be set) that causes all further send attempts to fail or to re-spin the connection loop.
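The "flip a flag" idea suggested in the report can be sketched with an `UncaughtExceptionHandler`, which fires for `java.lang.Error` as well as exceptions. This is a minimal illustration under assumed names (`SendThreadGuard`, `sendThreadAlive` are hypothetical), not ZooKeeper's actual client code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: detect a dead send thread and fail further sends fast,
// instead of letting callers block forever on a queue nobody drains.
public class SendThreadGuard {
    static final AtomicBoolean sendThreadAlive = new AtomicBoolean(true);

    static Thread startSendThread(Runnable loop) {
        Thread t = new Thread(loop, "SendThread");
        // Fires on ANY uncaught Throwable, including java.lang.Error,
        // so the failure is recorded rather than silently swallowed.
        t.setUncaughtExceptionHandler((thread, err) -> sendThreadAlive.set(false));
        t.start();
        return t;
    }

    static void send(String packet) {
        if (!sendThreadAlive.get()) {
            throw new IllegalStateException("SendThread is dead; reconnect required");
        }
        // ... enqueue packet for the send thread ...
    }
}
```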
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139970#comment-13139970 ] Mahadev konar commented on ZOOKEEPER-1264: -- Ben/Flavio, Any comments? FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip The FollowerResyncConcurrencyTest test is failing intermittently. Saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:<11741> but was:<14001> at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat}
[jira] [Commented] (ZOOKEEPER-1273) Copy'n'pasted unit test
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139971#comment-13139971 ] Mahadev konar commented on ZOOKEEPER-1273: -- @Thomas, Might be better to do that to make sure Hudson agrees with the deletion. Copy'n'pasted unit test --- Key: ZOOKEEPER-1273 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1273 Project: ZooKeeper Issue Type: Bug Reporter: Thomas Koch Assignee: Thomas Koch Priority: Trivial Probably caused by the usage of a legacy VCS, a code duplication happened when the project moved from SourceForge to Apache (ZOOKEEPER-38). The following file can be deleted: src/java/test/org/apache/zookeeper/server/DataTreeUnitTest.java src/java/test/org/apache/zookeeper/test/DataTreeTest.java was an exact copy of the above until ZOOKEEPER-1046 added an additional test case only to the latter. Do I need to upload a patch file for this?
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138116#comment-13138116 ] Mahadev konar commented on ZOOKEEPER-1264: -- +1 looks good to me. Might want to check on the Hudson tests. Looks like https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/709//testReport/ has an observer test failing? Doesn't seem related, but no harm in running the trunk patch through Hudson again. FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch The FollowerResyncConcurrencyTest test is failing intermittently. Saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:<11741> but was:<14001> at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat}
[jira] [Commented] (ZOOKEEPER-1259) central mapping from type to txn record class
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13137264#comment-13137264 ] Mahadev konar commented on ZOOKEEPER-1259: -- @Thomas, You can check the console output for C test failures: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/702//console {noformat} [exec] [exec] /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/src/c/tests/TestMulti.cc:574: Assertion: equality assertion failed [Expected: 0, Actual : 709395008] [exec] [exec] Failures !!! [exec] [exec] Run: 57 Failure total: 1 Failures: 1 Errors: 0 [exec] [exec] make: *** [run-check] Error 1 [exec] {noformat} central mapping from type to txn record class - Key: ZOOKEEPER-1259 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1259 Project: ZooKeeper Issue Type: Sub-task Reporter: Thomas Koch Assignee: Thomas Koch Attachments: ZOOKEEPER-1259.patch There are two places where large switch statements do nothing other than select the correct Record class according to a txn type. Provided a static map in SerializeUtils from type to Class<? extends Record> and a method to retrieve a new txn Record instance for a type. Code size reduced by 28 lines.
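The refactoring described in the issue (replace duplicated switch statements with one static type-to-class map) can be sketched like this. The record classes and opcodes below are stand-ins, not ZooKeeper's real `org.apache.jute.Record` hierarchy:

```java
import java.util.Map;
import java.util.function.Supplier;

// Sketch of a central type -> txn record mapping, replacing per-call-site
// switch statements with one lookup table plus a factory method.
public class TxnFactory {
    interface Record {}                       // stands in for org.apache.jute.Record
    static class CreateTxn implements Record {}
    static class DeleteTxn implements Record {}

    // Illustrative opcodes; the real values live in ZooDefs.OpCode.
    static final int CREATE = 1, DELETE = 2;

    static final Map<Integer, Supplier<? extends Record>> TXN_FACTORIES =
            Map.of(CREATE, CreateTxn::new,
                   DELETE, DeleteTxn::new);

    /** Returns a fresh txn Record instance for the given type. */
    static Record newTxnInstance(int type) {
        Supplier<? extends Record> factory = TXN_FACTORIES.get(type);
        if (factory == null) {
            throw new IllegalArgumentException("unknown txn type " + type);
        }
        return factory.get();
    }
}
```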
[jira] [Commented] (ZOOKEEPER-1242) Repeat add watcher, memory leak
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13133842#comment-13133842 ] Mahadev konar commented on ZOOKEEPER-1242: -- @Peng, The jira seems to be resolved? The patch doesn't seem to be committed; any reason you marked this resolved? Repeat add watcher, memory leak - Key: ZOOKEEPER-1242 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1242 Project: ZooKeeper Issue Type: Bug Components: c client Affects Versions: 3.3.3 Environment: Redhat linux Reporter: Peng Futian Labels: patch Fix For: 3.3.4 Attachments: ZOOKEEPER-1242.patch Original Estimate: 1h Remaining Estimate: 1h When I repeatedly add a watcher, there is a memory leak.
[jira] [Commented] (ZOOKEEPER-1240) Compiler issue with redhat linux
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13133847#comment-13133847 ] Mahadev konar commented on ZOOKEEPER-1240: -- Peng, you seem to have closed the jira again. Take a look at https://cwiki.apache.org/confluence/display/ZOOKEEPER/HowToContribute for guidance on how to upload/review/get it committed. Compiler issue with redhat linux Key: ZOOKEEPER-1240 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1240 Project: ZooKeeper Issue Type: Bug Components: c client Affects Versions: 3.3.3 Environment: Linux phy 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux gcc version 4.1.2 20070626 (Red Hat 4.1.2-14) Reporter: Peng Futian Priority: Minor Labels: patch Fix For: 3.3.4 Attachments: ZOOKEEPER-1240.patch Original Estimate: 1h Remaining Estimate: 1h When I compile the zookeeper c client in my project, there are some errors: ../../../include/zookeeper/recordio.h:70: error: expected unqualified-id before '__extension__' ../../../include/zookeeper/recordio.h:70: error: expected `)' before '__extension__' ../../../include/zookeeper/recordio.h:70: error: expected unqualified-id before ')' token
[jira] [Commented] (ZOOKEEPER-1197) Incorrect socket handling of 4 letter words for NIO
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125524#comment-13125524 ] Mahadev konar commented on ZOOKEEPER-1197: -- Camille, What do we want to do then? Closing the connection from the client is probably not feasible. Should we just check in what we have? I am not a big fan of letting the connections linger on the server and then close them later. Incorrect socket handling of 4 letter words for NIO --- Key: ZOOKEEPER-1197 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1197 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.3.3, 3.4.0 Reporter: Camille Fournier Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1197.patch When transferring a large amount of information from a 4 letter word, especially in interactive mode (telnet or nc) over a slower network link, the connection can be closed before all of the data has reached the client. This is due to the way we handle nc non-interactive mode, by cancelling the selector key. Instead of cancelling the selector key for 4-letter-words, we should instead flag the NIOServerCnxn to ignore detection of a close condition on that socket (CancelledKeyException, EndOfStreamException). Since the 4lw will close the connection immediately upon completion, this should be safe to do. See ZOOKEEPER-737 for more details
[jira] [Commented] (ZOOKEEPER-1210) Can't build ZooKeeper RPM with RPM >= 4.6.0 (i.e. on RHEL 6 and Fedora >= 10)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13122593#comment-13122593 ] Mahadev konar commented on ZOOKEEPER-1210: -- Tadeusz, You might want to use --no-prefix for generating the patch. Can't build ZooKeeper RPM with RPM >= 4.6.0 (i.e. on RHEL 6 and Fedora >= 10) - Key: ZOOKEEPER-1210 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1210 Project: ZooKeeper Issue Type: Bug Components: build Affects Versions: 3.4.0 Environment: Tested to fail on both Centos 6.0 and Fedora 14 Reporter: Tadeusz Andrzej Kadłubowski Priority: Minor Labels: patch Attachments: rpm_buildroot_on_RHEL6.patch I was trying to build the zookeeper RPM (basically, `ant rpm -Dskip.contrib=1`), using build scripts that were recently merged from the work on the ZOOKEEPER-999 issue. The final stage, i.e. running rpmbuild, failed. From what I understand it mixed the BUILD and BUILDROOT subdirectories in /tmp/zookeeper_package_build_tkadlubo/, leaving BUILDROOT empty, and placing everything in BUILD. The full build log is at http://pastebin.com/0ZvUAKJt (Caution: I cut out long file listings from running tar -xvvf).
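The `--no-prefix` suggestion above refers to git's diff option: it drops the `a/` and `b/` path prefixes so the patch applies with `patch -p0`, which the Apache precommit tooling of the era expected. A throwaway demonstration (the repo and file names are made up for the example):

```shell
# Create a disposable repo purely to show the flag's effect.
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "hello" > README
git add README
git -c user.name=demo -c user.email=demo@example.com commit -qm init
echo "world" >> README

# The actual suggestion: generate the patch without a/ b/ prefixes.
git diff --no-prefix > ZOOKEEPER-1210.patch

grep '^--- ' ZOOKEEPER-1210.patch   # prints "--- README", not "--- a/README"
```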
[jira] [Commented] (ZOOKEEPER-1215) C client persisted cache
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13122503#comment-13122503 ] Mahadev konar commented on ZOOKEEPER-1215: -- @Marc, Can you elaborate on the use case for this? What are the issues that you are facing that are creating a need for client side caching? Also, on a restart won't the client cache be invalid? Do you plan to persist the session and make sure you restart within the session expiry? C client persisted cache Key: ZOOKEEPER-1215 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1215 Project: ZooKeeper Issue Type: New Feature Components: c client Reporter: Marc Celani Assignee: Marc Celani Motivation: 1. Reduce the impact of client restarts on zookeeper by implementing a persisted cache, and only fetching deltas on restart 2. Reduce unnecessary calls to zookeeper. 3. Improve performance of gets by caching on the client 4. Allow for larger caches than in-memory caches. Behavior Change: ZooKeeper clients will now have the option to specify a folder path where they can cache zookeeper gets. If they do choose to cache results, the zookeeper library will check the persisted cache before actually sending a request to zookeeper. Watches will automatically be placed on all gets in order to invalidate the cache. Alternatively, we can add a cache flag to the get API - thoughts? On reconnect or restart, zookeeper clients will check the version number of each entry in the persisted cache, and will invalidate any old entries. In checking version numbers, zookeeper clients will also place a watch on those files. In regards to watches, client watch handlers will not fire until the invalidation step is completed, which may slow down client watch handling. Since setting up watches on all files is necessary on initialization, initialization will likely slow down as well.
API Change: The zookeeper library will expose a new init interface that specifies a folder path to the cache. A new get API will specify whether or not to use cache, and whether or not stale data is safe to return if the connection is down. Design: The zookeeper handler structure will now include a cache_root_path (possibly null) string to cache all gets, as well as a bool for whether or not it is okay to serve stale data. Old API calls will default to a null path (which signifies no cache), and signify that it is not okay to serve stale data. The cache will be located at a cache_root_path. All files will be placed at cache_root_path/file_path. The cache will be an incomplete copy of everything that is in zookeeper, but everything in the cache will have the same relative path from the cache_root_path that it has as a path in zookeeper. Each file in the cache will include the Stat structure and the file contents. zoo_get will check the zookeeper handler to determine whether or not it has a cache. If it does, it will first go to the path to the persisted cache and append the get path. If the file exists and it is not invalidated, the zookeeper client will read it and return its value. If the file does not exist or is invalidated, the zookeeper library will perform the same get as is currently designed. After getting the results, the library will place the value in the persisted cache for subsequent reads. zoo_set will automatically invalidate the path in the cache. If caching is requested, then on each zoo_get that goes through to zookeeper, a watch will be placed on the path. A cache watch handler will handle all watch events by invalidating the cache, and placing another watch on it. Client watch handlers will handle the watch event after the cache watch handler.
The cache watch handler will not call zoo_get, because it is assumed that the client watch handlers will call zoo_get if they need the fresh data as soon as it is invalidated (which is why the cache watch handler must be executed first). All updates to the cache will be done on a separate thread, but will be queued in order to maintain consistency in the cache. In addition, all client watch handlers will not be fired until the cache watch handler completes its invalidation write in order to ensure that client calls to zoo_get in the watch event handler are done after the invalidation step. This means that a client watch handler could be waiting on SEVERAL writes before it can be fired off, since all writes are queued. When a new connection is made, if a zookeeper handler has a cache, then that cache will be scanned in order to find all leaf nodes. Calls will be made to zookeeper to check if all of these nodes still exist, and if they do, what their version number is.
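The read path described in this design (consult the cache, fall through to the server on a miss, and let a watch event invalidate rather than refresh the entry) can be sketched language-agnostically. Everything below is a hypothetical in-memory model, not the proposed C implementation; the path used is made up:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the design's read path: a get first consults a local cache keyed
// by znode path; a watch event invalidates the entry (it does NOT re-read),
// and the next get falls through to the server and repopulates the cache.
public class CachedClient {
    static class Entry { byte[] data; boolean valid = true; }

    final Map<String, Entry> cache = new HashMap<>();
    int serverReads = 0; // instrumentation for the sketch only

    byte[] get(String path) {
        Entry e = cache.get(path);
        if (e != null && e.valid) return e.data;   // served from cache
        byte[] data = readFromServer(path);        // miss or invalidated entry
        Entry fresh = new Entry();
        fresh.data = data;
        cache.put(path, fresh);                    // a watch would be set here
        return data;
    }

    void onWatchFired(String path) {               // the "cache watch handler"
        Entry e = cache.get(path);
        if (e != null) e.valid = false;            // invalidate, don't fetch
    }

    byte[] readFromServer(String path) {           // stands in for a real zoo_get
        serverReads++;
        return ("data@" + path).getBytes();
    }
}
```

Keeping the watch handler to pure invalidation, as the design insists, is what lets client watch handlers decide for themselves whether the fresh value is worth a round trip.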
[jira] [Commented] (ZOOKEEPER-1112) Add support for C client for SASL authentication
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13122505#comment-13122505 ] Mahadev konar commented on ZOOKEEPER-1112: -- Very glad to see this! Will take a look at the patch sometime this week! Add support for C client for SASL authentication Key: ZOOKEEPER-1112 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1112 Project: ZooKeeper Issue Type: New Feature Reporter: Eugene Koontz Attachments: ZOOKEEPER-1112.patch, zookeeper-c-client-sasl.patch Hopefully this would leverage the SASL server-side support provided by ZOOKEEPER-938. It would be similar to the Java SASL client support also provided in ZOOKEEPER-938. Java has built-in SASL support, but I'm not sure what C libraries are available for SASL and, if so, whether they are compatible with the Apache license.
[jira] [Commented] (ZOOKEEPER-1189) For an invalid snapshot file(less than 10bytes size) RandomAccessFile stream is leaking.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115069#comment-13115069 ] Mahadev konar commented on ZOOKEEPER-1189: -- Thanks Rakesh, will go ahead and commit. For an invalid snapshot file(less than 10bytes size) RandomAccessFile stream is leaking. Key: ZOOKEEPER-1189 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1189 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.3.3 Reporter: Rakesh R Assignee: Rakesh R Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1189-branch-3.3.patch, ZOOKEEPER-1189.1.patch, ZOOKEEPER-1189.patch When loading the snapshot, ZooKeeper will consider only the 'snapshots with at least 10 bytes size'. Otherwise it will ignore the file and just return without closing the RandomAccessFile. {noformat} Util.isValidSnapshot() has the following logic. // Check for a valid snapshot RandomAccessFile raf = new RandomAccessFile(f, "r"); // including the header and the last / bytes // the snapshot should be at least 10 bytes if (raf.length() < 10) { return false; } {noformat} Since the snapshot file validation logic is outside the try block, it won't go to the finally block and will be leaked. Suggestion: Move the validation logic into the try/catch block.
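The suggested fix can be sketched as follows. This is a simplified stand-in for `Util.isValidSnapshot()` (the real method does further header and EOF-marker checks): the point is that the length check moves inside try/finally so the stream is closed on every return path.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the suggested fix: open the RandomAccessFile, then do ALL
// validation inside try/finally so the descriptor is never leaked,
// even when the snapshot is rejected as too small.
public class SnapshotCheck {
    static boolean isValidSnapshot(File f) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        try {
            // the snapshot should be at least 10 bytes
            if (raf.length() < 10) {
                return false;        // finally still closes the stream
            }
            // ... further header/EOF checks would go here ...
            return true;
        } finally {
            raf.close();
        }
    }
}
```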
[jira] [Commented] (ZOOKEEPER-1195) SASL authorizedID being incorrectly set: should use getHostName() rather than getServiceName()
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115136#comment-13115136 ] Mahadev konar commented on ZOOKEEPER-1195: -- Eugene, Should we just incorporate ZOOKEEPER-1201 into 3.4? What do you think? SASL authorizedID being incorrectly set: should use getHostName() rather than getServiceName() -- Key: ZOOKEEPER-1195 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1195 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.4.0 Reporter: Eugene Koontz Assignee: Eugene Koontz Fix For: 3.4.0 Attachments: SaslAuthNamingTest.java, ZOOKEEPER-1195.patch Tom Klonikowski writes: Hello developers, the SaslServerCallbackHandler in trunk changes the principal name service/host@REALM to service/service@REALM (I guess unintentionally). Lines 131-133: if (!removeHost() && (kerberosName.getHostName() != null)) { userName += "/" + kerberosName.getServiceName(); } Server Log: SaslServerCallbackHandler@115] - Successfully authenticated client: authenticationID=fetcher/ubook@QUINZOO; authorizationID=fetcher/ubook@QUINZOO. SaslServerCallbackHandler@137] - Setting authorizedID: fetcher/fetcher@QUINZOO
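The bug and its fix are easy to see with a simplified parse of a `service/host@REALM` principal. This class is a hypothetical stand-in for ZooKeeper's KerberosName handling, not the actual code; the fix is appending the host component where the original appended the service name again:

```java
// Hypothetical, simplified model of the ZOOKEEPER-1195 fix: when rebuilding
// the authorized ID, append the HOST component, not the service name, or
// fetcher/ubook@QUINZOO collapses to fetcher/fetcher@QUINZOO as in the log.
public class PrincipalName {
    final String serviceName, hostName, realm;

    PrincipalName(String principal) {
        String[] atSplit = principal.split("@", 2);
        realm = atSplit.length > 1 ? atSplit[1] : null;
        String[] slashSplit = atSplit[0].split("/", 2);
        serviceName = slashSplit[0];
        hostName = slashSplit.length > 1 ? slashSplit[1] : null;
    }

    String authorizedId(boolean removeHost) {
        String userName = serviceName;
        if (!removeHost && hostName != null) {
            userName += "/" + hostName;   // the bug appended serviceName here
        }
        return realm != null ? userName + "@" + realm : userName;
    }
}
```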
[jira] [Commented] (ZOOKEEPER-1181) Fix problems with Kerberos TGT renewal
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115144#comment-13115144 ] Mahadev konar commented on ZOOKEEPER-1181: -- Eugene, We should write some unit tests for this. I am fine checking this into 3.4 for now. Can you please create a ticket to add a unit test for this? Mockito would be very helpful here. Might make some changes to the patch to get this in ASAP. Fix problems with Kerberos TGT renewal -- Key: ZOOKEEPER-1181 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1181 Project: ZooKeeper Issue Type: Bug Components: java client, server Affects Versions: 3.4.0 Reporter: Eugene Koontz Assignee: Eugene Koontz Labels: kerberos, security Fix For: 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1181.patch, ZOOKEEPER-1181.patch Currently, in ZooKeeper trunk, there are two problems with Kerberos TGT renewal: 1. TGTs obtained from a keytab are not refreshed periodically. They should be, just as those from the ticket cache are refreshed. 2. Ticket renewal should be retried if it fails. Ticket renewal might fail if two or more separate processes (different JVMs) running as the same user try to renew Kerberos credentials at the same time. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
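Point 2 above amounts to a retry-with-backoff around the renewal action. A hedged sketch of that shape (the helper name, attempt count, and backoff constants are illustrative, not taken from the ZOOKEEPER-1181 patch):

```java
import java.util.concurrent.Callable;

public class RenewRetry {
    // Retry an action that may fail transiently, e.g. when another JVM
    // running as the same user is renewing the same Kerberos credentials.
    // Sleeps a growing backoff between attempts; rethrows the last failure
    // if all attempts are exhausted.
    public static <T> T withRetries(Callable<T> action, int maxAttempts,
                                    long backoffMs) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(backoffMs * attempt); // linear backoff between tries
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // fails twice (simulating a renewal collision), then succeeds
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new RuntimeException("renewal busy");
            return "renewed";
        }, 5, 1);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```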
[jira] [Commented] (ZOOKEEPER-1174) FD leak when network unreachable
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115159#comment-13115159 ] Mahadev konar commented on ZOOKEEPER-1174: -- Ted, Any update on this? Please let me know. I plan to cut a release soon and would like to get this in. thanks FD leak when network unreachable Key: ZOOKEEPER-1174 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1174 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.3.3 Reporter: Ted Dunning Assignee: Ted Dunning Priority: Critical Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1174.patch, ZOOKEEPER-1174.patch, ZOOKEEPER-1174.patch, ZOOKEEPER-1174.patch, ZOOKEEPER-1174.patch, zk-fd-leak.tgz In the socket connection logic there are several errors that result in bad behavior. The basic problem is that a socket is registered with a selector unconditionally when there are nuances that should be dealt with. First, the socket may connect immediately. Secondly, the connect may throw an exception. In either of these two cases, I don't think that the socket should be registered. I will attach a test case that demonstrates the problem. I have been unable to create a unit test that exhibits the problem because I would have to mock the low level socket libraries to do so. It would still be good to do so if somebody can figure out a good way. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
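The behavior the report asks for, handling both nuances, can be sketched with plain NIO: only register the channel for OP_CONNECT when the connect is genuinely in progress, and close the channel if connect() throws so the descriptor is not leaked. This is an illustration of the pattern, not the code in the ZOOKEEPER-1174 patch.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class ConnectSketch {
    // Open a non-blocking channel and register it appropriately:
    //  - connect() returned true: connected immediately, no OP_CONNECT needed
    //  - connect() returned false: connection pending, register OP_CONNECT
    //  - connect() threw: close the channel so the FD is not leaked
    public static SocketChannel openAndRegister(Selector selector,
                                                InetSocketAddress addr)
            throws IOException {
        SocketChannel sock = SocketChannel.open();
        sock.configureBlocking(false);
        try {
            if (sock.connect(addr)) {
                // immediate connect: go straight to read interest
                sock.register(selector, SelectionKey.OP_READ);
            } else {
                sock.register(selector, SelectionKey.OP_CONNECT);
            }
        } catch (IOException e) {
            sock.close(); // the unconditional-register bug skipped this
            throw e;
        }
        return sock;
    }
}
```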
[jira] [Commented] (ZOOKEEPER-1174) FD leak when network unreachable
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115202#comment-13115202 ] Mahadev konar commented on ZOOKEEPER-1174: -- Wed night my time? FD leak when network unreachable Key: ZOOKEEPER-1174 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1174 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira