[jira] [Updated] (ZOOKEEPER-1390) some expensive debug code not protected by a check for debug
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1390:
-------------------------------------
    Attachment: ZOOKEEPER-1390.patch

this fixes the performance issue. i found that the improvement is anywhere from 5% (with 100% reads) to almost 100% (with 100% writes and 3 servers). no tests, since this is not a bug fix and does not add functionality.

> some expensive debug code not protected by a check for debug
> ------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1390
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1390
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: Benjamin Reed
>             Fix For: 3.5.0
>
>         Attachments: ZOOKEEPER-1390.patch
>
> there is some expensive debug code in DataTree.processTxn() that formats
> transactions for debugging; the formatting is very expensive but is only
> needed when errors happen and when debugging is turned on.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
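The fix described above is the standard "guard expensive formatting behind an is-debug-enabled check" pattern. Below is a minimal, self-contained sketch of that pattern; the class and method names are illustrative stand-ins, not the actual DataTree code, and the boolean flag stands in for a real `LOG.isDebugEnabled()` call:

```java
// Hedged sketch of the guard pattern the patch applies (illustrative names,
// not the real ZooKeeper code): skip expensive formatting entirely unless
// debug logging is enabled.
class DebugGuard {
    static boolean debugEnabled = false; // stands in for LOG.isDebugEnabled()
    static int formatCalls = 0;          // counts how often we pay the cost

    // stand-in for an expensive transaction formatter
    static String formatTxn(long zxid) {
        formatCalls++;
        return String.format("txn zxid=0x%x", zxid);
    }

    static void processTxn(long zxid) {
        // before: formatTxn(zxid) ran unconditionally on every transaction.
        // after: the guard makes the formatting cost proportional to actual
        // debug usage, which matches the 5%-100% improvement reported above.
        if (debugEnabled) {
            System.out.println(formatTxn(zxid));
        }
    }
}
```

With the guard in place the formatter is never invoked on the hot path unless debugging is on, which is why the write-heavy workloads (where processTxn dominates) see the largest gains.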
[jira] [Updated] (ZOOKEEPER-1390) some expensive debug code not protected by a check for debug
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1390:
-------------------------------------
    Component/s: server
  Fix Version/s: 3.5.0
[jira] [Updated] (ZOOKEEPER-1387) Wrong epoch file created
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1387:
-------------------------------------
    Attachment: ZOOKEEPER-1387.patch

the patch makes the change proposed by benjamin and includes a test

> Wrong epoch file created
> ------------------------
>
>                 Key: ZOOKEEPER-1387
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1387
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.2
>            Reporter: Benjamin Busjaeger
>            Priority: Minor
>         Attachments: ZOOKEEPER-1387.patch
>
> It looks like line 443 in QuorumPeer [1] may need to change from:
>     writeLongToFile(CURRENT_EPOCH_FILENAME, acceptedEpoch);
> to
>     writeLongToFile(ACCEPTED_EPOCH_FILENAME, acceptedEpoch);
> I only noticed this reading the code, so I may be wrong and I don't know yet
> if/how this affects the runtime.
> [1] https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L443
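The one-line fix above is easy to see in isolation: the accepted epoch and the current epoch are persisted to two separate files, and writing `acceptedEpoch` through the current-epoch filename silently overwrites the wrong one. The sketch below is illustrative only (the filename constants mirror QuorumPeer's, but the helper class is hypothetical, not the real implementation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hedged sketch of the bug and fix: two epoch values, two files. Writing the
// accepted epoch to the current-epoch file clobbers the current epoch.
class EpochFiles {
    static final String CURRENT_EPOCH_FILENAME = "currentEpoch";
    static final String ACCEPTED_EPOCH_FILENAME = "acceptedEpoch";
    final Path dir;

    EpochFiles(Path dir) { this.dir = dir; }

    void writeLongToFile(String name, long value) throws IOException {
        Files.writeString(dir.resolve(name), Long.toString(value));
    }

    void setAcceptedEpoch(long acceptedEpoch) throws IOException {
        // buggy version: writeLongToFile(CURRENT_EPOCH_FILENAME, acceptedEpoch);
        writeLongToFile(ACCEPTED_EPOCH_FILENAME, acceptedEpoch); // the proposed fix
    }
}
```

With the fix, updating the accepted epoch leaves the current-epoch file untouched, so a restarting peer reads back the epoch values it actually committed.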
[jira] [Updated] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1367:
-------------------------------------
    Attachment: ZOOKEEPER-1367-3.3.patch

here is the patch to fix 3.3, with a testcase that will reproduce the bug in 3.3.

> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1367
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.2
>         Environment: Debian Squeeze, 64-bit
>            Reporter: Jeremy Stribling
>            Assignee: Benjamin Reed
>            Priority: Blocker
>             Fix For: 3.3.5, 3.4.3, 3.5.0
>
>         Attachments: 1367-3.3.patch, ZOOKEEPER-1367-3.3.patch,
> ZOOKEEPER-1367-3.4.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.patch,
> ZOOKEEPER-1367.tgz
>
> In one of our tests, we have a cluster of three ZooKeeper servers. We kill
> all three, and then restart just two of them. Sometimes we notice that on
> one of the restarted servers, ephemeral nodes from previous sessions do not
> get deleted, while on the other server they do. We are effectively running
> 3.4.2, though technically we are running 3.4.1 with the patch manually
> applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for
> ZOOKEEPER-1163.
> I noticed that when I connected using zkCli.sh to the first node (90.0.0.221,
> zkid 84), I saw only one znode in a particular path:
> {quote}
> [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
> [nominee11]
> [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
> 90.0.0.222:
> cZxid = 0x40027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x40027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x40027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> {quote}
> However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251),
> I saw three znodes under that same path:
> {quote}
> [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
> nominee06 nominee10 nominee11
> [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
> 90.0.0.222:
> cZxid = 0x40027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x40027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x40027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
> 90.0.0.221:
> cZxid = 0x3014c
> ctime = Thu Jan 19 07:53:42 UTC 2012
> mZxid = 0x3014c
> mtime = Thu Jan 19 07:53:42 UTC 2012
> pZxid = 0x3014c
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc22
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
> 90.0.0.223:
> cZxid = 0x20cab
> ctime = Thu Jan 19 08:00:30 UTC 2012
> mZxid = 0x20cab
> mtime = Thu Jan 19 08:00:30 UTC 2012
> pZxid = 0x20cab
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x5434f5074e040002
> dataLength = 16
> numChildren = 0
> {quote}
> These never went away for the lifetime of the server, for any clients
> connected directly to that server.
> Note that this cluster is configured to have all three servers still, the
> third one being down (90.0.0.223, zkid 162).
> I captured the data/snapshot directories for the two live servers. When
> I start single-node servers using each directory, I can briefly see that the
> inconsistent data is present in those logs, though the ephemeral nodes seem
> to get (correctly) cleaned up pretty soon after I start the server.
> I will upload a tar containing the debug logs and data directories from the
> failure. I think we can reproduce it regularly if you need more info.
[jira] [Updated] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1367:
-------------------------------------
    Attachment: ZOOKEEPER-1367-3.4.patch

here is the patch for the 3.4 branch
[jira] [Updated] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1367:
-------------------------------------
    Attachment: ZOOKEEPER-1367.patch

fixed the LearnerTest (it needed to simulate the zk server a bit more robustly)
[jira] [Updated] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1367:
-------------------------------------
    Attachment: ZOOKEEPER-1367.patch

ok, this should fix the bug, and it also has a test that reliably reproduces the bug.
[jira] [Updated] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1355:
-------------------------------------
    Hadoop Flags: Reviewed

> Add zk.updateServerList(newServerList)
> --------------------------------------
>
>                 Key: ZOOKEEPER-1355
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
>             Project: ZooKeeper
>          Issue Type: New Feature
>          Components: java client
>            Reporter: Alexander Shraer
>            Assignee: Alexander Shraer
>             Fix For: 3.5.0
>
>         Attachments: ZOOKEEPER-1355-ver2.patch, ZOOKEEPER-1355-ver4.patch,
> ZOOKEEPER-1355-ver5.patch, ZOOKEEPER=1355-ver3.patch,
> ZOOOKEEPER-1355-test.patch, ZOOOKEEPER-1355-ver1.patch,
> ZOOOKEEPER-1355.patch, loadbalancing-more-details.pdf, loadbalancing.pdf
>
> When the set of servers changes, we would like to update the server list
> stored by clients without restarting the clients.
> Moreover, assuming that the number of clients per server is the same (in
> expectation) in the old configuration (as guaranteed by the current list
> shuffling, for example), we would like to re-balance client connections
> across the new set of servers such that (a) the number of clients per server
> is the same for all servers (in expectation) and (b) there is no
> excessive/unnecessary client migration.
> It is simple to achieve (a) without (b): just re-shuffle the new list of
> servers at every client. But this would create unnecessary migration, which
> we'd like to avoid.
> We propose a simple probabilistic migration scheme that achieves (a) and (b):
> each client locally decides whether and where to migrate when the list of
> servers changes. The attached document describes the scheme and shows an
> evaluation of it in ZooKeeper. We also implemented re-balancing through a
> consistent-hashing scheme and show a comparison. We derived the probabilistic
> migration rules from a simple formula that we can also provide, if someone's
> interested in the proof.
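To make the (a)-and-(b) goal above concrete, here is a hedged sketch of one probabilistic migration rule of this flavor for the simple case where servers are only added: each client moves with probability 1 - oldCount/newCount, and movers pick uniformly among the added servers. This is an illustration of the idea, not the scheme from the attached document or ZooKeeper's actual implementation; all names below are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hedged sketch (not ZooKeeper's real code): a local, per-client migration
// decision that keeps expected load equal across servers while avoiding
// unnecessary migration when servers are added to the list.
class Rebalance {
    static String pickServer(String current, List<String> oldList,
                             List<String> newList, Random rnd) {
        List<String> added = new ArrayList<>(newList);
        added.removeAll(oldList);
        if (added.isEmpty() || !newList.contains(current)) {
            // nothing was added, or our server was removed: rejoin uniformly
            return newList.get(rnd.nextInt(newList.size()));
        }
        // move with probability 1 - |old|/|new|, so expected load stays equal
        double pMove = 1.0 - (double) oldList.size() / newList.size();
        if (rnd.nextDouble() < pMove) {
            return added.get(rnd.nextInt(added.size())); // movers go to new servers
        }
        return current; // stay put: no unnecessary migration
    }
}
```

The key property is (b): a client that does migrate always lands on a newly added server, so no client hops between two servers that were both already in the list.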
[jira] [Updated] (ZOOKEEPER-1319) Missing data after restarting+expanding a cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1319:
-------------------------------------
    Attachment: ZOOKEEPER-1319_trunk2.patch

this should fix everything. i'd like to add a couple more unit tests, but the functional fixes are in.

> Missing data after restarting+expanding a cluster
> -------------------------------------------------
>
>                 Key: ZOOKEEPER-1319
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1319
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.0
>         Environment: Linux (Debian Squeeze)
>            Reporter: Jeremy Stribling
>            Assignee: Patrick Hunt
>            Priority: Blocker
>              Labels: cluster, data
>             Fix For: 3.5.0, 3.4.1
>
>         Attachments: ZOOKEEPER-1319.patch, ZOOKEEPER-1319.patch,
> ZOOKEEPER-1319_trunk.patch, ZOOKEEPER-1319_trunk2.patch, logs.tgz
>
> I've been trying to update to ZK 3.4.0 and have had some issues where some
> data become inaccessible after adding a node to a cluster. My use case is a
> bit strange (as explained before on this list) in that I try to grow the
> cluster dynamically by having an external program automatically restart
> Zookeeper servers in a controlled way whenever the list of participating ZK
> servers needs to change. This used to work just fine in 3.3.3 (and before),
> so this represents a regression.
> The scenario I see is this:
> 1) Start up a 1-server ZK cluster (the server has ZK ID 0).
> 2) A client connects to the server, and makes a bunch of znodes, in
> particular a znode called "/membership".
> 3) Shut down the cluster.
> 4) Bring up a 2-server ZK cluster, including the original server 0 with its
> existing data, and a new server with ZK ID 1.
> 5) Node 0 has the highest zxid and is elected leader.
> 6) A client connecting to server 1 tries to "get /membership" and gets back a
> -101 error code (no such znode).
> 7) The same client then tries to "create /membership" and gets back a -110
> error code (znode already exists).
> 8) Clients connecting to server 0 can successfully "get /membership".
> I will attach a tarball with debug logs for both servers, annotating where
> steps #1 and #4 happen. You can see that the election involves a proposal
> for zxid 110 from server 0, but immediately following the election server 1
> has these lines:
> 2011-12-05 17:18:48,308 9299 [QuorumPeer[myid=1]/127.0.0.1:2901] WARN
> org.apache.zookeeper.server.quorum.Learner - Got zxid 0x10001 expected 0x1
> 2011-12-05 17:18:48,313 9304 [SyncThread:1] INFO
> org.apache.zookeeper.server.persistence.FileTxnLog - Creating new log file:
> log.10001
> Perhaps that's not relevant, but it struck me as odd. At the end of server
> 1's log you can see a repeated cycle of getData->create->getData as the
> client tries to make sense of the inconsistent responses.
> The other piece of information is that if I try to use the on-disk
> directories for either of the servers to start a new one-node ZK cluster, all
> the data are accessible.
> I haven't tried writing a program outside of my application to reproduce
> this, but I can do it very easily with some of my app's tests if anyone needs
> more information.
[jira] [Updated] (BOOKKEEPER-31) Need a project logo
[ https://issues.apache.org/jira/browse/BOOKKEEPER-31?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated BOOKKEEPER-31:
------------------------------------
    Attachment: bookeper_wht.png
                bookeper_blk.png

here are hi-res pngs, one for a white background and the other for black. if everyone is good with these, i'll upload to the repo and update the website.

> Need a project logo
> -------------------
>
>                 Key: BOOKKEEPER-31
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-31
>             Project: Bookkeeper
>          Issue Type: Improvement
>            Reporter: Benjamin Reed
>            Assignee: Benjamin Reed
>         Attachments: bk_1.jpg, bk_2.jpg, bk_3.jpg, bk_4.jpg,
> bookeper_black_sm.png, bookeper_blk.png, bookeper_white_sm.png,
> bookeper_wht.png
>
> we need a logo for the project: something that looks good in the big and the
> small and is easily recognizable.
[jira] [Updated] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1264:
-------------------------------------
    Attachment: ZOOKEEPER-1264.patch

this patch merges camille's test in as well. it also adds a couple of extra asserts and moves around a couple of lines to fix ZOOKEEPER-1282. (i merged in 1282 because the fix and tests were simple modifications of this patch and we need to get this out asap.)

> FollowerResyncConcurrencyTest failing intermittently
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-1264
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: tests
>    Affects Versions: 3.3.3, 3.4.0, 3.5.0
>            Reporter: Patrick Hunt
>            Assignee: Camille Fournier
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0, 3.5.0
>
>         Attachments: ZOOKEEPER-1264-merge.patch, ZOOKEEPER-1264.patch,
> ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch,
> ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch,
> ZOOKEEPER-1264unittest.patch, ZOOKEEPER-1264unittest.patch,
> followerresyncfailure_log.txt.gz, logs.zip, tmp.zip
>
> The FollowerResyncConcurrencyTest test is failing intermittently.
> saw the following on 3.4:
> {noformat}
> junit.framework.AssertionFailedError: Should have same number of
> ephemerals in both followers expected:<11741> but was:<14001>
>     at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
>     at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
>     at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {noformat}
[jira] [Updated] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1264:
-------------------------------------
    Attachment: ZOOKEEPER-1264.patch

i think i got it. camille, can you try it with your test to see if it's fixed there as well? (the tests always passed on my machine.)
[jira] [Updated] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1264:
-------------------------------------
    Attachment: ZOOKEEPER-1264unittest.patch

here is the unit test. doing a snapshot at UPDATE will make this test pass, but i'm afraid it is masking a deeper problem. the question is, why does it fix the problem?
[jira] [Updated] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-1264:
-------------------------------------

    Attachment: ZOOKEEPER-1264unittest.patch

i've created a unit test to reproduce the problem since we can test it more directly and deterministically, but i can't seem to make it happen. i'm attaching my unit test patch just in case you or camille can see what i'm missing. if i understand it, the problem is that we are losing proposals that are received between the NEWLEADER and the UPDATE, but a follower doesn't send out any acks during that time, so it's okay to lose them. am i misunderstanding the problem?

> FollowerResyncConcurrencyTest failing intermittently
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-1264
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: tests
>    Affects Versions: 3.3.3, 3.4.0, 3.5.0
>            Reporter: Patrick Hunt
>            Assignee: Camille Fournier
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0, 3.5.0
>
>         Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch,
>                      ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch,
>                      ZOOKEEPER-1264unittest.patch, followerresyncfailure_log.txt.gz,
>                      logs.zip, tmp.zip
>
>
> The FollowerResyncConcurrencyTest test is failing intermittently.
> saw the following on 3.4:
> {noformat}
> junit.framework.AssertionFailedError: Should have same number of
> ephemerals in both followers expected:<11741> but was:<14001>
>     at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
>     at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
>     at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {noformat}
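The comment above describes a window between NEWLEADER and what it calls the UPDATE (the packet that ends the sync phase) in which proposals can arrive while the follower is still syncing. The toy model below sketches the assumption under discussion, that such proposals must be buffered during the window and applied once sync completes, so that dropping the buffer would silently lose transactions. Every name here (`SyncWindowSketch`, `Packet`, the enum values) is illustrative; none of it is ZooKeeper's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// A toy model of the follower sync window being discussed.
public class SyncWindowSketch {
    enum Type { NEWLEADER, PROPOSAL, UPTODATE }

    static class Packet {
        final Type type;
        final int zxid;
        Packet(Type type, int zxid) { this.type = type; this.zxid = zxid; }
    }

    // Buffer proposals that arrive between NEWLEADER and the end-of-sync
    // packet, then apply them before the follower starts serving clients.
    // Discarding `pending` instead of draining it is the hypothesized way
    // to lose transactions without ever having acked them.
    static List<Integer> sync(Iterable<Packet> stream) {
        List<Integer> applied = new ArrayList<>();
        Queue<Packet> pending = new ArrayDeque<>();
        boolean sawNewLeader = false;
        for (Packet p : stream) {
            switch (p.type) {
                case NEWLEADER:
                    sawNewLeader = true;
                    break;
                case PROPOSAL:
                    if (sawNewLeader) pending.add(p); // inside the window: buffer
                    break;
                case UPTODATE:
                    while (!pending.isEmpty()) applied.add(pending.poll().zxid);
                    return applied; // follower is now in sync
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        List<Packet> stream = List.of(
            new Packet(Type.NEWLEADER, 0),
            new Packet(Type.PROPOSAL, 1),
            new Packet(Type.PROPOSAL, 2),
            new Packet(Type.UPTODATE, 0));
        System.out.println(sync(stream)); // prints [1, 2]
    }
}
```

In this model the follower sends no acks for buffered proposals until after UPTODATE, which matches the comment's argument for why losing them might be safe from the leader's point of view, even though the follower's own state would then diverge.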
[jira] [Updated] (BOOKKEEPER-88) derby doesn't like - in the topic names
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-88?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated BOOKKEEPER-88:
------------------------------------

    Attachment: BOOKKEEPER-88.patch

> derby doesn't like - in the topic names
> ---------------------------------------
>
>                 Key: BOOKKEEPER-88
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-88
>             Project: Bookkeeper
>          Issue Type: Bug
>            Reporter: Benjamin Reed
>            Priority: Minor
>
>         Attachments: BOOKKEEPER-88.patch
>
>
> it's just a benchmark, but it is convenient to be able to use derby as a
> backend for the hedwig benchmark. derby does not support - in topic names.
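Unquoted Derby SQL identifiers cannot contain `-`, so a benchmark that derives table names from topic names needs either a character mapping or delimited (quoted) identifiers. A small sketch of both options; this is illustrative and may not be what BOOKKEEPER-88.patch actually does.

```java
public class DerbyTopicName {
    // Option 1: map '-' to '_' so the topic name is a legal ordinary
    // (unquoted) Derby identifier.
    static String toTableName(String topic) {
        return topic.replace('-', '_');
    }

    // Option 2: keep the '-' by emitting a delimited identifier,
    // doubling any embedded double quotes per SQL rules.
    static String quoted(String topic) {
        return "\"" + topic.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(toTableName("hedwig-bench-topic")); // prints hedwig_bench_topic
        System.out.println(quoted("hedwig-bench-topic"));      // prints "hedwig-bench-topic"
    }
}
```

The mapping approach can collide (`a-b` and `a_b` map to the same table), so quoting is safer if topic names are arbitrary; for a benchmark with controlled names the simple mapping is usually enough.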