[jira] [Updated] (ZOOKEEPER-1390) some expensive debug code not protected by a check for debug

2012-02-09 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1390:
-

Attachment: ZOOKEEPER-1390.patch

This fixes the performance issue. I found that the improvement is anywhere from 
5% (with 100% reads) to almost 100% (with 100% writes and 3 servers).

No tests are included, since this is not a bug fix and does not add functionality.

> some expensive debug code not protected by a check for debug
> 
>
> Key: ZOOKEEPER-1390
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1390
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Reporter: Benjamin Reed
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1390.patch
>
>
> There is some expensive debug code in DataTree.processTxn() that formats 
> transactions for logging. The formatting is costly, yet its output is only 
> used when errors happen and debugging is turned on.
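The standard remedy, which the patch presumably applies, is to guard the costly formatting behind a debug-level check. A minimal sketch of the pattern (java.util.logging stands in for ZooKeeper's actual logging setup so the example is self-contained, and formatTxn is a hypothetical stand-in for the expensive formatter):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch of the guard pattern: skip expensive log-message construction
// entirely unless the relevant log level is actually enabled.
public class DebugGuardExample {
    private static final Logger LOG = Logger.getLogger("DebugGuardExample");

    // Hypothetical stand-in for an expensive transaction pretty-printer.
    static String formatTxn(long zxid) {
        return String.format("txn zxid=0x%x", zxid);
    }

    static void processTxn(long zxid) {
        // Unguarded, the formatting cost would be paid on every transaction
        // even when the debug output is discarded. The guard avoids it:
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("Processing " + formatTxn(zxid));
        }
    }

    public static void main(String[] args) {
        processTxn(0x40027L); // formatting runs only if FINE logging is on
        System.out.println(formatTxn(0x40027L));
    }
}
```

The same guard explains the measured spread: reads rarely hit this path, while a write-heavy workload pays the formatting cost on every transaction.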

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (ZOOKEEPER-1390) some expensive debug code not protected by a check for debug

2012-02-09 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1390:
-

  Component/s: server
Fix Version/s: 3.5.0





[jira] [Updated] (ZOOKEEPER-1387) Wrong epoch file created

2012-02-07 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1387:
-

Attachment: ZOOKEEPER-1387.patch

This then makes the change proposed by Benjamin and includes a test.

> Wrong epoch file created
> 
>
> Key: ZOOKEEPER-1387
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1387
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.2
>Reporter: Benjamin Busjaeger
>Priority: Minor
> Attachments: ZOOKEEPER-1387.patch
>
>
> It looks like line 443 in QuorumPeer [1] may need to change from:
> writeLongToFile(CURRENT_EPOCH_FILENAME, acceptedEpoch);
> to
> writeLongToFile(ACCEPTED_EPOCH_FILENAME, acceptedEpoch);
> I only noticed this reading the code, so I may be wrong and I don't know yet 
> if/how this affects the runtime.
> [1] 
> https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L443
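The fix is a one-token change, but the bookkeeping around it is easy to sketch. The names below follow QuorumPeer, while the I/O is simplified to plain file writes for illustration; treat this as a sketch under those assumptions, not the real implementation:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified model of the two epoch files a quorum peer maintains:
// acceptedEpoch (highest epoch proposal seen) and currentEpoch (epoch
// of the leader the peer last followed).
public class EpochFiles {
    static final String CURRENT_EPOCH_FILENAME = "currentEpoch";
    static final String ACCEPTED_EPOCH_FILENAME = "acceptedEpoch";

    private final Path dataDir;

    EpochFiles(Path dataDir) {
        this.dataDir = dataDir;
    }

    void writeLongToFile(String name, long value) throws IOException {
        Files.write(dataDir.resolve(name),
                Long.toString(value).getBytes(StandardCharsets.UTF_8));
    }

    // The reported bug: the accepted epoch was being written under
    // CURRENT_EPOCH_FILENAME. The fix writes it under its own file:
    void setAcceptedEpoch(long acceptedEpoch) throws IOException {
        writeLongToFile(ACCEPTED_EPOCH_FILENAME, acceptedEpoch);
    }

    void setCurrentEpoch(long currentEpoch) throws IOException {
        writeLongToFile(CURRENT_EPOCH_FILENAME, currentEpoch);
    }
}
```

With the original code, accepting a new epoch would clobber the current-epoch file while leaving the accepted-epoch file stale, which is why the two filenames must not be swapped.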





[jira] [Updated] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-31 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1367:
-

Attachment: ZOOKEEPER-1367-3.3.patch

Here is the patch to fix 3.3, with a testcase that reproduces the bug on 3.3.

> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> 
>
> Key: ZOOKEEPER-1367
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.2
> Environment: Debian Squeeze, 64-bit
>Reporter: Jeremy Stribling
>Assignee: Benjamin Reed
>Priority: Blocker
> Fix For: 3.3.5, 3.4.3, 3.5.0
>
> Attachments: 1367-3.3.patch, ZOOKEEPER-1367-3.3.patch, 
> ZOOKEEPER-1367-3.4.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.patch, 
> ZOOKEEPER-1367.tgz
>
>
> In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
> all three, and then restart just two of them.  Sometimes we notice that on 
> one of the restarted servers, ephemeral nodes from previous sessions do not 
> get deleted, while on the other server they do.  We are effectively running 
> 3.4.2, though technically we are running 3.4.1 with the patch manually 
> applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
> ZOOKEEPER-1163.
> I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
> zkid 84), I saw only one znode in a particular path:
> {quote}
> [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
> [nominee11]
> [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
> 90.0.0.222: 
> cZxid = 0x40027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x40027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x40027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> {quote}
> However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
> I saw three znodes under that same path:
> {quote}
> [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
> nominee06   nominee10   nominee11
> [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
> 90.0.0.222: 
> cZxid = 0x40027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x40027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x40027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
> 90.0.0.221: 
> cZxid = 0x3014c
> ctime = Thu Jan 19 07:53:42 UTC 2012
> mZxid = 0x3014c
> mtime = Thu Jan 19 07:53:42 UTC 2012
> pZxid = 0x3014c
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc22
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
> 90.0.0.223: 
> cZxid = 0x20cab
> ctime = Thu Jan 19 08:00:30 UTC 2012
> mZxid = 0x20cab
> mtime = Thu Jan 19 08:00:30 UTC 2012
> pZxid = 0x20cab
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x5434f5074e040002
> dataLength = 16
> numChildren = 0
> {quote}
> These never went away for the lifetime of the server, for any clients 
> connected directly to that server.  Note that this cluster is configured to 
> have all three servers still, the third one being down (90.0.0.223, zkid 162).
> I captured the data/snapshot directories for the two live servers.  When 
> I start single-node servers using each directory, I can briefly see that the 
> inconsistent data is present in those logs, though the ephemeral nodes seem 
> to get (correctly) cleaned up pretty soon after I start the server.
> I will upload a tar containing the debug logs and data directories from the 
> failure.  I think we can reproduce it regularly if you need more info.
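A quick way to see which server minted a lingering ephemeral's session is to decode its ephemeralOwner: ZooKeeper session ids carry the originating server's id in the high byte. This is an implementation detail of SessionTrackerImpl, so treat the sketch below as a debugging heuristic rather than a public API:

```java
public class SessionOwner {
    // Session ids are minted roughly as (serverId << 56) | time-derived bits,
    // so the unsigned high byte recovers the server that created the session.
    static long serverIdOf(long sessionId) {
        return sessionId >>> 56;
    }

    public static void main(String[] args) {
        // ephemeralOwner values taken from the zkCli output above; the high
        // bytes 0xa2 and 0x54 decode to 162 and 84, two zkids in this cluster.
        System.out.println(serverIdOf(0xa234f4f3bc220001L)); // 162
        System.out.println(serverIdOf(0x5434f5074e040002L)); // 84
    }
}
```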





[jira] [Updated] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-27 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1367:
-

Attachment: ZOOKEEPER-1367-3.4.patch

Here is the patch for the 3.4 branch.

> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> 
>
> Key: ZOOKEEPER-1367
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.2
> Environment: Debian Squeeze, 64-bit
>Reporter: Jeremy Stribling
>Priority: Blocker
> Fix For: 3.4.3
>
> Attachments: ZOOKEEPER-1367-3.4.patch, ZOOKEEPER-1367.patch, 
> ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz
>
>




[jira] [Updated] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-27 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1367:
-

Attachment: ZOOKEEPER-1367.patch

Fixed the LearnerTest (it needed to simulate the ZK server a bit more robustly).

> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> 
>
> Key: ZOOKEEPER-1367
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.2
> Environment: Debian Squeeze, 64-bit
>Reporter: Jeremy Stribling
>Priority: Blocker
> Fix For: 3.4.3
>
> Attachments: ZOOKEEPER-1367.patch, ZOOKEEPER-1367.patch, 
> ZOOKEEPER-1367.tgz
>
>




[jira] [Updated] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-27 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1367:
-

Attachment: ZOOKEEPER-1367.patch

OK, this should fix the bug; it also includes a test that reliably reproduces it.

> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> 
>
> Key: ZOOKEEPER-1367
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.2
> Environment: Debian Squeeze, 64-bit
>Reporter: Jeremy Stribling
>Priority: Blocker
> Fix For: 3.4.3
>
> Attachments: ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz
>
>




[jira] [Updated] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)

2012-01-25 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1355:
-

Hadoop Flags: Reviewed

> Add zk.updateServerList(newServerList) 
> ---
>
> Key: ZOOKEEPER-1355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
> Project: ZooKeeper
>  Issue Type: New Feature
>  Components: java client
>Reporter: Alexander Shraer
>Assignee: Alexander Shraer
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1355-ver2.patch, ZOOKEEPER-1355-ver4.patch, 
> ZOOKEEPER-1355-ver5.patch, ZOOKEEPER=1355-ver3.patch, 
> ZOOOKEEPER-1355-test.patch, ZOOOKEEPER-1355-ver1.patch, 
> ZOOOKEEPER-1355.patch, loadbalancing-more-details.pdf, loadbalancing.pdf
>
>
> When the set of servers changes, we would like to update the server list 
> stored by clients without restarting the clients.
> Moreover, assuming that the number of clients per server is the same (in 
> expectation) in the old configuration (as guaranteed by the current list 
> shuffling for example), we would like to re-balance client connections across 
> the new set of servers in a way that a) the number of clients per server is 
> the same for all servers (in expectation) and b) there is no 
> excessive/unnecessary client migration.
> It is simple to achieve (a) without (b) - just re-shuffle the new list of 
> servers at every client. But this would create unnecessary migration, which 
> we'd like to avoid.
> We propose a simple probabilistic migration scheme that achieves (a) and (b) 
> - each client locally decides whether and where to migrate when the list of 
> servers changes. The attached document describes the scheme and shows an 
> evaluation of it in Zookeeper. We also implemented re-balancing through a 
> consistent-hashing scheme and show a comparison. We derived the probabilistic 
> migration rules from a simple formula that we can also provide, if someone's 
> interested in the proof.
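One migration rule of the kind described can be sketched as follows. The exact scheme is in the attached PDFs, so the probability used here is an assumption for illustration: when the list grows from M_old to M_new servers, each client independently moves with probability 1 − M_old/M_new to a uniformly chosen added server, and stays put otherwise.

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Illustrative local re-balancing rule (not the exact scheme from the
// issue's attachments): each client runs this on its own when the server
// list expands, with no coordination required.
public class Rebalance {
    static <T> T pickAfterExpansion(T current, List<T> oldServers,
                                    List<T> newServers, Random rnd) {
        List<T> added = newServers.stream()
                .filter(s -> !oldServers.contains(s))
                .collect(Collectors.toList());
        if (added.isEmpty()) {
            return current; // nothing was added; stay connected
        }
        // Move with probability 1 - |old|/|new|; movers pick uniformly
        // among the added servers, stayers keep their connection.
        double pMove = 1.0 - (double) oldServers.size() / newServers.size();
        if (rnd.nextDouble() < pMove) {
            return added.get(rnd.nextInt(added.size()));
        }
        return current;
    }
}
```

With N clients balanced over M_old servers, a kept server retains (N/M_old)·(M_old/M_new) = N/M_new clients in expectation, and each added server receives N·(1 − M_old/M_new)/(M_new − M_old) = N/M_new, so load stays uniform (a) while only the minimum expected fraction of clients migrates (b).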





[jira] [Updated] (ZOOKEEPER-1319) Missing data after restarting+expanding a cluster

2011-12-07 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1319:
-

Attachment: ZOOKEEPER-1319_trunk2.patch

This should fix everything. I'd like to add a couple more unit tests, but the 
functional fixes are in.

> Missing data after restarting+expanding a cluster
> -
>
> Key: ZOOKEEPER-1319
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1319
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.0
> Environment: Linux (Debian Squeeze)
>Reporter: Jeremy Stribling
>Assignee: Patrick Hunt
>Priority: Blocker
>  Labels: cluster, data
> Fix For: 3.5.0, 3.4.1
>
> Attachments: ZOOKEEPER-1319.patch, ZOOKEEPER-1319.patch, 
> ZOOKEEPER-1319_trunk.patch, ZOOKEEPER-1319_trunk2.patch, logs.tgz
>
>
> I've been trying to update to ZK 3.4.0 and have had some issues where some 
> data become inaccessible after adding a node to a cluster.  My use case is a 
> bit strange (as explained before on this list) in that I try to grow the 
> cluster dynamically by having an external program automatically restart 
> Zookeeper servers in a controlled way whenever the list of participating ZK 
> servers needs to change.  This used to work just fine in 3.3.3 (and before), 
> so this represents a regression.
> The scenario I see is this:
> 1) Start up a 1-server ZK cluster (the server has ZK ID 0).
> 2) A client connects to the server, and makes a bunch of znodes, in 
> particular a znode called "/membership".
> 3) Shut down the cluster.
> 4) Bring up a 2-server ZK cluster, including the original server 0 with its 
> existing data, and a new server with ZK ID 1.
> 5) Node 0 has the highest zxid and is elected leader.
> 6) A client connecting to server 1 tries to "get /membership" and gets back a 
> -101 error code (no such znode).
> 7) The same client then tries to "create /membership" and gets back a -110 
> error code (znode already exists).
> 8) Clients connecting to server 0 can successfully "get /membership".
> I will attach a tarball with debug logs for both servers, annotating where 
> steps #1 and #4 happen.  You can see that the election involves a proposal 
> for zxid 110 from server 0, but immediately following the election server 1 
> has these lines:
> 2011-12-05 17:18:48,308 9299 [QuorumPeer[myid=1]/127.0.0.1:2901] WARN 
> org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x10001 expected 
> 0x1
> 2011-12-05 17:18:48,313 9304 [SyncThread:1] INFO 
> org.apache.zookeeper.server.persistence.FileTxnLog  - Creating new log file: 
> log.10001
> Perhaps that's not relevant, but it struck me as odd.  At the end of server 
> 1's log you can see a repeated cycle of getData->create->getData as the 
> client tries to make sense of the inconsistent responses.
> The other piece of information is that if I try to use the on-disk 
> directories for either of the servers to start a new one-node ZK cluster, all 
> the data are accessible.
> I haven't tried writing a program outside of my application to reproduce 
> this, but I can do it very easily with some of my app's tests if anyone needs 
> more information.
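The "Got zxid 0x10001 expected 0x1" warning is easier to interpret once zxids are decomposed: ZooKeeper packs the leader epoch into the high 32 bits of a zxid and a per-epoch counter into the low 32 bits. The helpers below mirror what ZooKeeper's ZxidUtils does; the possibly truncated values in the log lines above are left as printed.

```java
public class Zxid {
    // A zxid is (epoch << 32) | counter; these helpers split it back apart.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long zxid = 0x100000001L; // epoch 1, first transaction of that epoch
        System.out.println(epoch(zxid) + " " + counter(zxid));
    }
}
```

A mismatch in the epoch half of a zxid right after election, as in the log above, points at the leader/learner epoch handoff rather than at lost transactions.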





[jira] [Updated] (BOOKKEEPER-31) Need a project logo

2011-12-06 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/BOOKKEEPER-31?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated BOOKKEEPER-31:


Attachment: bookeper_wht.png
bookeper_blk.png

Here are hi-res PNGs, one for a white background and the other for black. If 
everyone is good with these, I'll upload them to the repo and update the website.

> Need a project logo
> ---
>
> Key: BOOKKEEPER-31
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-31
> Project: Bookkeeper
>  Issue Type: Improvement
>Reporter: Benjamin Reed
>Assignee: Benjamin Reed
> Attachments: bk_1.jpg, bk_2.jpg, bk_3.jpg, bk_4.jpg, 
> bookeper_black_sm.png, bookeper_blk.png, bookeper_white_sm.png, 
> bookeper_wht.png
>
>
> We need a logo for the project: something that looks good both large and 
> small and is easily recognizable.





[jira] [Updated] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-02 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1264:
-

Attachment: ZOOKEEPER-1264.patch

This patch merges Camille's test in as well. It also adds a couple of extra 
asserts to cover ZOOKEEPER-1282. Finally, it moves around a couple of lines to 
fix ZOOKEEPER-1282. (I merged in 1282 because the fix and tests were simple 
modifications of this patch and we need to get this out ASAP.)

> FollowerResyncConcurrencyTest failing intermittently
> 
>
> Key: ZOOKEEPER-1264
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 3.3.3, 3.4.0, 3.5.0
>Reporter: Patrick Hunt
>Assignee: Camille Fournier
>Priority: Blocker
> Fix For: 3.3.4, 3.4.0, 3.5.0
>
> Attachments: ZOOKEEPER-1264-merge.patch, ZOOKEEPER-1264.patch, 
> ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
> ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, 
> ZOOKEEPER-1264unittest.patch, ZOOKEEPER-1264unittest.patch, 
> followerresyncfailure_log.txt.gz, logs.zip, tmp.zip
>
>
> The FollowerResyncConcurrencyTest test is failing intermittently. 
> saw the following on 3.4:
> {noformat}
> junit.framework.AssertionFailedError: Should have same number of
> ephemerals in both followers expected:<11741> but was:<14001>
>at 
> org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
>at 
> org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
>at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {noformat}





[jira] [Updated] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-02 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1264:
-

Attachment: ZOOKEEPER-1264.patch

I think I got it. Camille, can you try it with your test to see if it's fixed 
there as well? (The tests always passed on my machine.)

> FollowerResyncConcurrencyTest failing intermittently
> 
>
> Key: ZOOKEEPER-1264
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 3.3.3, 3.4.0, 3.5.0
>Reporter: Patrick Hunt
>Assignee: Camille Fournier
>Priority: Blocker
> Fix For: 3.3.4, 3.4.0, 3.5.0
>
> Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
> ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
> ZOOKEEPER-1264_branch34.patch, ZOOKEEPER-1264unittest.patch, 
> ZOOKEEPER-1264unittest.patch, followerresyncfailure_log.txt.gz, logs.zip, 
> tmp.zip
>
>




[jira] [Updated] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-02 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1264:
-

Attachment: ZOOKEEPER-1264unittest.patch

Here is the unit test. Doing a snapshot at UPDATE will make this test pass, but 
I'm afraid it is masking a deeper problem. The question is: why does it fix the 
problem?

> FollowerResyncConcurrencyTest failing intermittently
> 
>
> Key: ZOOKEEPER-1264
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 3.3.3, 3.4.0, 3.5.0
>Reporter: Patrick Hunt
>Assignee: Camille Fournier
>Priority: Blocker
> Fix For: 3.3.4, 3.4.0, 3.5.0
>
> Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
> ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, 
> ZOOKEEPER-1264unittest.patch, ZOOKEEPER-1264unittest.patch, 
> followerresyncfailure_log.txt.gz, logs.zip, tmp.zip
>
>
> The FollowerResyncConcurrencyTest is failing intermittently. We saw the
> following on 3.4:
> {noformat}
> junit.framework.AssertionFailedError: Should have same number of
> ephemerals in both followers expected:<11741> but was:<14001>
>at 
> org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
>at 
> org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
>at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {noformat}





[jira] [Updated] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-01 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated ZOOKEEPER-1264:
-

Attachment: ZOOKEEPER-1264unittest.patch

I've created a unit test to reproduce the problem, since that way we can test it 
more directly and deterministically, but I can't seem to make it happen. I'm 
attaching my unit test patch in case you or Camille can see what I'm missing.

If I understand it, the problem is that we are losing proposals that are 
received between the NEWLEADER and the UPDATE, but a follower doesn't send out 
any acks during that time, so it's okay to lose them. Am I misunderstanding the 
problem?
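The window being discussed can be sketched as a toy model (the class and method names below are invented for illustration and are not the actual Learner/Follower code): proposals that arrive after NEWLEADER but before UPTODATE are buffered and applied once sync completes, rather than dropped.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, simplified model of a follower's sync phase. Proposals that
// arrive during the NEWLEADER..UPTODATE window are queued, then drained when
// sync completes, so nothing in flight is lost.
class SyncPhaseModel {
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> applied = new ArrayList<>();
    private boolean upToDate = false;

    void onProposal(String txn) {
        if (upToDate) {
            applied.add(txn);   // normal broadcast path
        } else {
            pending.add(txn);   // in flight during sync: buffer, don't drop
        }
    }

    void onUpToDate() {
        // drain everything received between NEWLEADER and UPTODATE
        while (!pending.isEmpty()) {
            applied.add(pending.poll());
        }
        upToDate = true;
    }

    List<String> applied() { return applied; }
}

public class SyncPhaseDemo {
    public static void main(String[] args) {
        SyncPhaseModel f = new SyncPhaseModel();
        f.onProposal("txn-1");  // arrives during the sync window
        f.onProposal("txn-2");
        f.onUpToDate();
        f.onProposal("txn-3");  // arrives after sync completes
        System.out.println(f.applied());  // [txn-1, txn-2, txn-3]
    }
}
```

If the real code drops (rather than buffers) such proposals, the two followers would end up applying different transaction sets, which would match the differing ephemeral counts seen in the test failure.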

> FollowerResyncConcurrencyTest failing intermittently
> 
>
> Key: ZOOKEEPER-1264
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 3.3.3, 3.4.0, 3.5.0
>Reporter: Patrick Hunt
>Assignee: Camille Fournier
>Priority: Blocker
> Fix For: 3.3.4, 3.4.0, 3.5.0
>
> Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
> ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, 
> ZOOKEEPER-1264unittest.patch, followerresyncfailure_log.txt.gz, logs.zip, 
> tmp.zip
>
>
> The FollowerResyncConcurrencyTest is failing intermittently. We saw the
> following on 3.4:
> {noformat}
> junit.framework.AssertionFailedError: Should have same number of
> ephemerals in both followers expected:<11741> but was:<14001>
>at 
> org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
>at 
> org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
>at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {noformat}





[jira] [Updated] (BOOKKEEPER-88) derby doesn't like - in the topic names

2011-10-19 Thread Benjamin Reed (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/BOOKKEEPER-88?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated BOOKKEEPER-88:


Attachment: BOOKKEEPER-88.patch

> derby doesn't like - in the topic names
> ---
>
> Key: BOOKKEEPER-88
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-88
> Project: Bookkeeper
>  Issue Type: Bug
>Reporter: Benjamin Reed
>Priority: Minor
> Attachments: BOOKKEEPER-88.patch
>
>
> It's just a benchmark, but it is convenient to be able to use Derby as a 
> backend for the Hedwig benchmark. Derby does not support '-' in topic names.
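One way to work around this (a sketch under assumed names, not the attached patch) is to map topic names onto legal unquoted Derby SQL identifiers before using them, replacing any character outside `[A-Za-z0-9_]` and guarding against a leading digit:

```java
// Hypothetical mapper from Hedwig topic names to Derby-safe identifiers.
// Derby's ordinary (unquoted) identifiers allow only letters, digits, and
// underscores, and may not start with a digit.
public class TopicNameMapper {
    static String toDerbyTable(String topic) {
        String cleaned = topic.replaceAll("[^A-Za-z0-9_]", "_");
        if (cleaned.isEmpty() || Character.isDigit(cleaned.charAt(0))) {
            cleaned = "T_" + cleaned;  // avoid a leading digit
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(toDerbyTable("bench-topic-7"));  // bench_topic_7
    }
}
```

An alternative is to use Derby's delimited (double-quoted) identifiers, which do allow hyphens, at the cost of quoting every reference to the table.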
