[jira] [Commented] (ZOOKEEPER-1442) Uncaught exception handler should exit on a java.lang.Error

2012-04-13 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253568#comment-13253568
 ] 

Henry Robinson commented on ZOOKEEPER-1442:
---

The current method of logging the error seems problematic: if the exception 
really is an OOME, the message string passed to LOG may itself fail to be 
allocated, since the concatenation means it cannot be interned and needs a 
fresh allocation. 

It doesn't seem correct to exit on ThreadDeath either, based on a cursory 
reading of its documentation.

The usual advice on java.lang.Error seems to be that we should not be trying to 
catch it at all. That said, I'm not against failing fast, because I've seen Java 
server processes go OOM and then just keep ticking along like zombies, causing 
strange bugs when some operations appear to take effect and some don't. 
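A rough sketch of the fail-fast handler being discussed, under the assumptions above (the class name and message are hypothetical illustrations, not the actual patch): it exits only on Error, excludes ThreadDeath, and avoids building a concatenated log message that could itself fail under OOME.

```java
// Hypothetical sketch of an uncaught-exception handler that logs and then
// exits with an error code when the throwable is a java.lang.Error.
class FatalErrorHandler implements Thread.UncaughtExceptionHandler {

    static boolean isFatal(Throwable t) {
        // ThreadDeath is excluded, per the cursory documentation reading above
        return (t instanceof Error) && !(t instanceof ThreadDeath);
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // Use a constant message: under OOME, a concatenated string may
        // itself fail to allocate.
        System.err.println("Severe unhandled throwable in thread");
        if (isFatal(e)) {
            // halt() skips shutdown hooks, which may be unreliable after OOME
            Runtime.getRuntime().halt(1);
        }
    }
}
```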

> Uncaught exception handler should exit on a java.lang.Error
> ---
>
> Key: ZOOKEEPER-1442
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: java client, server
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Jeremy Stribling
>Assignee: Jeremy Stribling
>Priority: Minor
> Attachments: ZOOKEEPER-1442.patch
>
>
> The uncaught exception handler registered in NIOServerCnxnFactory and 
> ClientCnxn simply logs exceptions and lets the rest of ZooKeeper go on its 
> merry way.  However, errors such as OutOfMemoryErrors should really crash the 
> program, as they represent unrecoverable errors.  If the exception that gets 
> to the uncaught exception handler is an instanceof a java.lang.Error, ZK 
> should exit with an error code (in addition to logging the error).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1433) improve ZxidRolloverTest (test seems flakey)

2012-03-29 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241973#comment-13241973
 ] 

Henry Robinson commented on ZOOKEEPER-1433:
---

+1, looks good to me.

> improve ZxidRolloverTest (test seems flakey)
> 
>
> Key: ZOOKEEPER-1433
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1433
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: tests
>Affects Versions: 3.3.5
>Reporter: Wing Yew Poon
>Assignee: Patrick Hunt
> Fix For: 3.3.6, 3.4.4, 3.5.0
>
> Attachments: ZOOKEEPER-1433.patch, ZOOKEEPER-1433_test.out
>
>
> In our jenkins job to run the ZooKeeper unit tests, 
> org.apache.zookeeper.server.ZxidRolloverTest sometimes fails.
> E.g.,
> {noformat}
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /foo0
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
>   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:843)
>   at 
> org.apache.zookeeper.server.ZxidRolloverTest.checkNodes(ZxidRolloverTest.java:154)
>   at 
> org.apache.zookeeper.server.ZxidRolloverTest.testRolloverThenRestart(ZxidRolloverTest.java:211)
> {noformat}





[jira] [Commented] (ZOOKEEPER-1435) cap space usage of default log4j rolling policy

2012-03-29 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241972#comment-13241972
 ] 

Henry Robinson commented on ZOOKEEPER-1435:
---

+1 looks good to me, I'll commit shortly. 

> cap space usage of default log4j rolling policy
> ---
>
> Key: ZOOKEEPER-1435
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1435
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: scripts
>Affects Versions: 3.4.3, 3.3.5, 3.5.0
>Reporter: Patrick Hunt
>Assignee: Patrick Hunt
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1435.patch
>
>
> Our current log4j log rolling policy (for ROLLINGFILE) doesn't cap the max 
> logging space used. This can be a problem in production systems. See similar 
> improvements recently made in hadoop: HADOOP-8149
> For ROLLINGFILE only, I believe we should change the default threshold to 
> INFO and cap the max space to something reasonable, say 5g (max file size of 
> 256mb, max file count of 20). These will be the defaults in log4j.properties, 
> which you would also be able to override from the command line.
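The proposed defaults might be expressed in log4j.properties roughly as follows (a sketch: the appender name follows the description, but the exact values and property layout here are assumptions, not the committed configuration):

```properties
# Hypothetical ROLLINGFILE defaults capping total space at ~5g
# (256mb per file, 20 files), per the description above
log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.ROLLINGFILE.Threshold=INFO
log4j.appender.ROLLINGFILE.MaxFileSize=256MB
log4j.appender.ROLLINGFILE.MaxBackupIndex=20
```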





[jira] [Commented] (ZOOKEEPER-1395) node-watcher double-free redux

2012-03-26 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239105#comment-13239105
 ] 

Henry Robinson commented on ZOOKEEPER-1395:
---

+1 this looks sensible to me. I'll commit tonight or tomorrow. 

> node-watcher double-free redux
> --
>
> Key: ZOOKEEPER-1395
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1395
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, contrib-bindings
>Affects Versions: 3.3.4
>Reporter: Mike Lundy
>Assignee: Mike Lundy
>Priority: Critical
> Attachments: 3.3-0001-notify-on-event-state-not-current-state.patch, 
> 3.[45]-0001-notify-on-event-state-not-current-state.patch
>
>
> This is basically the same issue as ZOOKEEPER-888 and ZOOKEEPER-740 (the 
> latter is open as I write this, but it was superseded by the fix that went in 
> with 888). The problem still exists after the ZOOKEEPER-888 patch, however; 
> it's just more difficult to trigger:
> 1) Zookeeper notices connection loss, schedules watcher_dispatch
> 2) Zookeeper notices session loss, schedules watcher_dispatch
> 3) watcher_dispatch runs for connection loss
> 4) pywatcher is freed due to is_unrecoverable being true
> 5) watcher_dispatch runs for session loss
> 6) PyObject_CallObject attempts to run freed pywatcher with varying bad 
> results
> The fix is easy, the dispatcher should act on the state it is given, not the 
> state of the world when it runs. (Patch attached). Reliably triggering the 
> crash is tricky due to the race, but it's not theoretical.





[jira] [Commented] (ZOOKEEPER-1161) Provide an option for disabling auto-creation of the data directory

2012-03-06 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223084#comment-13223084
 ] 

Henry Robinson commented on ZOOKEEPER-1161:
---

I just committed this to trunk. Thanks Patrick!

> Provide an option for disabling auto-creation of the data directory
> ---
>
> Key: ZOOKEEPER-1161
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1161
> Project: ZooKeeper
>  Issue Type: New Feature
>  Components: scripts, server
>Reporter: Roman Shaposhnik
>Assignee: Patrick Hunt
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1161.patch, ZOOKEEPER-1161.patch, 
> ZOOKEEPER-1161.patch
>
>
> Currently if ZK starts and doesn't see an existing dataDir it tries to 
> create it. There should be an option to tweak this behavior. As for the 
> default, my personal opinion is to NOT allow autocreate.





[jira] [Commented] (ZOOKEEPER-1161) Provide an option for disabling auto-creation of the data directory

2012-03-05 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222724#comment-13222724
 ] 

Henry Robinson commented on ZOOKEEPER-1161:
---

+1, looks good to me. A couple of nits:

{code}
+
+public static final String ZOOKEEPER_DATADIR_AUTOCREATE =
+"zookeeper.datadir.autocreate";
{code}

Can you add {{public static final String ZOOKEEPER_DATADIR_AUTOCREATE_DEFAULT = 
"true"}} and use that where you've used the raw string? If we ever change the 
default (say, in 4.0), this ensures that, for example, the test will restore 
the default correctly. 

{code}
+if (!enableAutocreate) {
+throw new DatadirException("Missing data directory "
++ this.dataDir + ", please create this directory.");
+}
{code}

Might be nice to explain to the user why we're not creating the directory. 
Perhaps {{"Missing data directory " + this.dataDir + ", automatic data 
directory creation is disabled (zookeeper.datadir.autocreate is false). Please 
create this directory manually."}}

{code}

+try {
+tmpDir = ClientBase.createTmpDir();
+zks = new ZooKeeperServer(tmpDir, tmpDir, 3000);
+f = ServerCnxnFactory.createFactory(PORT, -1);
+f.startup(zks);
+Assert.assertTrue("waiting for server being up ", ClientBase
+.waitForServerUp(HOSTPORT, CONNECTION_TIMEOUT));
+
+Assert.fail("Server should not have started without datadir");
+} catch (IOException e) {
+LOG.info("Server failed to start - correct behavior " + e);
+}
+
+System.setProperty(FileTxnSnapLog.ZOOKEEPER_DATADIR_AUTOCREATE, 
"true");
+}
{code}

Can you put the final setProperty in a finally block, in case a runtime exception 
is thrown and the property setting leaks?
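The suggested try/finally could be sketched like this (the class and method names are hypothetical; only the property key comes from the patch):

```java
// Hypothetical test helper: restore the property even if the body throws,
// so the changed setting cannot leak into later tests.
class AutocreatePropertyGuard {
    static final String PROP = "zookeeper.datadir.autocreate";

    static void runWithAutocreateDisabled(Runnable body) {
        String saved = System.getProperty(PROP);
        System.setProperty(PROP, "false");
        try {
            body.run();
        } finally {
            // Restore the previous value rather than hard-coding "true",
            // in case the default ever changes
            if (saved == null) {
                System.clearProperty(PROP);
            } else {
                System.setProperty(PROP, saved);
            }
        }
    }
}
```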

> Provide an option for disabling auto-creation of the data directory
> ---
>
> Key: ZOOKEEPER-1161
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1161
> Project: ZooKeeper
>  Issue Type: New Feature
>  Components: scripts, server
>Reporter: Roman Shaposhnik
>Assignee: Patrick Hunt
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1161.patch
>
>
> Currently if ZK starts and doesn't see and existing dataDir it tries to 
> create it. There should be an option to tweak this behavior. As for default, 
> my personal opinion is to NOW allow autocreate.





[jira] [Commented] (ZOOKEEPER-1361) Leader.lead iterates over 'learners' set without proper synchronisation

2012-02-26 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217002#comment-13217002
 ] 

Henry Robinson commented on ZOOKEEPER-1361:
---

Awesome, thanks Camille.

> Leader.lead iterates over 'learners' set without proper synchronisation
> ---
>
> Key: ZOOKEEPER-1361
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1361
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.2
>Reporter: Henry Robinson
>Assignee: Henry Robinson
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1361-no-whitespace.patch, ZOOKEEPER-1361.patch
>
>
> This block:
> {code}
> HashSet followerSet = new HashSet();
> for(LearnerHandler f : learners)
> followerSet.add(f.getSid());
> {code}
> is executed without holding the lock on learners, so if there were ever a 
> condition where a new learner was added during the initial sync phase, I'm 
> pretty sure we'd see a concurrent modification exception. Certainly other 
> parts of the code are very careful to lock on learners when iterating. 
> It would be nice to use a {{ConcurrentHashMap}} to hold the learners instead, 
> but I can't convince myself that this wouldn't introduce some correctness 
> bugs. For example the following:
> Learners contains A, B, C, D
> Thread 1 iterates over learners, and gets as far as B.
> Thread 2 removes A, and adds E.
> Thread 1 continues iterating and sees a learner view of A, B, C, D, E
> This may be a bug if Thread 1 is counting the number of synced followers for 
> a quorum count, since at no point was A, B, C, D, E a correct view of the 
> quorum.
> In practice, I think this is actually ok, because I don't think ZK makes any 
> strong ordering guarantees on learners joining or leaving (so we don't need a 
> strong serialisability guarantee on learners) but I don't think I'll make 
> that change for this patch. Instead I want to clean up the locking protocols 
> on the follower / learner sets - to avoid another easy deadlock like the one 
> we saw in ZOOKEEPER-1294 - and to do less with the lock held; i.e. to copy 
> and then iterate over the copy rather than iterate over a locked set. 
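The "copy with the lock held, then iterate the copy" cleanup described at the end could be sketched roughly like this (class and method names are illustrative, not the actual Leader code):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in for the learners set and its locking protocol.
class LearnerView {
    private final List<Long> learnerSids = new ArrayList<>();

    void add(long sid) {
        synchronized (learnerSids) {
            learnerSids.add(sid);
        }
    }

    Set<Long> followerSnapshot() {
        // Hold the lock only long enough to copy; callers then iterate
        // the snapshot without risking ConcurrentModificationException
        // or holding the lock across other lock acquisitions.
        synchronized (learnerSids) {
            return new HashSet<>(learnerSids);
        }
    }
}
```

Concurrent mutation after the snapshot is taken does not affect the iterating thread, at the cost of a slightly stale view.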





[jira] [Commented] (ZOOKEEPER-1361) Leader.lead iterates over 'learners' set without proper synchronisation

2012-02-26 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216992#comment-13216992
 ] 

Henry Robinson commented on ZOOKEEPER-1361:
---

It applies to trunk for me (I don't know that this has to go into 3.4, since we've 
not seen bug reports on this issue). 

My trunk is at ZOOKEEPER-1386, and I can apply the patch cleanly with

patch -p0 < ZOOKEEPER-1361-no-whitespace.patch

Am I out of date?

> Leader.lead iterates over 'learners' set without proper synchronisation
> ---
>
> Key: ZOOKEEPER-1361
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1361
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.2
>Reporter: Henry Robinson
>Assignee: Henry Robinson
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1361-no-whitespace.patch, ZOOKEEPER-1361.patch
>
>
> This block:
> {code}
> HashSet followerSet = new HashSet();
> for(LearnerHandler f : learners)
> followerSet.add(f.getSid());
> {code}
> is executed without holding the lock on learners, so if there were ever a 
> condition where a new learner was added during the initial sync phase, I'm 
> pretty sure we'd see a concurrent modification exception. Certainly other 
> parts of the code are very careful to lock on learners when iterating. 
> It would be nice to use a {{ConcurrentHashMap}} to hold the learners instead, 
> but I can't convince myself that this wouldn't introduce some correctness 
> bugs. For example the following:
> Learners contains A, B, C, D
> Thread 1 iterates over learners, and gets as far as B.
> Thread 2 removes A, and adds E.
> Thread 1 continues iterating and sees a learner view of A, B, C, D, E
> This may be a bug if Thread 1 is counting the number of synced followers for 
> a quorum count, since at no point was A, B, C, D, E a correct view of the 
> quorum.
> In practice, I think this is actually ok, because I don't think ZK makes any 
> strong ordering guarantees on learners joining or leaving (so we don't need a 
> strong serialisability guarantee on learners) but I don't think I'll make 
> that change for this patch. Instead I want to clean up the locking protocols 
> on the follower / learner sets - to avoid another easy deadlock like the one 
> we saw in ZOOKEEPER-1294 - and to do less with the lock held; i.e. to copy 
> and then iterate over the copy rather than iterate over a locked set. 





[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr

2012-02-06 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201398#comment-13201398
 ] 

Henry Robinson commented on ZOOKEEPER-1321:
---

I take your point. I'll regenerate the patch without the whitespace differences. 
I'd like to figure out how to handle all our sloppy trailing whitespace at some 
point, but there's already too much of it in each file to just sneak the 
cleanup in patch-by-patch. 

> Add number of client connections metric in JMX and srvr
> ---
>
> Key: ZOOKEEPER-1321
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321
> Project: ZooKeeper
>  Issue Type: Improvement
>Affects Versions: 3.3.4, 3.4.2
>Reporter: Neha Narkhede
>Assignee: Neha Narkhede
>  Labels: patch
> Attachments: ZOOKEEPER-1321_3.4.patch, ZOOKEEPER-1321_trunk.patch, 
> ZOOKEEPER-1321_trunk.patch, zk-1321-cleanup, zk-1321-trunk.patch, 
> zk-1321.patch, zookeeper-1321-trunk-v2.patch
>
>
> The related conversation on the zookeeper user mailing list is here - 
> http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user
> It is useful to be able to monitor the number of disconnect operations on a 
> client. This is generally indicative of a client going through a large number 
> of GCs and hence disconnecting way too often from a ZooKeeper cluster. 
> Today, this information is only indirectly exposed as part of the stat 
> command, which requires counting the results. That's a lot of work for the 
> server to do just to get a connection count. 
> For monitoring purposes, it will be useful to have this exposed through JMX 
> and 4lw srvr.





[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-30 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196249#comment-13196249
 ] 

Henry Robinson commented on ZOOKEEPER-1367:
---

All three patches (trunk, 3.4 and 3.3) look good to me. +1, nice job. I'm 
comfortable not having the 3.3 test case, although of course it would be good. 

I don't fully understand the reason for this diff:

{code}
@@ -240,8 +242,7 @@
 // Clean up dead sessions
 LinkedList deadSessions = new LinkedList();
 for (long session : zkDb.getSessions()) {
-sessionsWithTimeouts = zkDb.getSessionWithTimeOuts();
-if (sessionsWithTimeouts.get(session) == null) {
+  if (zkDb.getSessionWithTimeOuts().get(session) == null) {
 deadSessions.add(session);
 }
 }
{code}

but if it's just tidying up, that's fine (although it would seem better to lift 
the getSessionWithTimeOuts call to outside the loop). There's also some extra 
whitespace and an unused import or two, but those can get cleaned up later. 
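Hoisting the call out of the loop, as suggested, would look roughly like this (a self-contained stand-in for the ZKDatabase calls, not the actual class):

```java
import java.util.Collection;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Illustrative version of the quoted loop with the timeout-map lookup
// fetched once instead of on every iteration.
class DeadSessionSweep {
    static List<Long> findDeadSessions(Collection<Long> sessions,
                                       Map<Long, Integer> sessionsWithTimeouts) {
        List<Long> dead = new LinkedList<>();
        for (long session : sessions) {
            // A session with no timeout entry is considered dead
            if (sessionsWithTimeouts.get(session) == null) {
                dead.add(session);
            }
        }
        return dead;
    }
}
```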

> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> 
>
> Key: ZOOKEEPER-1367
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.2
> Environment: Debian Squeeze, 64-bit
>Reporter: Jeremy Stribling
>Assignee: Benjamin Reed
>Priority: Blocker
> Fix For: 3.4.3
>
> Attachments: 1367-3.3.patch, ZOOKEEPER-1367-3.4.patch, 
> ZOOKEEPER-1367.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz
>
>
> In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
> all three, and then restart just two of them.  Sometimes we notice that on 
> one of the restarted servers, ephemeral nodes from previous sessions do not 
> get deleted, while on the other server they do.  We are effectively running 
> 3.4.2, though technically we are running 3.4.1 with the patch manually 
> applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
> ZOOKEEPER-1163.
> I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
> zkid 84), I saw only one znode in a particular path:
> {quote}
> [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
> [nominee11]
> [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
> 90.0.0.222: 
> cZxid = 0x40027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x40027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x40027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> {quote}
> However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
> I saw three znodes under that same path:
> {quote}
> [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
> nominee06   nominee10   nominee11
> [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
> 90.0.0.222: 
> cZxid = 0x40027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x40027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x40027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
> 90.0.0.221: 
> cZxid = 0x3014c
> ctime = Thu Jan 19 07:53:42 UTC 2012
> mZxid = 0x3014c
> mtime = Thu Jan 19 07:53:42 UTC 2012
> pZxid = 0x3014c
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc22
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
> 90.0.0.223: 
> cZxid = 0x20cab
> ctime = Thu Jan 19 08:00:30 UTC 2012
> mZxid = 0x20cab
> mtime = Thu Jan 19 08:00:30 UTC 2012
> pZxid = 0x20cab
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x5434f5074e040002
> dataLength = 16
> numChildren = 0
> {quote}
> These never went away for the lifetime of the server, for any clients 
> connected directly to that server.  Note that this cluster is configured to 
> have all three servers still, the third one being down (90.0.0.223, zkid 162).
> I captured the data/snapshot directories for the two live servers.  When 
> I start single-node servers using each directory, I can briefly see that the 
> inconsistent data is present in those logs, though the ephemeral nodes seem 
> to get (correctly) cleaned up pretty soon after I start the server.
> I will upload a tar containing the debug logs and data directories from the 
> failure.  I think we can reproduce it regularly if you need more info.

--

[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2012-01-23 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191376#comment-13191376
 ] 

Henry Robinson commented on ZOOKEEPER-1366:
---

My feeling is that Ted's fixing a legitimate issue here, so we shouldn't hold 
up the patch for a separate effort. Reworking how we deal with time is going to 
be a big effort: Thread.sleep really does complicate things, and there's the 
question of how to actually inject a mock clock (as you say, such method calls 
would need to be non-static, and then we need to figure out how to get the 
right implementation behind them). This patch doesn't get in the way of doing a 
better job with time, and gives us the beginnings of a nice integration point 
for mocking clocks out. 

So I'll file a separate JIRA to track being able to change our clock 
implementation, and we can evaluate this on its own merits (might be nice to 
run a soak test for a few hours here to make sure that there are no weird edge 
cases that somehow got broken). Sound good?
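The injectable-clock idea mentioned above could look roughly like this (the interface and class names are hypothetical sketches, not ZooKeeper's actual API): production code reads a monotonic source, while tests substitute a clock they can advance deterministically.

```java
// Hypothetical elapsed-time abstraction built on a monotonic source.
interface ElapsedClock {
    long elapsedNanos();

    // System.nanoTime() is monotonic and unaffected by wall-clock adjustment
    ElapsedClock SYSTEM = System::nanoTime;
}

// A mock for deterministic tests: time advances only when told to.
class MockClock implements ElapsedClock {
    private long now = 0;

    public long elapsedNanos() {
        return now;
    }

    void advanceMillis(long ms) {
        now += ms * 1_000_000L;
    }
}
```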

> Zookeeper should be tolerant of clock adjustments
> -
>
> Key: ZOOKEEPER-1366
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: Ted Dunning
>Assignee: Ted Dunning
> Fix For: 3.4.3
>
> Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, 
> ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch
>
>
> If you want to wreak havoc on a ZK based system just do [date -s "+1hour"] 
> and watch the mayhem as all sessions expire at once.
> This shouldn't happen.  Zookeeper could easily handle elapsed times as 
> elapsed times rather than as differences between absolute times.  The 
> absolute times are subject to adjustment when the clock is set while a timer 
> is not subject to this problem.  In Java, System.currentTimeMillis() gives 
> you absolute time while System.nanoTime() gives you time based on a timer 
> from an arbitrary epoch.
> I have done this and have been running tests now for some tens of minutes 
> with no failures.  I will set up a test machine to redo the build again on 
> Ubuntu and post a patch here for discussion.





[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2012-01-19 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189560#comment-13189560
 ] 

Henry Robinson commented on ZOOKEEPER-1366:
---

Oh, and my main concern here was non-monotonicity across cores, but [this 
page|http://juliusdavies.ca/posix_clocks/clock_realtime_linux_faq.html] 
alleviates that concern for modern Linux kernels:

{quote}
1. Is clock_gettime(CLOCK_REALTIME) consistent across all processors/cores?
>(Does arch matter?  e.g. ppc, arm, x86, amd64, sparc).
It *should* or it's considered buggy.

However, on x86/x86_64, it is possible to see unsynced or variable freq TSCs 
cause time inconsistencies. 2.4 kernels really had no protection against this, 
and early 2.6 kernels didn't do too well here either. As of 2.6.18 and up the 
logic for detecting this is better and we'll usually fall back to a safe 
clocksource.

ppc always has a synced timebase, so that shouldn't be an issue.

arm, i'm not so familiar with, but i assume they do the right thing (i've not 
seen many arm bugs on this issue).
{quote}

> Zookeeper should be tolerant of clock adjustments
> -
>
> Key: ZOOKEEPER-1366
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: Ted Dunning
> Fix For: 3.4.3
>
> Attachments: ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch
>
>
> If you want to wreak havoc on a ZK based system just do [date -s "+1hour"] 
> and watch the mayhem as all sessions expire at once.
> This shouldn't happen.  Zookeeper could easily handle elapsed times as 
> elapsed times rather than as differences between absolute times.  The 
> absolute times are subject to adjustment when the clock is set while a timer 
> is not subject to this problem.  In Java, System.currentTimeMillis() gives 
> you absolute time while System.nanoTime() gives you time based on a timer 
> from an arbitrary epoch.
> I have done this and have been running tests now for some tens of minutes 
> with no failures.  I will set up a test machine to redo the build again on 
> Ubuntu and post a patch here for discussion.





[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2012-01-19 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189557#comment-13189557
 ] 

Henry Robinson commented on ZOOKEEPER-1366:
---

The nice thing is that this is a small step towards a properly mockable time 
API in ZK, which would a) make tests much faster and b) make tests much more 
deterministic. There's a way to go still because of Thread.sleep and other 
complications, but this is a good first step. 

> Zookeeper should be tolerant of clock adjustments
> -
>
> Key: ZOOKEEPER-1366
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: Ted Dunning
> Fix For: 3.4.3
>
> Attachments: ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch
>
>
> If you want to wreak havoc on a ZK based system just do [date -s "+1hour"] 
> and watch the mayhem as all sessions expire at once.
> This shouldn't happen.  Zookeeper could easily handle elapsed times as 
> elapsed times rather than as differences between absolute times.  The 
> absolute times are subject to adjustment when the clock is set while a timer 
> is not subject to this problem.  In Java, System.currentTimeMillis() gives 
> you absolute time while System.nanoTime() gives you time based on a timer 
> from an arbitrary epoch.
> I have done this and have been running tests now for some tens of minutes 
> with no failures.  I will set up a test machine to redo the build again on 
> Ubuntu and post a patch here for discussion.





[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr

2012-01-16 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187069#comment-13187069
 ] 

Henry Robinson commented on ZOOKEEPER-1321:
---

+1 looks good! Only weirdness to my eyes is the following:

{code}
+public int getNumAliveConnections() {
+int numConnections;
+synchronized(cnxns) {
+numConnections = cnxns.size();
+}
+return numConnections;
+}
{code}

It's perfectly legal to return inside a synchronized block, so it might be more 
concise to have:

{code}

+public int getNumAliveConnections() {
+synchronized(cnxns) {
+return cnxns.size();
+}
+}
{code}

If you fix this nit I'm happy for you to commit this without another review 
pass. 

> Add number of client connections metric in JMX and srvr
> ---
>
> Key: ZOOKEEPER-1321
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321
> Project: ZooKeeper
>  Issue Type: Improvement
>Affects Versions: 3.3.4, 3.4.2
>Reporter: Neha Narkhede
>Assignee: Neha Narkhede
>  Labels: patch
> Attachments: ZOOKEEPER-1321_3.4.patch, ZOOKEEPER-1321_trunk.patch, 
> ZOOKEEPER-1321_trunk.patch, zk-1321-cleanup, zookeeper-1321-trunk-v2.patch
>
>
> The related conversation on the zookeeper user mailing list is here - 
> http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user
> It is useful to be able to monitor the number of disconnect operations on a 
> client. This is generally indicative of a client going through a large 
> number of GCs and hence disconnecting far too often from a ZooKeeper 
> cluster. 
> Today, this information is only indirectly exposed as part of the stat 
> command, which requires counting the results. That's a lot of work for the 
> server to do just to get a connection count. 
> For monitoring purposes, it will be useful to have this exposed through JMX 
> and 4lw srvr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1294) One of the zookeeper server is not accepting any requests

2012-01-12 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185454#comment-13185454
 ] 

Henry Robinson commented on ZOOKEEPER-1294:
---

So, after some investigation, I've found out what's happening with 
testNoLogBeforeLeaderEstablishment.

The patch changes the locking in Leader.java; now the lock around the 
sync-and-ping loop is on the forwardingFollowers set. The call to ping() with 
that lock held then takes the lock on the leader object. 

In the failing test runs, at the same time the ProposalRequestProcessor has 
locked the leader object in order to make a proposal in Leader.propose(). This 
then calls sendPacket, which (tries to) lock on forwardingFollowers. 

This is a classic deadlock - the threads try to take the same locks in 
different orders. Although there are a few options, I think the patch 
*shouldn't* be changing the set to forwardingFollowers, but should be using 
learners as before. This is because observers should be pinged as well, I 
think, so that they don't conclude the leader is dead. Instead, the code 
should explicitly test whether a learner is a PARTICIPANT, as below:

{code}
synchronized (learners) {
    for (LearnerHandler f : learners) {
        if (f.synced() && f.getLearnerType() == LearnerType.PARTICIPANT) {
            syncedCount++;
            syncedSet.add(f.getSid());
        }
        f.ping();
    }
}
{code}

So only participants get added to the synced set, but every learner gets 
pinged. This seems to fix the problem with this test, at least for me. Any 
thoughts?
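To make the lock cycle concrete, here is a minimal sketch of the two opposing lock orders described above (leaderLock and forwardingFollowers are stand-in monitor objects mirroring the discussion, not the actual ZooKeeper fields):

```java
import java.util.ArrayList;
import java.util.List;

public class LockOrderSketch {
    static final Object leaderLock = new Object();
    static final Object forwardingFollowers = new Object();

    // Sync-and-ping path: forwardingFollowers first, then leaderLock.
    static List<String> pingLoop() {
        List<String> order = new ArrayList<String>();
        synchronized (forwardingFollowers) {
            order.add("forwardingFollowers");
            synchronized (leaderLock) { // ping() takes the leader monitor
                order.add("leaderLock");
            }
        }
        return order;
    }

    // Proposal path: leaderLock first, then forwardingFollowers - reversed.
    static List<String> propose() {
        List<String> order = new ArrayList<String>();
        synchronized (leaderLock) {
            order.add("leaderLock");
            synchronized (forwardingFollowers) { // sendPacket() takes the set's monitor
                order.add("forwardingFollowers");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Run sequentially here, which is safe; if two threads ran these
        // paths concurrently, each could hold its first monitor while
        // blocking forever on the other's.
        System.out.println(pingLoop());
        System.out.println(propose());
    }
}
```

The fix quoted above sidesteps the cycle by keeping the single lock on learners, so neither path ever needs to acquire a second monitor in a conflicting order.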

> One of the zookeeper server is not accepting any requests
> -
>
> Key: ZOOKEEPER-1294
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1294
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
> Environment: 3 Zookeeper + 3 Observer with SuSe-11
>Reporter: amith
>Assignee: kavita sharma
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1294-1.patch, ZOOKEEPER-1294.patch
>
>
> In zoo.cfg i have configured as
> server.1 = XX.XX.XX.XX:65175:65173
> server.2 = XX.XX.XX.XX:65185:65183
> server.3 = XX.XX.XX.XX:65195:65193
> server.4 = XX.XX.XX.XX:65205:65203:observer
> server.5 = XX.XX.XX.XX:65215:65213:observer
> server.6 = XX.XX.XX.XX:65225:65223:observer
> As above, I have configured 3 PARTICIPANTS and 3 OBSERVERS
> in a cluster of 6 ZooKeeper servers.
> Steps to reproduce the defect:
> 1. Start all 3 participant zookeepers
> 2. Stop all the participant zookeepers
> 3. Start zookeeper 1 (Participant)
> 4. Start zookeeper 2 (Participant)
> 5. Start zookeeper 4 (Observer)
> 6. Create a persistent node with an external client and close it
> 7. Stop zookeeper 1 (Participant, so the quorum is unstable)
> 8. Create a new client and try to find the node created before using the 
> exists API (this will fail since the quorum is not satisfied)
> 9. Start zookeeper 1 (Participant, stabilising the quorum)
> Now check the observer using 4 letter word (Server.4)
> linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin # echo stat 
> | netcat localhost 65200
> Zookeeper version: 3.3.2-1031432, built on 11/05/2010 05:32 GMT
> Clients:
>  /127.0.0.1:46370[0](queued=0,recved=1,sent=0)
> Latency min/avg/max: 0/0/0
> Received: 1
> Sent: 0
> Outstanding: 0
> Zxid: 0x10003
> Mode: observer
> Node count: 5
> check the participant 2 with 4 letter word
> Latency min/avg/max: 22/48/83
> Received: 39
> Sent: 3
> Outstanding: 35
> Zxid: 0x10003
> Mode: leader
> Node count: 5
> linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin #
> check the participant 1 with 4 letter word
> linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin # echo stat 
> | netcat localhost 65170
> This ZooKeeper instance is not currently serving requests
> We can see the participant1 logs filled with
> 2011-11-08 15:49:51,360 - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:65170:NIOServerCnxn@642] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> The problem here is that participant1 is not responding to or accepting any 
> requests

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1294) One of the zookeeper server is not accepting any requests

2012-01-11 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184762#comment-13184762
 ] 

Henry Robinson commented on ZOOKEEPER-1294:
---

These failures are legit; I'm looking into them now. 


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1318) In Python binding, get_children (and get and exists, and probably others) with expired session doesn't raise exception properly

2012-01-03 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179118#comment-13179118
 ] 

Henry Robinson commented on ZOOKEEPER-1318:
---

Hi Jim - 

Good catch! By all means, feel free to work on a patch :)

Thanks,

Henry

> In Python binding, get_children (and get and exists, and probably others) 
> with expired session doesn't raise exception properly
> ---
>
> Key: ZOOKEEPER-1318
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1318
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: contrib-bindings
>Affects Versions: 3.3.3
> Environment: Mac OS X (at least)
>Reporter: Jim Fulton
>
> In Python binding, get_children (and get and exists, and probably others) 
> with expired session doesn't raise exception properly.
> >>> zookeeper.state(h)
> -112
> >>> zookeeper.get_children(h, '/')
> Traceback (most recent call last):
>   File "", line 1, in 
> SystemError: error return without exception set
> Let me know if you'd like me to work on a patch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1314) improve zkpython synchronous api implementation

2011-12-09 Thread Henry Robinson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13166457#comment-13166457
 ] 

Henry Robinson commented on ZOOKEEPER-1314:
---

Hi Daniel - 

Looks good, I didn't know about Py_BEGIN_ALLOW_THREADS!

Just one comment:

{code}
   if (err != ZOK) {
 PyErr_SetString(err_to_exception(err), zerror(err));
-return NULL;
+goto cleanup;
   }
 
-  return Py_BuildValue("s", realbuf);
+  returnval = Py_BuildValue("s", realbuf);
+cleanup:
+  free(realbuf);
+  return returnval;
{code}

I'd prefer:

{code}
   if (err != ZOK) {
 PyErr_SetString(err_to_exception(err), zerror(err));
   } else {
 returnval = Py_BuildValue("s", realbuf);
   }
  
+  free(realbuf);
+  return returnval;
{code}

if that's functionally equivalent. 


> improve zkpython synchronous api implementation
> ---
>
> Key: ZOOKEEPER-1314
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1314
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: contrib-bindings
>Affects Versions: 3.3.3
>Reporter: Daniel Lescohier
>Assignee: Daniel Lescohier
>Priority: Minor
> Attachments: ZOOKEEPER-1314.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Improves the following items in zkpython which are related to the Zookeeper 
> synchronous API:
> # For pyzoo_create, no longer limit the returned znode name to 256 bytes; 
> dynamically allocate memory on the heap.
> # For all the synchronous api calls, release the Python Global Interpreter 
> Lock just before doing the synchronous call.
> I will attach the patch shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira