[jira] [Commented] (ZOOKEEPER-1435) cap space usage of default log4j rolling policy

2012-03-29 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241499#comment-13241499
 ] 

Mahadev konar commented on ZOOKEEPER-1435:
--

+1 for the patch. Looks good to me!

 cap space usage of default log4j rolling policy
 ---

 Key: ZOOKEEPER-1435
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1435
 Project: ZooKeeper
  Issue Type: Improvement
  Components: scripts
Affects Versions: 3.4.3, 3.3.5, 3.5.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1435.patch


 Our current log4j log rolling policy (for ROLLINGFILE) doesn't cap the maximum 
 logging space used. This can be a problem in production systems. See the similar 
 improvement recently made in Hadoop: HADOOP-8149.
 For ROLLINGFILE only, I believe we should change the default threshold to 
 INFO and cap the max space to something reasonable, say 5 GB (max file size of 
 256 MB, max file count of 20). These will be the defaults in log4j.properties, 
 which you would also be able to override from the command line.
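 For illustration only, a minimal log4j.properties sketch of what such defaults 
 could look like (the property names are standard log4j 1.2 RollingFileAppender 
 options; the ${zookeeper.log.dir}/${zookeeper.log.file} variables and the layout 
 are assumptions for the sketch, not the committed patch):
 {noformat}
 # ROLLINGFILE capped at roughly 5 GB total: 256 MB per file, 20 files kept
 log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
 log4j.appender.ROLLINGFILE.Threshold=INFO
 log4j.appender.ROLLINGFILE.File=${zookeeper.log.dir}/${zookeeper.log.file}
 log4j.appender.ROLLINGFILE.MaxFileSize=256MB
 log4j.appender.ROLLINGFILE.MaxBackupIndex=20
 log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
 log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} - %-5p [%t:%C{1}@%L] - %m%n
 {noformat}
 Values such as the threshold could be exposed through log4j's ${...} 
 system-property substitution so they remain overridable from the command line.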

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1433) improve ZxidRolloverTest (test seems flakey)

2012-03-29 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241504#comment-13241504
 ] 

Mahadev konar commented on ZOOKEEPER-1433:
--

+1 looks good to me... Thanks for fixing this Pat!

 improve ZxidRolloverTest (test seems flakey)
 

 Key: ZOOKEEPER-1433
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1433
 Project: ZooKeeper
  Issue Type: Improvement
  Components: tests
Affects Versions: 3.3.5
Reporter: Wing Yew Poon
Assignee: Patrick Hunt
 Fix For: 3.3.6, 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1433.patch, ZOOKEEPER-1433_test.out


 In our jenkins job to run the ZooKeeper unit tests, 
 org.apache.zookeeper.server.ZxidRolloverTest sometimes fails.
 E.g.,
 {noformat}
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /foo0
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:843)
   at 
 org.apache.zookeeper.server.ZxidRolloverTest.checkNodes(ZxidRolloverTest.java:154)
   at 
 org.apache.zookeeper.server.ZxidRolloverTest.testRolloverThenRestart(ZxidRolloverTest.java:211)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2012-03-15 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229941#comment-13229941
 ] 

Mahadev konar commented on ZOOKEEPER-1277:
--

+1 on the patches. Looked through all 3. Good to go! Thanks Pat!

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
 Fix For: 3.3.5, 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, 
 ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, 
 ZOOKEEPER-1277_br34.patch, ZOOKEEPER-1277_br34.patch, 
 ZOOKEEPER-1277_trunk.patch, ZOOKEEPER-1277_trunk.patch


 When the lower 32 bits of a zxid roll over (the zxid is a 64-bit number whose 
 upper 32 bits are considered the epoch number), the epoch number (upper 32 
 bits) is incremented and the lower 32 bits start at 0 again.
 This should work fine; however, in the current 3.3 branch the followers see 
 this as a NEWLEADER message, which it's not, and effectively stop serving 
 clients. Attached clients seem to eventually time out given that heartbeats 
 (or any operation) are no longer processed. The follower doesn't recover from 
 this.
 I've tested this out on the 3.3 branch and confirmed this problem; however, I 
 haven't tried it on 3.4/3.5. It may not happen on the newer branches due to 
 ZOOKEEPER-335, but there is certainly an issue with updating the 
 acceptedEpoch files contained in the datadir. (I'll enter a separate jira 
 for that.)
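 For illustration, a small standalone sketch of the epoch/counter split described 
 above (helper names are illustrative, not ZooKeeper's actual classes):
 {code}
 public final class ZxidBits {
     // Upper 32 bits of the 64-bit zxid carry the epoch.
     static long epoch(long zxid)   { return zxid >>> 32; }
     // Lower 32 bits carry the per-epoch counter.
     static long counter(long zxid) { return zxid & 0xffffffffL; }
     // After the counter rolls over, the next zxid starts a new epoch at counter 0.
     static long nextEpochStart(long zxid) { return (epoch(zxid) + 1) << 32; }

     public static void main(String[] args) {
         long z = (5L << 32) | 0xffffffffL;             // epoch 5, counter at its maximum
         System.out.println(epoch(z));                   // 5
         System.out.println(counter(z));                 // 4294967295
         System.out.println(Long.toHexString(nextEpochStart(z))); // 600000000
     }
 }
 {code}
 The bug is about how followers react to that epoch bump, not the arithmetic 
 itself, which is as simple as shown.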

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

2012-03-14 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229708#comment-13229708
 ] 

Mahadev konar commented on ZOOKEEPER-1277:
--

Ahh... That makes more sense! Updated comments would be good. Thanks!

 servers stop serving when lower 32bits of zxid roll over
 

 Key: ZOOKEEPER-1277
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
 Fix For: 3.3.6

 Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch


 When the lower 32 bits of a zxid roll over (the zxid is a 64-bit number whose 
 upper 32 bits are considered the epoch number), the epoch number (upper 32 
 bits) is incremented and the lower 32 bits start at 0 again.
 This should work fine; however, in the current 3.3 branch the followers see 
 this as a NEWLEADER message, which it's not, and effectively stop serving 
 clients. Attached clients seem to eventually time out given that heartbeats 
 (or any operation) are no longer processed. The follower doesn't recover from 
 this.
 I've tested this out on the 3.3 branch and confirmed this problem; however, I 
 haven't tried it on 3.4/3.5. It may not happen on the newer branches due to 
 ZOOKEEPER-335, but there is certainly an issue with updating the 
 acceptedEpoch files contained in the datadir. (I'll enter a separate jira 
 for that.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override

2012-02-06 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201136#comment-13201136
 ] 

Mahadev konar commented on ZOOKEEPER-1373:
--

The Javadoc warning is due to ZOOKEEPER-1386.

 Hardcoded SASL login context name clashes with Hadoop security configuration 
 override
 -

 Key: ZOOKEEPER-1373
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.2
Reporter: Thomas Weise
Assignee: Eugene Koontz
 Fix For: 3.4.3, 3.5.0

 Attachments: ZOOKEEPER-1373-TW_3_4.patch, ZOOKEEPER-1373.patch, 
 ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, 
 ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch


 I'm trying to configure a process with Hadoop security (Hive metastore 
 server) to talk to ZooKeeper 3.4.2 with Kerberos authentication. In this 
 scenario Hadoop controls the SASL configuration 
 (org.apache.hadoop.security.UserGroupInformation.HadoopConfiguration), 
 instead of setting up the ZooKeeper Client loginContext via jaas.conf and 
 system property 
 {{-Djava.security.auth.login.config}}
 Using the Hadoop configuration would work, except that ZooKeeper client code 
 expects the loginContextName to be "Client" while Hadoop security will use 
 "hadoop-keytab-kerberos". I verified that by changing the name in the 
 debugger the SASL authentication succeeds, while otherwise the login 
 configuration cannot be resolved and the connection to ZooKeeper is 
 unauthenticated. 
 To integrate with Hadoop, the following in ZooKeeperSaslClient would need to 
 change to make the name configurable:
  {{login = new Login("Client", new ClientCallbackHandler(null));}}
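 As a standalone illustration of the configurable-name idea (the property name 
 zookeeper.sasl.clientconfig and the "Client" default are assumptions for this 
 sketch, not necessarily what the final patch uses):
 {code}
 public class SaslContextNameDemo {
     // Resolve the JAAS login context name from a system property,
     // falling back to the historical hardcoded "Client".
     static String loginContextName() {
         return System.getProperty("zookeeper.sasl.clientconfig", "Client");
     }

     public static void main(String[] args) {
         System.setProperty("zookeeper.sasl.clientconfig", "hadoop-keytab-kerberos");
         System.out.println(loginContextName());   // prints: hadoop-keytab-kerberos
         // In ZooKeeperSaslClient the result would replace the hardcoded name:
         //   login = new Login(loginContextName(), new ClientCallbackHandler(null));
     }
 }
 {code}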

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override

2012-02-06 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201139#comment-13201139
 ] 

Mahadev konar commented on ZOOKEEPER-1373:
--

@Eugene,
 The patch looks good, but we should work on cleaning up the security stuff a 
little. One thing would be to make ClientCnxn a little more modular and not pass it 
around everywhere (like we do in ZKSaslClient). Anyway, that's for later. I'll go 
ahead and commit this for now. 

 Hardcoded SASL login context name clashes with Hadoop security configuration 
 override
 -

 Key: ZOOKEEPER-1373
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.2
Reporter: Thomas Weise
Assignee: Eugene Koontz
 Fix For: 3.4.3, 3.5.0

 Attachments: ZOOKEEPER-1373-TW_3_4.patch, ZOOKEEPER-1373.patch, 
 ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, 
 ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch


 I'm trying to configure a process with Hadoop security (Hive metastore 
 server) to talk to ZooKeeper 3.4.2 with Kerberos authentication. In this 
 scenario Hadoop controls the SASL configuration 
 (org.apache.hadoop.security.UserGroupInformation.HadoopConfiguration), 
 instead of setting up the ZooKeeper Client loginContext via jaas.conf and 
 system property 
 {{-Djava.security.auth.login.config}}
 Using the Hadoop configuration would work, except that ZooKeeper client code 
 expects the loginContextName to be "Client" while Hadoop security will use 
 "hadoop-keytab-kerberos". I verified that by changing the name in the 
 debugger the SASL authentication succeeds, while otherwise the login 
 configuration cannot be resolved and the connection to ZooKeeper is 
 unauthenticated. 
 To integrate with Hadoop, the following in ZooKeeperSaslClient would need to 
 change to make the name configurable:
  {{login = new Login("Client", new ClientCallbackHandler(null));}}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override

2012-02-06 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201461#comment-13201461
 ] 

Mahadev konar commented on ZOOKEEPER-1373:
--

@Thomas,
 Yes, the RC is up. Can you try it out: 
http://people.apache.org/~mahadev/zookeeper-3.4.3-candidate-0/

 Hardcoded SASL login context name clashes with Hadoop security configuration 
 override
 -

 Key: ZOOKEEPER-1373
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.2
Reporter: Thomas Weise
Assignee: Eugene Koontz
 Fix For: 3.4.3, 3.5.0

 Attachments: ZOOKEEPER-1373-TW_3_4.patch, ZOOKEEPER-1373.patch, 
 ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, 
 ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch


 I'm trying to configure a process with Hadoop security (Hive metastore 
 server) to talk to ZooKeeper 3.4.2 with Kerberos authentication. In this 
 scenario Hadoop controls the SASL configuration 
 (org.apache.hadoop.security.UserGroupInformation.HadoopConfiguration), 
 instead of setting up the ZooKeeper Client loginContext via jaas.conf and 
 system property 
 {{-Djava.security.auth.login.config}}
 Using the Hadoop configuration would work, except that ZooKeeper client code 
 expects the loginContextName to be "Client" while Hadoop security will use 
 "hadoop-keytab-kerberos". I verified that by changing the name in the 
 debugger the SASL authentication succeeds, while otherwise the login 
 configuration cannot be resolved and the connection to ZooKeeper is 
 unauthenticated. 
 To integrate with Hadoop, the following in ZooKeeperSaslClient would need to 
 change to make the name configurable:
  {{login = new Login("Client", new ClientCallbackHandler(null));}}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1322) Cleanup/fix logging in Quorum code.

2012-02-05 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201072#comment-13201072
 ] 

Mahadev konar commented on ZOOKEEPER-1322:
--

Pat,
 Went through the patch. Looks harmless to me. Kicking off Hudson again to run 
the patch through.

 Cleanup/fix logging in Quorum code.
 ---

 Key: ZOOKEEPER-1322
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1322
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.4.3, 3.5.0

 Attachments: ZOOKEEPER-1322_br34.patch, ZOOKEEPER-1322_trunk.patch


 While triaging ZOOKEEPER-1319 I updated the code with the attached patch in 
 order to help debug what was going on with that issue. I think it would be 
 useful to include these changes in the project itself. Feel free to include this in 
 3.4.1 or push it to 3.5.0.
 You should verify this with TRACE logging turned on in addition to INFO 
 (default).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1353) C client test suite fails consistently

2012-02-05 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201110#comment-13201110
 ] 

Mahadev konar commented on ZOOKEEPER-1353:
--

Thanks for pointing this out (and also for the patch) Clint.

 C client test suite fails consistently
 --

 Key: ZOOKEEPER-1353
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1353
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client, tests
Affects Versions: 3.3.4
 Environment: Ubuntu precise (dev release), amd64
Reporter: Clint Byrum
Assignee: Clint Byrum
Priority: Minor
  Labels: patch, test
 Fix For: 3.3.5, 3.4.3, 3.5.0

 Attachments: fix-broken-c-client-unittest.patch, 
 fix-broken-c-client-unittest.patch

   Original Estimate: 5m
  Remaining Estimate: 5m

 When the C client test suite, zktest-mt, is run, it fails with this:
 tests/TestZookeeperInit.cc:233: Assertion: equality assertion failed 
 [Expected: 2, Actual  : 22]
 This was also reported in 3.3.1 here:
 http://www.mail-archive.com/zookeeper-dev@hadoop.apache.org/msg08914.html
 The C client tests are making some assumptions that are not valid. 
 getaddrinfo may have, at one time, returned ENOENT instead of EINVAL for the 
 host given in the test. The assertion should simply be that EINVAL | ENOENT 
 are given, so that builds on platforms which return ENOENT for this are not 
 broken.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-31 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197056#comment-13197056
 ] 

Mahadev konar commented on ZOOKEEPER-1367:
--

Great. Go ahead and upload. I'll commit it to the 3.3 branch.

 Data inconsistencies and unexpired ephemeral nodes after cluster restart
 

 Key: ZOOKEEPER-1367
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.2
 Environment: Debian Squeeze, 64-bit
Reporter: Jeremy Stribling
Assignee: Benjamin Reed
Priority: Blocker
 Fix For: 3.3.5, 3.4.3, 3.5.0

 Attachments: 1367-3.3.patch, ZOOKEEPER-1367-3.4.patch, 
 ZOOKEEPER-1367.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz


 In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
 all three, and then restart just two of them.  Sometimes we notice that on 
 one of the restarted servers, ephemeral nodes from previous sessions do not 
 get deleted, while on the other server they do.  We are effectively running 
 3.4.2, though technically we are running 3.4.1 with the patch manually 
 applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
 ZOOKEEPER-1163.
 I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
 zkid 84), I saw only one znode in a particular path:
 {quote}
 [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
 [nominee11]
 [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 {quote}
 However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
 I saw three znodes under that same path:
 {quote}
 [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
 nominee06   nominee10   nominee11
 [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
 90.0.0.221: 
 cZxid = 0x3014c
 ctime = Thu Jan 19 07:53:42 UTC 2012
 mZxid = 0x3014c
 mtime = Thu Jan 19 07:53:42 UTC 2012
 pZxid = 0x3014c
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc22
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
 90.0.0.223: 
 cZxid = 0x20cab
 ctime = Thu Jan 19 08:00:30 UTC 2012
 mZxid = 0x20cab
 mtime = Thu Jan 19 08:00:30 UTC 2012
 pZxid = 0x20cab
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x5434f5074e040002
 dataLength = 16
 numChildren = 0
 {quote}
 These never went away for the lifetime of the server, for any clients 
 connected directly to that server.  Note that this cluster is configured to 
 have all three servers still, the third one being down (90.0.0.223, zkid 162).
 I captured the data/snapshot directories for the two live servers.  When 
 I start single-node servers using each directory, I can briefly see that the 
 inconsistent data is present in those logs, though the ephemeral nodes seem 
 to get (correctly) cleaned up pretty soon after I start the server.
 I will upload a tar containing the debug logs and data directories from the 
 failure.  I think we can reproduce it regularly if you need more info.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-30 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196739#comment-13196739
 ] 

Mahadev konar commented on ZOOKEEPER-1367:
--

Thanks for confirming, Jeremy. I'll check this in now. The patch looks good to me, 
though I think we need to clean up our classes so that we have cleaner 
separation of what ZKS should be exposing and what ZKDatabase should be 
exposing.


 Data inconsistencies and unexpired ephemeral nodes after cluster restart
 

 Key: ZOOKEEPER-1367
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.2
 Environment: Debian Squeeze, 64-bit
Reporter: Jeremy Stribling
Assignee: Benjamin Reed
Priority: Blocker
 Fix For: 3.4.3

 Attachments: 1367-3.3.patch, ZOOKEEPER-1367-3.4.patch, 
 ZOOKEEPER-1367.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz


 In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
 all three, and then restart just two of them.  Sometimes we notice that on 
 one of the restarted servers, ephemeral nodes from previous sessions do not 
 get deleted, while on the other server they do.  We are effectively running 
 3.4.2, though technically we are running 3.4.1 with the patch manually 
 applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
 ZOOKEEPER-1163.
 I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
 zkid 84), I saw only one znode in a particular path:
 {quote}
 [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
 [nominee11]
 [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 {quote}
 However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
 I saw three znodes under that same path:
 {quote}
 [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
 nominee06   nominee10   nominee11
 [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
 90.0.0.221: 
 cZxid = 0x3014c
 ctime = Thu Jan 19 07:53:42 UTC 2012
 mZxid = 0x3014c
 mtime = Thu Jan 19 07:53:42 UTC 2012
 pZxid = 0x3014c
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc22
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
 90.0.0.223: 
 cZxid = 0x20cab
 ctime = Thu Jan 19 08:00:30 UTC 2012
 mZxid = 0x20cab
 mtime = Thu Jan 19 08:00:30 UTC 2012
 pZxid = 0x20cab
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x5434f5074e040002
 dataLength = 16
 numChildren = 0
 {quote}
 These never went away for the lifetime of the server, for any clients 
 connected directly to that server.  Note that this cluster is configured to 
 have all three servers still, the third one being down (90.0.0.223, zkid 162).
 I captured the data/snapshot directories for the two live servers.  When 
 I start single-node servers using each directory, I can briefly see that the 
 inconsistent data is present in those logs, though the ephemeral nodes seem 
 to get (correctly) cleaned up pretty soon after I start the server.
 I will upload a tar containing the debug logs and data directories from the 
 failure.  I think we can reproduce it regularly if you need more info.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override

2012-01-30 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196748#comment-13196748
 ] 

Mahadev konar commented on ZOOKEEPER-1373:
--

I just hate the way Review Board updates the comments. Looking at the patch now.

 Hardcoded SASL login context name clashes with Hadoop security configuration 
 override
 -

 Key: ZOOKEEPER-1373
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.2
Reporter: Thomas Weise
Assignee: Eugene Koontz
 Fix For: 3.4.3, 3.5.0

 Attachments: ZOOKEEPER-1373-TW_3_4.patch, ZOOKEEPER-1373.patch, 
 ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch


 I'm trying to configure a process with Hadoop security (Hive metastore 
 server) to talk to ZooKeeper 3.4.2 with Kerberos authentication. In this 
 scenario Hadoop controls the SASL configuration 
 (org.apache.hadoop.security.UserGroupInformation.HadoopConfiguration), 
 instead of setting up the ZooKeeper Client loginContext via jaas.conf and 
 system property 
 {{-Djava.security.auth.login.config}}
 Using the Hadoop configuration would work, except that ZooKeeper client code 
 expects the loginContextName to be "Client" while Hadoop security will use 
 "hadoop-keytab-kerberos". I verified that by changing the name in the 
 debugger the SASL authentication succeeds, while otherwise the login 
 configuration cannot be resolved and the connection to ZooKeeper is 
 unauthenticated. 
 To integrate with Hadoop, the following in ZooKeeperSaslClient would need to 
 change to make the name configurable:
  {{login = new Login("Client", new ClientCallbackHandler(null));}}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override

2012-01-30 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196756#comment-13196756
 ] 

Mahadev konar commented on ZOOKEEPER-1373:
--

Took a look at the patch. It looks good overall; I like the new test cases. Some 
minor nits: I think the ClientCnxn code needs to move out a little (ClientCnxn 
is getting too huge). Can we do a helper class for security? Something like 
ZooKeeperSecureUtil where all this code can reside (creating a ZK SASL 
client?). Also, it's a little painful to see all the config property names spread 
around. This is probably another jira where we move all the properties into a 
single place so that we don't have to go hunting around for our config properties.

 Hardcoded SASL login context name clashes with Hadoop security configuration 
 override
 -

 Key: ZOOKEEPER-1373
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.2
Reporter: Thomas Weise
Assignee: Eugene Koontz
 Fix For: 3.4.3, 3.5.0

 Attachments: ZOOKEEPER-1373-TW_3_4.patch, ZOOKEEPER-1373.patch, 
 ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch, ZOOKEEPER-1373.patch


 I'm trying to configure a process with Hadoop security (Hive metastore 
 server) to talk to ZooKeeper 3.4.2 with Kerberos authentication. In this 
 scenario Hadoop controls the SASL configuration 
 (org.apache.hadoop.security.UserGroupInformation.HadoopConfiguration), 
 instead of setting up the ZooKeeper Client loginContext via jaas.conf and 
 system property 
 {{-Djava.security.auth.login.config}}
 Using the Hadoop configuration would work, except that ZooKeeper client code 
 expects the loginContextName to be "Client" while Hadoop security will use 
 "hadoop-keytab-kerberos". I verified that by changing the name in the 
 debugger the SASL authentication succeeds, while otherwise the login 
 configuration cannot be resolved and the connection to ZooKeeper is 
 unauthenticated. 
 To integrate with Hadoop, the following in ZooKeeperSaslClient would need to 
 change to make the name configurable:
  {{login = new Login("Client", new ClientCallbackHandler(null));}}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2012-01-27 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195026#comment-13195026
 ] 

Mahadev konar commented on ZOOKEEPER-1366:
--

Pat/Ben,
 I think the issue here is the Clock API. The static API is what ruins mocking 
here. In Hadoop we make sure we pass around the same clock object when 
creating all the subsequent objects (the constructs in MR next gen are more DI 
compliant). We could try doing that here, but again I think it's a bit of an 
effort (it would be manual work). But as Henry/Camille mentioned, we could do that 
in another jira. I think that's the right solution instead of creating another 
layer which hides the longs (as Pat suggested).

 Zookeeper should be tolerant of clock adjustments
 -

 Key: ZOOKEEPER-1366
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Ted Dunning
Assignee: Ted Dunning
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch


 If you want to wreak havoc on a ZK-based system, just do "date -s +1hour" 
 and watch the mayhem as all sessions expire at once.
 This shouldn't happen.  Zookeeper could easily handle elapsed times as 
 elapsed times rather than as differences between absolute times.  The 
 absolute times are subject to adjustment when the clock is set, while a timer 
 is not subject to this problem.  In Java, System.currentTimeMillis() gives 
 you absolute time while System.nanoTime() gives you time based on a timer 
 from an arbitrary epoch.
 I have done this and have been running tests now for some tens of minutes 
 with no failures.  I will set up a test machine to redo the build again on 
 Ubuntu and post a patch here for discussion.
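 A minimal sketch of the elapsed-time idea (demo code only, not the attached patch):
 {code}
 public class ElapsedTimeDemo {
     public static void main(String[] args) throws InterruptedException {
         long startWall = System.currentTimeMillis(); // wall clock, subject to adjustment
         long startMono = System.nanoTime();          // monotonic timer, arbitrary epoch

         Thread.sleep(100);

         long wallElapsedMs = System.currentTimeMillis() - startWall;
         long monoElapsedMs = (System.nanoTime() - startMono) / 1_000_000L;

         // Both print roughly 100 here, but only the nanoTime figure stays
         // correct if the system clock is stepped forward or back in between.
         System.out.println("wall-clock elapsed ms: " + wallElapsedMs);
         System.out.println("monotonic elapsed ms:  " + monoElapsedMs);
     }
 }
 {code}
 Timeouts computed from the second form are immune to a "date -s +1hour" step.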

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-27 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195373#comment-13195373
 ] 

Mahadev konar commented on ZOOKEEPER-1367:
--

From https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/928//testReport/

{code}
org.apache.zookeeper.server.quorum.LearnerTest.syncTest
Failing for the past 1 build (Since #928 )
Took 74 ms.
Stacktrace
java.lang.NullPointerException
at 
org.apache.zookeeper.server.quorum.LearnerZooKeeperServer.createSessionTracker(LearnerZooKeeperServer.java:73)
at 
org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:355)
at 
org.apache.zookeeper.server.quorum.LearnerTest.syncTest(LearnerTest.java:114)
at 
org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
{code}


 Data inconsistencies and unexpired ephemeral nodes after cluster restart
 

 Key: ZOOKEEPER-1367
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.2
 Environment: Debian Squeeze, 64-bit
Reporter: Jeremy Stribling
Priority: Blocker
 Fix For: 3.4.3

 Attachments: ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz


 In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
 all three, and then restart just two of them.  Sometimes we notice that on 
 one of the restarted servers, ephemeral nodes from previous sessions do not 
 get deleted, while on the other server they do.  We are effectively running 
 3.4.2, though technically we are running 3.4.1 with the patch manually 
 applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
 ZOOKEEPER-1163.
 I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
 zkid 84), I saw only one znode in a particular path:
 {quote}
 [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
 [nominee11]
 [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 {quote}
 However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
 I saw three znodes under that same path:
 {quote}
 [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
 nominee06   nominee10   nominee11
 [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
 90.0.0.221: 
 cZxid = 0x3014c
 ctime = Thu Jan 19 07:53:42 UTC 2012
 mZxid = 0x3014c
 mtime = Thu Jan 19 07:53:42 UTC 2012
 pZxid = 0x3014c
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc22
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
 90.0.0.223: 
 cZxid = 0x20cab
 ctime = Thu Jan 19 08:00:30 UTC 2012
 mZxid = 0x20cab
 mtime = Thu Jan 19 08:00:30 UTC 2012
 pZxid = 0x20cab
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x5434f5074e040002
 dataLength = 16
 numChildren = 0
 {quote}
 These never went away for the lifetime of the server, for any clients 
 connected directly to that server.  Note that this cluster is configured to 
 have all three servers still, the third one being down (90.0.0.223, zkid 162).
 I captured the data/snapshot directories for the two live servers.  When 
 I start single-node servers using each directory, I can briefly see that the 
 inconsistent data is present in those logs, though the ephemeral nodes seem 
 to get (correctly) cleaned up pretty soon after I start the server.
 I will upload a tar containing the debug logs and data directories from the 
 failure.  I think we can reproduce it regularly if you need more info.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-27 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195375#comment-13195375
 ] 

Mahadev konar commented on ZOOKEEPER-1367:
--

@Ben/Jeremy,
 I'll kick off a 3.4.3 release with this patch and ZOOKEEPER-1373. 

 Data inconsistencies and unexpired ephemeral nodes after cluster restart
 

 Key: ZOOKEEPER-1367
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.2
 Environment: Debian Squeeze, 64-bit
Reporter: Jeremy Stribling
Priority: Blocker
 Fix For: 3.4.3

 Attachments: ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz


 In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
 all three, and then restart just two of them.  Sometimes we notice that on 
 one of the restarted servers, ephemeral nodes from previous sessions do not 
 get deleted, while on the other server they do.  We are effectively running 
 3.4.2, though technically we are running 3.4.1 with the patch manually 
 applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
 ZOOKEEPER-1163.
 I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
 zkid 84), I saw only one znode in a particular path:
 {quote}
 [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
 [nominee11]
 [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 {quote}
 However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
 I saw three znodes under that same path:
 {quote}
 [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
 nominee06   nominee10   nominee11
 [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
 90.0.0.221: 
 cZxid = 0x3014c
 ctime = Thu Jan 19 07:53:42 UTC 2012
 mZxid = 0x3014c
 mtime = Thu Jan 19 07:53:42 UTC 2012
 pZxid = 0x3014c
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc22
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
 90.0.0.223: 
 cZxid = 0x20cab
 ctime = Thu Jan 19 08:00:30 UTC 2012
 mZxid = 0x20cab
 mtime = Thu Jan 19 08:00:30 UTC 2012
 pZxid = 0x20cab
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x5434f5074e040002
 dataLength = 16
 numChildren = 0
 {quote}
 These never went away for the lifetime of the server, for any clients 
 connected directly to that server.  Note that this cluster is configured to 
 have all three servers still, the third one being down (90.0.0.223, zkid 162).
 I captured the data/snapshot directories for the two live servers.  When 
 I start single-node servers using each directory, I can briefly see that the 
 inconsistent data is present in those logs, though the ephemeral nodes seem 
 to get (correctly) cleaned up pretty soon after I start the server.
 I will upload a tar containing the debug logs and data directories from the 
 failure.  I think we can reproduce it regularly if you need more info.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)

2012-01-25 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193576#comment-13193576
 ] 

Mahadev konar commented on ZOOKEEPER-1355:
--

Ben,
 I was taking a look at it. Mind waiting till tomorrow? 


 Add zk.updateServerList(newServerList) 
 ---

 Key: ZOOKEEPER-1355
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
 Project: ZooKeeper
  Issue Type: New Feature
  Components: java client
Reporter: Alexander Shraer
Assignee: Alexander Shraer
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1355-ver2.patch, ZOOKEEPER-1355-ver4.patch, 
 ZOOKEEPER-1355-ver5.patch, ZOOKEEPER=1355-ver3.patch, 
 ZOOOKEEPER-1355-test.patch, ZOOOKEEPER-1355-ver1.patch, 
 ZOOOKEEPER-1355.patch, loadbalancing-more-details.pdf, loadbalancing.pdf


 When the set of servers changes, we would like to update the server list 
 stored by clients without restarting the clients.
 Moreover, assuming that the number of clients per server is the same (in 
 expectation) in the old configuration (as guaranteed by the current list 
 shuffling for example), we would like to re-balance client connections across 
 the new set of servers in a way that a) the number of clients per server is 
 the same for all servers (in expectation) and b) there is no 
 excessive/unnecessary client migration.
 It is simple to achieve (a) without (b) - just re-shuffle the new list of 
 servers at every client. But this would create unnecessary migration, which 
 we'd like to avoid.
 We propose a simple probabilistic migration scheme that achieves (a) and (b) 
 - each client locally decides whether and where to migrate when the list of 
 servers changes. The attached document describes the scheme and shows an 
 evaluation of it in Zookeeper. We also implemented re-balancing through a 
 consistent-hashing scheme and show a comparison. We derived the probabilistic 
 migration rules from a simple formula that we can also provide, if someone's 
 interested in the proof.
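 As a hedged sketch of one such probabilistic rule, for the special case where 
 servers are only added (the attached documents cover the general scheme; the 
 names and numbers below are illustrative, not the proposed API):
 {code}
 import java.util.*;

 public class RebalanceSketch {
     // With n old servers and m servers after the update, each client migrates
     // with probability 1 - n/m, and a migrating client picks one of the newly
     // added servers uniformly at random. In expectation this equalizes load
     // across all m servers while avoiding unnecessary migration.
     static String pickServer(String current, List<String> oldList,
                              List<String> newList, Random rnd) {
         List<String> added = new ArrayList<>(newList);
         added.removeAll(oldList);
         if (added.isEmpty()) return current;
         double pMove = 1.0 - (double) oldList.size() / newList.size();
         return rnd.nextDouble() < pMove
                 ? added.get(rnd.nextInt(added.size()))
                 : current;
     }

     public static void main(String[] args) {
         List<String> oldList = Arrays.asList("s1", "s2", "s3");
         List<String> newList = Arrays.asList("s1", "s2", "s3", "s4");
         Map<String, Integer> load = new TreeMap<>();
         Random rnd = new Random(42);
         for (int i = 0; i < 40_000; i++) {
             String cur = oldList.get(i % oldList.size()); // clients start evenly spread
             load.merge(pickServer(cur, oldList, newList, rnd), 1, Integer::sum);
         }
         System.out.println(load); // roughly 10,000 clients per server
     }
 }
 {code}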

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)

2012-01-25 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193583#comment-13193583
 ] 

Mahadev konar commented on ZOOKEEPER-1355:
--

Ben/Alex,
 This adds two public APIs to the ZooKeeper handle (Java). Is this intended? What is 
the intent of getCurrentHost? 

Also, I looked at the pdf (which scares me a little - I hate looking at all the 
math symbols :)). Can you please explain in layman's terms what the process is for 
the client to select the server to connect to? What happens if the server list is 
incorrect?

 Add zk.updateServerList(newServerList) 
 ---

 Key: ZOOKEEPER-1355
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
 Project: ZooKeeper
  Issue Type: New Feature
  Components: java client
Reporter: Alexander Shraer
Assignee: Alexander Shraer
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1355-ver2.patch, ZOOKEEPER-1355-ver4.patch, 
 ZOOKEEPER-1355-ver5.patch, ZOOKEEPER=1355-ver3.patch, 
 ZOOOKEEPER-1355-test.patch, ZOOOKEEPER-1355-ver1.patch, 
 ZOOOKEEPER-1355.patch, loadbalancing-more-details.pdf, loadbalancing.pdf


 When the set of servers changes, we would like to update the server list 
 stored by clients without restarting the clients.
 Moreover, assuming that the number of clients per server is the same (in 
 expectation) in the old configuration (as guaranteed by the current list 
 shuffling for example), we would like to re-balance client connections across 
 the new set of servers in a way that a) the number of clients per server is 
 the same for all servers (in expectation) and b) there is no 
 excessive/unnecessary client migration.
 It is simple to achieve (a) without (b) - just re-shuffle the new list of 
 servers at every client. But this would create unnecessary migration, which 
 we'd like to avoid.
 We propose a simple probabilistic migration scheme that achieves (a) and (b) 
 - each client locally decides whether and where to migrate when the list of 
 servers changes. The attached document describes the scheme and shows an 
 evaluation of it in Zookeeper. We also implemented re-balancing through a 
 consistent-hashing scheme and show a comparison. We derived the probabilistic 
 migration rules from a simple formula that we can also provide, if someone's 
 interested in the proof.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1355) Add zk.updateServerList(newServerList)

2012-01-25 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193586#comment-13193586
 ] 

Mahadev konar commented on ZOOKEEPER-1355:
--

One more thing: what about the C client? Will we be seeing similar changes to the C 
client? I'd very much like to keep both of them in sync if possible. We are 
already a little different given the security patches.

 Add zk.updateServerList(newServerList) 
 ---

 Key: ZOOKEEPER-1355
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1355
 Project: ZooKeeper
  Issue Type: New Feature
  Components: java client
Reporter: Alexander Shraer
Assignee: Alexander Shraer
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1355-ver2.patch, ZOOKEEPER-1355-ver4.patch, 
 ZOOKEEPER-1355-ver5.patch, ZOOKEEPER=1355-ver3.patch, 
 ZOOOKEEPER-1355-test.patch, ZOOOKEEPER-1355-ver1.patch, 
 ZOOOKEEPER-1355.patch, loadbalancing-more-details.pdf, loadbalancing.pdf


 When the set of servers changes, we would like to update the server list 
 stored by clients without restarting the clients.
 Moreover, assuming that the number of clients per server is the same (in 
 expectation) in the old configuration (as guaranteed by the current list 
 shuffling for example), we would like to re-balance client connections across 
 the new set of servers in a way that a) the number of clients per server is 
 the same for all servers (in expectation) and b) there is no 
 excessive/unnecessary client migration.
 It is simple to achieve (a) without (b) - just re-shuffle the new list of 
 servers at every client. But this would create unnecessary migration, which 
 we'd like to avoid.
 We propose a simple probabilistic migration scheme that achieves (a) and (b) 
 - each client locally decides whether and where to migrate when the list of 
 servers changes. The attached document describes the scheme and shows an 
 evaluation of it in Zookeeper. We also implemented re-balancing through a 
 consistent-hashing scheme and show a comparison. We derived the probabilistic 
 migration rules from a simple formula that we can also provide, if someone's 
 interested in the proof.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2012-01-23 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191384#comment-13191384
 ] 

Mahadev konar commented on ZOOKEEPER-1366:
--

@Ted,
 Seems like a good change; only one issue I see here. I'd like this to go into 
trunk and not into 3.4 unless it's really a bug. I think 3.4 will take some time 
to stabilize and I would really like to avoid big changes in 3.4. Thoughts?

 Zookeeper should be tolerant of clock adjustments
 -

 Key: ZOOKEEPER-1366
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Ted Dunning
Assignee: Ted Dunning
 Fix For: 3.4.3

 Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch


 If you want to wreak havoc on a ZK-based system, just do "date -s +1hour" 
 and watch the mayhem as all sessions expire at once.
 This shouldn't happen.  Zookeeper could easily handle elapsed times as 
 elapsed times rather than as differences between absolute times.  The 
 absolute times are subject to adjustment when the clock is set, while a timer 
 is not subject to this problem.  In Java, System.currentTimeMillis() gives 
 you absolute time while System.nanoTime() gives you time based on a timer 
 from an arbitrary epoch.
 I have done this and have been running tests now for some tens of minutes 
 with no failures.  I will set up a test machine to redo the build again on 
 Ubuntu and post a patch here for discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1373) Hardcoded SASL login context name clashes with Hadoop security configuration override

2012-01-23 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191858#comment-13191858
 ] 

Mahadev konar commented on ZOOKEEPER-1373:
--

This is a bug. We should fix it to make the login context name configurable.

 Hardcoded SASL login context name clashes with Hadoop security configuration 
 override
 -

 Key: ZOOKEEPER-1373
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1373
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.2
Reporter: Thomas Weise
 Fix For: 3.4.3


 I'm trying to configure a process with Hadoop security (Hive metastore 
 server) to talk to ZooKeeper 3.4.2 with Kerberos authentication. In this 
 scenario Hadoop controls the SASL configuration 
 (org.apache.hadoop.security.UserGroupInformation.HadoopConfiguration), 
 instead of setting up the ZooKeeper Client loginContext via jaas.conf and 
 system property 
 {{-Djava.security.auth.login.config}}
 Using the Hadoop configuration would work, except that ZooKeeper client code 
 expects the loginContextName to be "Client" while Hadoop security will use 
 "hadoop-keytab-kerberos". I verified that by changing the name in the 
 debugger the SASL authentication succeeds, while otherwise the login 
 configuration cannot be resolved and the connection to ZooKeeper is 
 unauthenticated. 
 To integrate with Hadoop, the following in ZooKeeperSaslClient would need to 
 change to make the name configurable:
  {{login = new Login("Client", new ClientCallbackHandler(null));}}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1302) patch to create rpm/deb on 3.3 branch

2012-01-17 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188021#comment-13188021
 ] 

Mahadev konar commented on ZOOKEEPER-1302:
--

Thanks Giri. It might be useful for folks on the 3.3 branch, but as Pat mentioned, 
given that the patch is big, we'll have to skip it for 3.3.

 patch to create rpm/deb on 3.3 branch
 -

 Key: ZOOKEEPER-1302
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1302
 Project: ZooKeeper
  Issue Type: Improvement
  Components: build
Affects Versions: 3.3.3
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Attachments: ZOOKEEPER-999-3.3-with-setupscript-3.patch, 
 zk-1302-1.patch, zk-1302.patch


 backport zookeeper-999 patch to 3.3 branch and add zookeeper-setup-conf.sh to 
 enable zk quorum setup

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1333) NPE in FileTxnSnapLog when restarting a cluster

2011-12-21 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174256#comment-13174256
 ] 

Mahadev konar commented on ZOOKEEPER-1333:
--

@Camille,
 Agreed. I think the patch as it stands is good to go. The only concern I have 
is that the code in processTransaction is pretty convoluted. We should work on 
making it cleaner. I'll add some comments for now when committing.

 NPE in FileTxnSnapLog when restarting a cluster
 ---

 Key: ZOOKEEPER-1333
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1333
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Andrew McNair
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.4.2

 Attachments: ZOOKEEPER-1333.patch, ZOOKEEPER-1333.patch, 
 test_case.diff, test_case.diff


 I think a NPE was created in the fix for 
 https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Looking in DataTree.processTxn(TxnHeader header, Record txn) it seems likely 
 that if rc.err != Code.OK then rc.path will be null. 
 I'm currently working on a minimal test case for the bug, I'll attach it to 
 this issue when it's ready.
 java.lang.NullPointerException
   at 
 org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:203)
   at 
 org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:150)
   at 
 org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
   at 
 org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:418)
   at 
 org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:410)
   at 
 org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
   at 
 org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
   at 
 org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
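 A tiny standalone sketch of the guard the description implies (the class name 
 TxnResult is illustrative, not ZooKeeper's ProcessTxnResult, and this is not the 
 committed fix):
 {code}
 public class NullGuardDemo {
     static final int OK = 0;

     static class TxnResult {
         int err;
         String path;   // may be null when err != OK
     }

     static void handle(TxnResult rc) {
         if (rc.err != OK) {
             // Skip path-based handling entirely; rc.path is never dereferenced.
             System.out.println("txn failed with err " + rc.err + ", skipping");
             return;
         }
         System.out.println("txn applied at " + rc.path);
     }

     public static void main(String[] args) {
         TxnResult failed = new TxnResult();
         failed.err = -110;              // e.g. a "node exists" error, path left null
         handle(failed);

         TxnResult ok = new TxnResult();
         ok.err = OK;
         ok.path = "/some/znode";
         handle(ok);
     }
 }
 {code}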

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1319) Missing data after restarting+expanding a cluster

2011-12-09 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13166420#comment-13166420
 ] 

Mahadev konar commented on ZOOKEEPER-1319:
--

I am going ahead and checking in Pat's patch. I have opened ZOOKEEPER-1324 to 
track the duplicate NEWLEADER packets. Just being paranoid here and making 
minimal changes for the RC.

 Missing data after restarting+expanding a cluster
 -

 Key: ZOOKEEPER-1319
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1319
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.0
 Environment: Linux (Debian Squeeze)
Reporter: Jeremy Stribling
Assignee: Patrick Hunt
Priority: Blocker
  Labels: cluster, data
 Fix For: 3.5.0, 3.4.1

 Attachments: ZOOKEEPER-1319.patch, ZOOKEEPER-1319.patch, 
 ZOOKEEPER-1319_trunk.patch, ZOOKEEPER-1319_trunk2.patch, logs.tgz


 I've been trying to update to ZK 3.4.0 and have had some issues where some 
 data become inaccessible after adding a node to a cluster.  My use case is a 
 bit strange (as explained before on this list) in that I try to grow the 
 cluster dynamically by having an external program automatically restart 
 Zookeeper servers in a controlled way whenever the list of participating ZK 
 servers needs to change.  This used to work just fine in 3.3.3 (and before), 
 so this represents a regression.
 The scenario I see is this:
 1) Start up a 1-server ZK cluster (the server has ZK ID 0).
 2) A client connects to the server, and makes a bunch of znodes, in 
 particular a znode called /membership.
 3) Shut down the cluster.
 4) Bring up a 2-server ZK cluster, including the original server 0 with its 
 existing data, and a new server with ZK ID 1.
 5) Node 0 has the highest zxid and is elected leader.
 6) A client connecting to server 1 tries to get /membership and gets back a 
 -101 error code (no such znode).
 7) The same client then tries to create /membership and gets back a -110 
 error code (znode already exists).
 8) Clients connecting to server 0 can successfully get /membership.
 I will attach a tarball with debug logs for both servers, annotating where 
 steps #1 and #4 happen.  You can see that the election involves a proposal 
 for zxid 110 from server 0, but immediately following the election server 1 
 has these lines:
 2011-12-05 17:18:48,308 9299 [QuorumPeer[myid=1]/127.0.0.1:2901] WARN 
 org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x10001 expected 
 0x1
 2011-12-05 17:18:48,313 9304 [SyncThread:1] INFO 
 org.apache.zookeeper.server.persistence.FileTxnLog  - Creating new log file: 
 log.10001
 Perhaps that's not relevant, but it struck me as odd.  At the end of server 
 1's log you can see a repeated cycle of getData-create-getData as the 
 client tries to make sense of the inconsistent responses.
 The other piece of information is that if I try to use the on-disk 
 directories for either of the servers to start a new one-node ZK cluster, all 
 the data are accessible.
 I haven't tried writing a program outside of my application to reproduce 
 this, but I can do it very easily with some of my app's tests if anyone needs 
 more information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1319) Missing data after restarting+expanding a cluster

2011-12-08 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13165465#comment-13165465
 ] 

Mahadev konar commented on ZOOKEEPER-1319:
--

I am more inclined towards what Flavio mentioned above. To reduce the number of 
changes, I think it's best we don't remove the duplicate NEWLEADER. Ben, any 
thoughts? 

 Missing data after restarting+expanding a cluster
 -

 Key: ZOOKEEPER-1319
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1319
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.0
 Environment: Linux (Debian Squeeze)
Reporter: Jeremy Stribling
Assignee: Patrick Hunt
Priority: Blocker
  Labels: cluster, data
 Fix For: 3.5.0, 3.4.1

 Attachments: ZOOKEEPER-1319.patch, ZOOKEEPER-1319.patch, 
 ZOOKEEPER-1319_trunk.patch, ZOOKEEPER-1319_trunk2.patch, logs.tgz


 I've been trying to update to ZK 3.4.0 and have had some issues where some 
 data become inaccessible after adding a node to a cluster.  My use case is a 
 bit strange (as explained before on this list) in that I try to grow the 
 cluster dynamically by having an external program automatically restart 
 Zookeeper servers in a controlled way whenever the list of participating ZK 
 servers needs to change.  This used to work just fine in 3.3.3 (and before), 
 so this represents a regression.
 The scenario I see is this:
 1) Start up a 1-server ZK cluster (the server has ZK ID 0).
 2) A client connects to the server, and makes a bunch of znodes, in 
 particular a znode called /membership.
 3) Shut down the cluster.
 4) Bring up a 2-server ZK cluster, including the original server 0 with its 
 existing data, and a new server with ZK ID 1.
 5) Node 0 has the highest zxid and is elected leader.
 6) A client connecting to server 1 tries to get /membership and gets back a 
 -101 error code (no such znode).
 7) The same client then tries to create /membership and gets back a -110 
 error code (znode already exists).
 8) Clients connecting to server 0 can successfully get /membership.
 I will attach a tarball with debug logs for both servers, annotating where 
 steps #1 and #4 happen.  You can see that the election involves a proposal 
 for zxid 110 from server 0, but immediately following the election server 1 
 has these lines:
 2011-12-05 17:18:48,308 9299 [QuorumPeer[myid=1]/127.0.0.1:2901] WARN 
 org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x10001 expected 
 0x1
 2011-12-05 17:18:48,313 9304 [SyncThread:1] INFO 
 org.apache.zookeeper.server.persistence.FileTxnLog  - Creating new log file: 
 log.10001
 Perhaps that's not relevant, but it struck me as odd.  At the end of server 
 1's log you can see a repeated cycle of getData-create-getData as the 
 client tries to make sense of the inconsistent responses.
 The other piece of information is that if I try to use the on-disk 
 directories for either of the servers to start a new one-node ZK cluster, all 
 the data are accessible.
 I haven't tried writing a program outside of my application to reproduce 
 this, but I can do it very easily with some of my app's tests if anyone needs 
 more information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-442) need a way to remove watches that are no longer of interest

2011-12-07 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13164886#comment-13164886
 ] 

Mahadev konar commented on ZOOKEEPER-442:
-

@Ben,
 Can you please take a look at this patch? 

 need a way to remove watches that are no longer of interest
 ---

 Key: ZOOKEEPER-442
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-442
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Benjamin Reed
Assignee: Daniel Gómez Ferro
Priority: Critical
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, 
 ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, ZOOKEEPER-442.patch, 
 ZOOKEEPER-442.patch, ZOOKEEPER-442.patch


 Currently the only way a watch is cleared is to trigger it. We need a way to 
 enumerate the outstanding watch objects, find the watch events the objects are 
 watching for, and remove interest in an event.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1319) Missing data after restarting+expanding a cluster

2011-12-07 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13164965#comment-13164965
 ] 

Mahadev konar commented on ZOOKEEPER-1319:
--

Flavio/Pat,
 Ben and I had a long discussion on this.
Here is the gist: there are two NEWLEADER packets, one added when the Leader has 
just become leader and one added in startForwarding, as Flavio mentioned 
above. We need to skip adding the first one (the one in Leader.lead()) to the 
queue of packets to send to the follower. Flavio is right above that if we skip 
adding the NEWLEADER in startForwarding we are good. We need to send the 
NEWLEADER packet in LearnerHandler (line 390), because that marks the end of all 
syncing-up transactions from the Leader to the follower.

Ben has an updated patch and will update the jira soon tonight.

 Missing data after restarting+expanding a cluster
 -

 Key: ZOOKEEPER-1319
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1319
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.0
 Environment: Linux (Debian Squeeze)
Reporter: Jeremy Stribling
Assignee: Patrick Hunt
Priority: Blocker
  Labels: cluster, data
 Fix For: 3.5.0, 3.4.1

 Attachments: ZOOKEEPER-1319.patch, ZOOKEEPER-1319.patch, 
 ZOOKEEPER-1319_trunk.patch, logs.tgz


 I've been trying to update to ZK 3.4.0 and have had some issues where some 
 data become inaccessible after adding a node to a cluster.  My use case is a 
 bit strange (as explained before on this list) in that I try to grow the 
 cluster dynamically by having an external program automatically restart 
 Zookeeper servers in a controlled way whenever the list of participating ZK 
 servers needs to change.  This used to work just fine in 3.3.3 (and before), 
 so this represents a regression.
 The scenario I see is this:
 1) Start up a 1-server ZK cluster (the server has ZK ID 0).
 2) A client connects to the server, and makes a bunch of znodes, in 
 particular a znode called /membership.
 3) Shut down the cluster.
 4) Bring up a 2-server ZK cluster, including the original server 0 with its 
 existing data, and a new server with ZK ID 1.
 5) Node 0 has the highest zxid and is elected leader.
 6) A client connecting to server 1 tries to get /membership and gets back a 
 -101 error code (no such znode).
 7) The same client then tries to create /membership and gets back a -110 
 error code (znode already exists).
 8) Clients connecting to server 0 can successfully get /membership.
 I will attach a tarball with debug logs for both servers, annotating where 
 steps #1 and #4 happen.  You can see that the election involves a proposal 
 for zxid 110 from server 0, but immediately following the election server 1 
 has these lines:
 2011-12-05 17:18:48,308 9299 [QuorumPeer[myid=1]/127.0.0.1:2901] WARN 
 org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x10001 expected 
 0x1
 2011-12-05 17:18:48,313 9304 [SyncThread:1] INFO 
 org.apache.zookeeper.server.persistence.FileTxnLog  - Creating new log file: 
 log.10001
 Perhaps that's not relevant, but it struck me as odd.  At the end of server 
 1's log you can see a repeated cycle of getData-create-getData as the 
 client tries to make sense of the inconsistent responses.
 The other piece of information is that if I try to use the on-disk 
 directories for either of the servers to start a new one-node ZK cluster, all 
 the data are accessible.
 I haven't tried writing a program outside of my application to reproduce 
 this, but I can do it very easily with some of my app's tests if anyone needs 
 more information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-866) Adding no disk persistence option in zookeeper.

2011-12-05 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162856#comment-13162856
 ] 

Mahadev konar commented on ZOOKEEPER-866:
-

@Peter,
 I didn't. What I found was that the throughput when writing to disk was as 
good as the throughput with no persistence, so I didn't bother getting this in.

 Adding no disk persistence option in zookeeper.
 ---

 Key: ZOOKEEPER-866
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-866
 Project: ZooKeeper
  Issue Type: New Feature
Reporter: Mahadev konar
Assignee: Mahadev konar
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-nodisk.patch


 It's been seen that some folks would like to use ZooKeeper for very 
 fine-grained locking. Also, in their use case they are fine with losing all old 
 ZooKeeper state if they reboot ZooKeeper or ZooKeeper goes down. The use case 
 is more of a runtime locking wherein forgetting the state of locks is 
 acceptable in case of a ZooKeeper reboot. Not logging to disk allows high 
 throughput and low latency on writes to ZooKeeper. This would be a 
 configuration option to set (of course the default would be logging to disk).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1312) Add a getChildrenWithStat operation

2011-11-30 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13160494#comment-13160494
 ] 

Mahadev konar commented on ZOOKEEPER-1312:
--

Agree. Would be very useful!

 Add a getChildrenWithStat operation
 -

 Key: ZOOKEEPER-1312
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1312
 Project: ZooKeeper
  Issue Type: New Feature
Reporter: Daniel Lord

 It would be extremely useful to be able to have a getChildrenWithStat 
 method.  This method would behave exactly the same as getChildren but in 
 addition to returning the list of all child znode names it would also return 
 a Stat for each child.  I'm sure there are quite a few use cases for this but 
 it could save a lot of extra reads for my application.

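For context, the closest a client can get today is to combine the existing getChildren and exists calls. The sketch below uses the real ZooKeeper client API, but the helper class and method names are made up, and the result is not atomic; the extra per-child round trips are exactly the reads the proposed call would save.

{noformat}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Client-side emulation of the proposed getChildrenWithStat (illustrative only).
public class ChildrenWithStat {
    public static Map<String, Stat> getChildrenWithStat(ZooKeeper zk, String parent)
            throws KeeperException, InterruptedException {
        Map<String, Stat> result = new HashMap<String, Stat>();
        List<String> children = zk.getChildren(parent, false); // one read
        for (String child : children) {
            // One extra round trip per child; a child may vanish in between,
            // in which case exists() returns null and we skip it.
            Stat stat = zk.exists(parent + "/" + child, false);
            if (stat != null) {
                result.put(child, stat);
            }
        }
        return result;
    }
}
{noformat}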
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (BOOKKEEPER-31) Need a project logo

2011-11-22 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-31?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13155481#comment-13155481
 ] 

Mahadev konar commented on BOOKKEEPER-31:
-

@Ben,
 Nice one. I like it. Flavio, are you trying to scare people with black 
background ppts? :)

 Need a project logo
 ---

 Key: BOOKKEEPER-31
 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-31
 Project: Bookkeeper
  Issue Type: Improvement
Reporter: Benjamin Reed
Assignee: Benjamin Reed
 Attachments: bk_1.jpg, bk_2.jpg, bk_3.jpg, bk_4.jpg, 
 bookeper_black_sm.png, bookeper_white_sm.png


 We need a logo for the project, something that looks good both big and small 
 and is easily recognizable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1301) backport patches related to the zk startup script from 3.4 to 3.3 release

2011-11-16 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151671#comment-13151671
 ] 

Mahadev konar commented on ZOOKEEPER-1301:
--

Looking at the patch, I think we should do this:

3) Looks fine to me (Giri, can you just add an echo statement as Roman mentioned?)

1) Giri already fixed it.

2) Let's revert.

4) Let's revert.

 backport patches related to the zk startup script from 3.4 to 3.3 release 
 --

 Key: ZOOKEEPER-1301
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1301
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.3.4
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Attachments: zookeeper-1301-1.patch, zookeeper-1301.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1301) backport patches related to the zk startup script from 3.4 to 3.3 release

2011-11-16 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151699#comment-13151699
 ] 

Mahadev konar commented on ZOOKEEPER-1301:
--

Looks good. +1 on the patch.

 backport patches related to the zk startup script from 3.4 to 3.3 release 
 --

 Key: ZOOKEEPER-1301
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1301
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.3.4
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Attachments: zookeeper-1301-1.patch, zookeeper-1301-2.patch, 
 zookeeper-1301.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1239) add logging/stats to identify fsync stalls

2011-11-15 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150658#comment-13150658
 ] 

Mahadev konar commented on ZOOKEEPER-1239:
--

Camille,
 Can you please commit this to the 3.4 branch as well? 

Thanks!

 add logging/stats to identify fsync stalls
 --

 Key: ZOOKEEPER-1239
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1239
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1239_br33.patch, ZOOKEEPER-1239_br34.patch


 We don't have any logging to identify fsync stalls. It's a somewhat common 
 occurrence (after gc/swap issues) when trying to diagnose pipeline stalls - 
 where outstanding requests start piling up and operational latency increases.
 We should have some sort of logging around this, e.g. if the fsync time 
 exceeds some limit then log a warning.
 It would also be useful to publish stat information related to this: 
 min/avg/max latency for fsync.
 This should also be exposed through JMX.

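A minimal sketch of the idea, with a hypothetical helper class and threshold (the names and the 1000 ms value are assumptions, not whatever option ZooKeeper ended up shipping): time the force() call and log a WARN when it stalls.

{noformat}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: measure fsync latency and warn when it exceeds a limit.
public class FsyncTimer {
    private static final Logger LOG = LoggerFactory.getLogger(FsyncTimer.class);
    private static final long WARN_THRESHOLD_MS = 1000; // hypothetical limit

    public static void timedForce(FileChannel channel) throws IOException {
        long start = System.nanoTime();
        channel.force(false); // flush file data (not metadata) to disk
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (elapsedMs > WARN_THRESHOLD_MS) {
            LOG.warn("fsync took {}ms, longer than the {}ms warning threshold;"
                    + " this can stall the request pipeline",
                    elapsedMs, WARN_THRESHOLD_MS);
        }
        // Min/avg/max stats could be accumulated here and exposed via JMX.
    }
}
{noformat}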
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone

2011-11-14 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149837#comment-13149837
 ] 

Mahadev konar commented on ZOOKEEPER-1208:
--

Sorry I meant ZOOKEEPER-1239.

 Ephemeral node not removed after the client session is long gone
 

 Key: ZOOKEEPER-1208
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1208
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.3.3
Reporter: kishore gopalakrishna
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br33.patch, 
 ZOOKEEPER-1208_br34.patch, ZOOKEEPER-1208_trunk.patch


 Copying from email thread.
 We found our ZK server in a state where an ephemeral node still exists after
 a client session is long gone. I used the cons command on each ZK host to
 list all connections and couldn't find the ephemeralOwner id. We are using
 ZK 3.3.3. Has anyone seen this problem?
 I got the following information from the logs.
 The node that still exists is 
 /kafka-tracking/consumers/UserPerformanceEvent-host/owners/UserPerformanceEvent/529-7
 I saw that the ephemeral owner is 86167322861045079 which is session id 
 0x13220b93e610550.
 After searching in the transaction log of one of the ZK servers, I found that 
 the session had expired:
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 On digging further into the logs I found that there were multiple sessions 
 created in quick succession and every session tried to create the same node. 
 But I verified that the sessions were closed and opened in order
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x0 zxid 0x601bd36b5 
 createSession 6000
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x0 zxid 0x601bd36f8 
 createSession 6000
 9/22/11 12:17:59 PM PDT session 0x13220b93e610551 cxid 0x74 zxid 0x601bd373a 
 closeSession null
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x0 zxid 0x601bd373e 
 createSession 6000
 9/22/11 12:18:01 PM PDT session 0x13220b93e610552 cxid 0x6c zxid 0x601bd37a0 
 closeSession null
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x0 zxid 0x601bd37e9 
 createSession 6000
 9/22/11 12:18:03 PM PDT session 0x13220b93e610553 cxid 0x74 zxid 0x601bd382b 
 closeSession null
 9/22/11 12:18:04 PM PDT session 0x13220b93e610554 cxid 0x0 zxid 0x601bd383c 
 createSession 6000
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x6a zxid 0x601bd388f 
 closeSession null
 9/22/11 12:18:06 PM PDT session 0x13220b93e610555 cxid 0x0 zxid 0x601bd3895 
 createSession 6000
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x6a zxid 0x601bd38cd 
 closeSession null
 9/22/11 12:18:10 PM PDT session 0x13220b93e610556 cxid 0x0 zxid 0x601bd38d1 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x0 zxid 0x601bd38f2 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x51 zxid 0x601bd396a 
 closeSession null
 Here is the log output for the sessions that tried creating the same node
 9/22/11 12:17:54 PM PDT session 0x13220b93e61054f cxid 0x42 zxid 0x601bd366b 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x42 zxid 0x601bd36ce 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x42 zxid 0x601bd3711 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x42 zxid 0x601bd3777 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x42 zxid 0x601bd3802 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x44 zxid 0x601bd385d 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x44 zxid 0x601bd38b0 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x52 zxid 0x601bd396b 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 Let me know if you need additional information.

--
This message is automatically generated by JIRA.
If you 

[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone

2011-11-11 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13148335#comment-13148335
 ] 

Mahadev konar commented on ZOOKEEPER-1208:
--

Sorry for being out of action (blame Hadoop World :)). Looks like you found it, 
Pat. About the test case, I am not sure about the session id being 0. How is it 
tracking that the same session is being closed and a create on the same 
session is being sent?

 Ephemeral node not removed after the client session is long gone
 

 Key: ZOOKEEPER-1208
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1208
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.3.3
Reporter: kishore gopalakrishna
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br33.patch


 Copying from email thread.
 We found our ZK server in a state where an ephemeral node still exists after
 a client session is long gone. I used the cons command on each ZK host to
 list all connections and couldn't find the ephemeralOwner id. We are using
 ZK 3.3.3. Has anyone seen this problem?
 I got the following information from the logs.
 The node that still exists is 
 /kafka-tracking/consumers/UserPerformanceEvent-host/owners/UserPerformanceEvent/529-7
 I saw that the ephemeral owner is 86167322861045079 which is session id 
 0x13220b93e610550.
 After searching in the transaction log of one of the ZK servers, I found that 
 the session had expired:
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 On digging further into the logs I found that there were multiple sessions 
 created in quick succession and every session tried to create the same node. 
 But I verified that the sessions were closed and opened in order
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x0 zxid 0x601bd36b5 
 createSession 6000
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x0 zxid 0x601bd36f8 
 createSession 6000
 9/22/11 12:17:59 PM PDT session 0x13220b93e610551 cxid 0x74 zxid 0x601bd373a 
 closeSession null
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x0 zxid 0x601bd373e 
 createSession 6000
 9/22/11 12:18:01 PM PDT session 0x13220b93e610552 cxid 0x6c zxid 0x601bd37a0 
 closeSession null
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x0 zxid 0x601bd37e9 
 createSession 6000
 9/22/11 12:18:03 PM PDT session 0x13220b93e610553 cxid 0x74 zxid 0x601bd382b 
 closeSession null
 9/22/11 12:18:04 PM PDT session 0x13220b93e610554 cxid 0x0 zxid 0x601bd383c 
 createSession 6000
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x6a zxid 0x601bd388f 
 closeSession null
 9/22/11 12:18:06 PM PDT session 0x13220b93e610555 cxid 0x0 zxid 0x601bd3895 
 createSession 6000
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x6a zxid 0x601bd38cd 
 closeSession null
 9/22/11 12:18:10 PM PDT session 0x13220b93e610556 cxid 0x0 zxid 0x601bd38d1 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x0 zxid 0x601bd38f2 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x51 zxid 0x601bd396a 
 closeSession null
 Here is the log output for the sessions that tried creating the same node
 9/22/11 12:17:54 PM PDT session 0x13220b93e61054f cxid 0x42 zxid 0x601bd366b 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x42 zxid 0x601bd36ce 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x42 zxid 0x601bd3711 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x42 zxid 0x601bd3777 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x42 zxid 0x601bd3802 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x44 zxid 0x601bd385d 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x44 zxid 0x601bd38b0 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x52 zxid 0x601bd396b 
 create 
 

[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone

2011-11-11 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13148634#comment-13148634
 ] 

Mahadev konar commented on ZOOKEEPER-1208:
--

You are right. I was worried about the returned sid. Go ahead and upload patches 
for 3.4 and trunk. 

 Ephemeral node not removed after the client session is long gone
 

 Key: ZOOKEEPER-1208
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1208
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.3.3
Reporter: kishore gopalakrishna
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br33.patch


 Copying from email thread.
 We found our ZK server in a state where an ephemeral node still exists after
 a client session is long gone. I used the cons command on each ZK host to
 list all connections and couldn't find the ephemeralOwner id. We are using
 ZK 3.3.3. Has anyone seen this problem?
 I got the following information from the logs.
 The node that still exists is 
 /kafka-tracking/consumers/UserPerformanceEvent-host/owners/UserPerformanceEvent/529-7
 I saw that the ephemeral owner is 86167322861045079 which is session id 
 0x13220b93e610550.
 After searching in the transaction log of one of the ZK servers, I found that 
 the session had expired:
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 On digging further into the logs I found that there were multiple sessions 
 created in quick succession and every session tried to create the same node. 
 But I verified that the sessions were closed and opened in order
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x0 zxid 0x601bd36b5 
 createSession 6000
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x0 zxid 0x601bd36f8 
 createSession 6000
 9/22/11 12:17:59 PM PDT session 0x13220b93e610551 cxid 0x74 zxid 0x601bd373a 
 closeSession null
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x0 zxid 0x601bd373e 
 createSession 6000
 9/22/11 12:18:01 PM PDT session 0x13220b93e610552 cxid 0x6c zxid 0x601bd37a0 
 closeSession null
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x0 zxid 0x601bd37e9 
 createSession 6000
 9/22/11 12:18:03 PM PDT session 0x13220b93e610553 cxid 0x74 zxid 0x601bd382b 
 closeSession null
 9/22/11 12:18:04 PM PDT session 0x13220b93e610554 cxid 0x0 zxid 0x601bd383c 
 createSession 6000
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x6a zxid 0x601bd388f 
 closeSession null
 9/22/11 12:18:06 PM PDT session 0x13220b93e610555 cxid 0x0 zxid 0x601bd3895 
 createSession 6000
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x6a zxid 0x601bd38cd 
 closeSession null
 9/22/11 12:18:10 PM PDT session 0x13220b93e610556 cxid 0x0 zxid 0x601bd38d1 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x0 zxid 0x601bd38f2 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x51 zxid 0x601bd396a 
 closeSession null
 Here is the log output for the sessions that tried creating the same node
 9/22/11 12:17:54 PM PDT session 0x13220b93e61054f cxid 0x42 zxid 0x601bd366b 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x42 zxid 0x601bd36ce 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x42 zxid 0x601bd3711 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x42 zxid 0x601bd3777 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x42 zxid 0x601bd3802 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x44 zxid 0x601bd385d 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x44 zxid 0x601bd38b0 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x52 zxid 0x601bd396b 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 Let me know if you need additional information.

--
This message is automatically generated by 

[jira] [Commented] (ZOOKEEPER-1215) C client persisted cache

2011-11-08 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13146788#comment-13146788
 ] 

Mahadev konar commented on ZOOKEEPER-1215:
--

Marc,
 Sorry, I've been a little busy with 3.4. I will definitely comment on the JIRA 
after reading/thinking through this. 

Thanks

 C client persisted cache
 

 Key: ZOOKEEPER-1215
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1215
 Project: ZooKeeper
  Issue Type: New Feature
  Components: c client
Reporter: Marc Celani
Assignee: Marc Celani

 Motivation:
 1.  Reduce the impact of client restarts on zookeeper by implementing a 
 persisted cache, and only fetching deltas on restart
 2.  Reduce unnecessary calls to zookeeper.
 3.  Improve performance of gets by caching on the client
 4.  Allow for larger caches than in memory caches.
 Behavior Change:
 ZooKeeper clients will now have the option to specify a folder path where they 
 can cache ZooKeeper gets.  If they do choose to cache results, the zookeeper 
 library will check the persisted cache before actually sending a request to 
 zookeeper.  Watches will automatically be placed on all gets in order to 
 invalidate the cache.  Alternatively, we can add a cache flag to the get API 
 - thoughts?  On reconnect or restart, zookeeper clients will check the 
 version number of each entries into its persisted cache, and will invalidate 
 any old entries.  In checking version number, zookeeper clients will also 
 place a watch on those files.  In regards to watches, client watch handlers 
 will not fire until the invalidation step is completed, which may slow down 
 client watch handling. Since setting up watches on all files is necessary on 
 initialization, initialization will likely slow down as well.
 API Change:
 The zookeeper library will expose a new init interface that specifies a 
 folder path to the cache.  A new get API will specify whether or not to use 
 cache, and whether or not stale data is safe to return if the connection is 
 down.
 Design:
 The zookeeper handler structure will now include a cache_root_path (possibly 
 null) string to cache all gets, as well as a bool for whether or not it is 
 okay to serve stale data.  Old API calls will default to a null path (which 
 signifies no cache), and signify that it is not okay to serve stale data.
 The cache will be located at a cache_root_path.  All files will be placed at 
 cache_root_path/file_path.  The cache will be an incomplete copy of 
 everything that is in zookeeper, but everything in the cache will have the 
 same relative path from the cache_root_path that it has as a path in 
 zookeeper.  Each file in the cache will include the Stat structure and the 
 file contents.
 zoo_get will check the zookeeper handler to determine whether or not it has a 
 cache.  If it does, it will first go to the path to the persisted cache and 
 append the get path.  If the file exists and it is not invalidated, the 
 zookeeper client will read it and return its value.  If the file does not 
 exist or is invalidated, the zookeeper library will perform the same get as 
 is currently designed.  After getting the results, the library will place the 
 value in the persisted cache for subsequent reads.  zoo_set will 
 automatically invalidate the path in the cache.
 If caching is requested, then on each zoo_get that goes through to zookeeper, 
 a watch will be placed on the path. A cache watch handler will handle all 
 watch events by invalidating the cache, and placing another watch on it.  
 Client watch handlers will handle the watch event after the cache watch 
 handler.  The cache watch handler will not call zoo_get, because it is 
 assumed that the client watch handlers will call zoo_get if they need the 
 fresh data as soon as it is invalidated (which is why the cache watch handler 
 must be executed first).
 All updates to the cache will be done on a separate thread, but will be 
 queued in order to maintain consistency in the cache.  In addition, all 
 client watch handlers will not be fired until the cache watch handler 
 completes its invalidation write in order to ensure that client calls to 
 zoo_get in the watch event handler are done after the invalidation step.  
 This means that a client watch handler could be waiting on SEVERAL writes 
 before it can be fired off, since all writes are queued.
 When a new connection is made, if a zookeeper handler has a cache, then that 
 cache will be scanned in order to find all leaf nodes.  Calls will be made to 
 zookeeper to check if all of these nodes still exist, and if they do, what 
 their version number is.  Any inconsistencies in version will result in the 
 cache invalidating the out of date files.  Any files that no longer exist 
 will be deleted from the 

[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-04 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144469#comment-13144469
 ] 

Mahadev konar commented on ZOOKEEPER-1264:
--

Camille,
 Are you debugging the test failure in 3.4 or waiting for others to take a 
look? 

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264-branch34.patch, 
 ZOOKEEPER-1264-merge.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, ZOOKEEPER-1264unittest.patch, 
 ZOOKEEPER-1264unittest.patch, followerresyncfailure_log.txt.gz, logs.zip, 
 tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:11741 but was:14001
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.

2011-11-04 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144556#comment-13144556
 ] 

Mahadev konar commented on ZOOKEEPER-1270:
--

Alex,
 Can you please upload a patch that applies to trunk and 3.4 branch here? I'd 
like to get this done tonight.

 testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
 -

 Key: ZOOKEEPER-1270
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Patrick Hunt
Assignee: Flavio Junqueira
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1270-and-1194.patch, ZOOKEEPER-1270.patch, 
 ZOOKEEPER-1270.patch, ZOOKEEPER-1270_br34.patch, ZOOKEEPER-1270tests.patch, 
 ZOOKEEPER-1270tests2.patch, testEarlyLeaderAbandonment.txt.gz, 
 testEarlyLeaderAbandonment2.txt.gz, testEarlyLeaderAbandonment3.txt.gz, 
 testEarlyLeaderAbandonment4.txt.gz


 Looks pretty serious - quorum is formed but no clients can attach. Will 
 attach logs momentarily.
 This test was introduced in the following commit (all three jira commit at 
 once):
 ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their 
 logs.
 ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader
 ZOOKEEPER-1082. modify leader election to correctly take into account current

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.

2011-11-04 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144557#comment-13144557
 ] 

Mahadev konar commented on ZOOKEEPER-1270:
--

Alex,
 Please make sure that you grant the code changes to Apache. You just have to 
check the box that says "Grant license to Apache" when attaching the patch.

Please reattach the patch with the grant. Thanks.

 testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
 -

 Key: ZOOKEEPER-1270
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Patrick Hunt
Assignee: Flavio Junqueira
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1270-and-1194.patch, ZOOKEEPER-1270.patch, 
 ZOOKEEPER-1270.patch, ZOOKEEPER-1270_br34.patch, ZOOKEEPER-1270tests.patch, 
 ZOOKEEPER-1270tests2.patch, testEarlyLeaderAbandonment.txt.gz, 
 testEarlyLeaderAbandonment2.txt.gz, testEarlyLeaderAbandonment3.txt.gz, 
 testEarlyLeaderAbandonment4.txt.gz


 Looks pretty serious - quorum is formed but no clients can attach. Will 
 attach logs momentarily.
 This test was introduced in the following commit (all three jira commit at 
 once):
 ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their 
 logs.
 ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader
 ZOOKEEPER-1082. modify leader election to correctly take into account current

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.

2011-11-04 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144595#comment-13144595
 ] 

Mahadev konar commented on ZOOKEEPER-1270:
--

+1 on Alex's suggestion. Let's stick to minimal changes for now :).

 testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
 -

 Key: ZOOKEEPER-1270
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Patrick Hunt
Assignee: Flavio Junqueira
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1270-and-1194-branch34.patch, 
 ZOOKEEPER-1270-and-1194.patch, ZOOKEEPER-1270-and-1194.patch, 
 ZOOKEEPER-1270.patch, ZOOKEEPER-1270.patch, ZOOKEEPER-1270_br34.patch, 
 ZOOKEEPER-1270tests.patch, ZOOKEEPER-1270tests2.patch, 
 testEarlyLeaderAbandonment.txt.gz, testEarlyLeaderAbandonment2.txt.gz, 
 testEarlyLeaderAbandonment3.txt.gz, testEarlyLeaderAbandonment4.txt.gz


 Looks pretty serious - quorum is formed but no clients can attach. Will 
 attach logs momentarily.
 This test was introduced in the following commit (all three jira commit at 
 once):
 ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their 
 logs.
 ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader
 ZOOKEEPER-1082. modify leader election to correctly take into account current

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.

2011-11-02 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142709#comment-13142709
 ] 

Mahadev konar commented on ZOOKEEPER-1270:
--

Looks like the ZooKeeperServer does not start running within the quorum peers. 
There is something really wrong that prevents the followers/leader from 
starting the ZooKeeperServers. I suspect it has something to do with the 
NEWLEADER transaction (could be wrong). Need to look deeper. Another pair of 
eyes would help!

 testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
 -

 Key: ZOOKEEPER-1270
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Patrick Hunt
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: testEarlyLeaderAbandonment.txt.gz, 
 testEarlyLeaderAbandonment2.txt.gz


 Looks pretty serious - quorum is formed but no clients can attach. Will 
 attach logs momentarily.
 This test was introduced in the following commit (all three jira commit at 
 once):
 ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their 
 logs.
 ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader
 ZOOKEEPER-1082. modify leader election to correctly take into account current

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block

2011-11-01 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13140984#comment-13140984
 ] 

Mahadev konar commented on ZOOKEEPER-1246:
--

Looks good to me. Camille, do you want to check this in?

 Dead code in PrepRequestProcessor catch Exception block
 ---

 Key: ZOOKEEPER-1246
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Thomas Koch
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, 
 ZOOKEEPER-1246_trunk.patch, ZOOKEEPER-1246_trunk.patch


 This is a regression introduced by ZOOKEEPER-965 (multi transactions). The 
 catch(Exception e) block in PrepRequestProcessor.pRequest contains an if 
 block with condition request.getHdr() != null. This condition will always 
 evaluate to false since the changes in ZOOKEEPER-965.
 This is caused by a change in sequence: Before ZK-965, the txnHeader was set 
 _before_ the deserialization of the request. Afterwards, the deserialization 
 happens before request.setHdr is called. So the following RequestProcessors 
 won't see the request as a failed one but as a Read request, since it doesn't 
 have a hdr set.
 Notes:
 - it is very bad practice to catch Exception. The block should rather catch 
 IOException
 - The check whether the TxnHeader is set in the request is used at several 
 places to see whether the request is a read or a write request. It isn't 
 obvious to a newcomer what it means for a request to have a hdr set or not.
 - at the beginning of pRequest the hdr and txn of request are set to null. 
 However there is no chance that these fields could ever not be null at this 
 point. The code however suggests that this could be the case. There should 
 rather be an assertion that confirms that these fields are indeed null. The 
 practice of doing things just in case, even if there is no chance that this 
 case could happen, is a very stinky code smell and means that the code isn't 
 understandable or trustworthy.
 - The multi transaction switch case block in pRequest is very hard to read, 
 because it misuses the request.{hdr|txn} fields as local variables.

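A simplified, hypothetical illustration of the ordering change described above, using stand-in types rather than the actual PrepRequestProcessor code: with the header assigned before deserialization the catch block's check is reachable, while with deserialization first it becomes dead code.

{noformat}
// Illustrative only; Request and deserialize() are stand-ins.
public class HdrOrdering {
    static class Request { Object hdr; }  // stand-in for the real Request
    static void deserialize(Request r) { throw new RuntimeException("bad record"); }

    // Pre-ZOOKEEPER-965 shape: hdr is set before deserialization, so the
    // "hdr != null" branch can actually run when deserialization fails.
    static void pRequestOld(Request r) {
        try {
            r.hdr = new Object();   // txn header assigned first
            deserialize(r);         // may throw
        } catch (Exception e) {
            if (r.hdr != null) {    // reachable: request gets marked as failed
                System.out.println("marking request failed");
            }
        }
    }

    // Post-ZOOKEEPER-965 shape: deserialization happens first, so when it
    // throws, hdr is still null, the same check is dead code, and downstream
    // processors treat the failed request as a plain read.
    static void pRequestNew(Request r) {
        try {
            deserialize(r);         // throws before hdr is ever set
            r.hdr = new Object();
        } catch (Exception e) {
            if (r.hdr != null) {    // never true here: dead code
                System.out.println("marking request failed");
            }
        }
    }
}
{noformat}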
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-01 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141369#comment-13141369
 ] 

Mahadev konar commented on ZOOKEEPER-1264:
--

@Ben,
 Sorry to be pestering; I'd like to get the 3.4 RC1 out today. Please be back 
today :).

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, 
 followerresyncfailure_log.txt.gz, logs.zip, tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:11741 but was:14001
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1257) Rename MultiTransactionRecord to MultiRequest

2011-11-01 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141572#comment-13141572
 ] 

Mahadev konar commented on ZOOKEEPER-1257:
--

I looked through the code; the rename does not change the compatibility story. We 
can change it anytime we want. Not really a blocker for 3.4.

 Rename MultiTransactionRecord to MultiRequest
 -

 Key: ZOOKEEPER-1257
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1257
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Thomas Koch
Assignee: Thomas Koch
Priority: Critical

 Understanding the code behind multi operations doesn't get any easier when 
 the code violates naming consistency.
 All other Request classes are called xxxRequest; only for multi is it 
 xxxTransactionRecord! Also, Transaction is wrong, because there is already the 
 concept of transactions that are transmitted between quorum peers or 
 committed to disk. MultiTransactionRecord, however, is a _Request_ from a 
 client.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues

2011-11-01 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141654#comment-13141654
 ] 

Mahadev konar commented on ZOOKEEPER-1269:
--

Camille, Should this go into 3.4 or just trunk? 

 Multi deserialization issues
 

 Key: ZOOKEEPER-1269
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Camille Fournier
Assignee: Camille Fournier
 Attachments: ZOOKEEPER-1269.patch


 From the mailing list:
 FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure 
 during deserialization. The problem is explained there in a code comment. The 
 code block, however, is only executed for a CREATE txn, not for a multi txn 
 containing a CREATE.
 Even if the mentioned code block were also executed for multi transactions, it 
 would need adaptation for multi transactions. What if, after the first failed 
 transaction in a multi txn during deserialization, there were subsequent 
 transactions in the same multi that also failed?
 We don't know, since the first failed transaction hides the information about 
 the remaining transactions.
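
For illustration only, a minimal sketch of the kind of per-sub-op handling the description asks for, assuming the jute-generated MultiTxn/Txn accessors of the 3.4 codebase; this is not the committed fix:

{noformat}
import java.util.List;
import org.apache.zookeeper.ZooDefs.OpCode;
import org.apache.zookeeper.txn.MultiTxn;
import org.apache.zookeeper.txn.Txn;
import org.apache.zookeeper.txn.TxnHeader;

// Hedged sketch, not the committed fix: detect whether a replayed multi txn
// contains CREATE sub-ops, so the same NODEEXISTS recovery used for a plain
// CREATE can be applied per sub-op instead of being skipped entirely.
public final class MultiRestoreCheck {
    public static boolean containsCreate(TxnHeader hdr, MultiTxn multiTxn) {
        if (hdr.getType() != OpCode.multi) {
            return false;
        }
        List<Txn> subTxns = multiTxn.getTxns();
        for (Txn subTxn : subTxns) {
            if (subTxn.getType() == OpCode.create) {
                // the caller should apply the recovery logic per sub-op here, and
                // must not let the first failed sub-op hide failures of later ones
                return true;
            }
        }
        return false;
    }
}
{noformat}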

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1100) Killed (or missing) SendThread will cause hanging threads

2011-11-01 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141719#comment-13141719
 ] 

Mahadev konar commented on ZOOKEEPER-1100:
--

Camille,
 I don't think we have a dependency on mockito yet. I am adding one in 
ZOOKEEPER-1271.

 Killed (or missing) SendThread will cause hanging threads
 -

 Key: ZOOKEEPER-1100
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1100
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.3
 Environment: 
 http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E
Reporter: Gunnar Wagenknecht
Assignee: Rakesh R
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1100.patch, ZOOKEEPER-1100.patch


 After investigating an issues with [hanging 
 threads|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E]
  I noticed that any java.lang.Error might silently kill the SendThread. 
 Without a SendThread, any thread that wants to send something will hang 
 forever. 
 Currently nobody will recognize a SendThread that died. I think at least a 
 state should be flipped (or a flag should be set) that causes all further send 
 attempts to fail or to re-spin the connection loop.
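
A minimal, generic sketch of the "flip a state so senders fail fast" idea; this is not the actual ClientCnxn code, and all names below are illustrative:

{noformat}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hedged sketch of the suggested behaviour: if the sender thread dies from any
// Throwable (including java.lang.Error), record that fact so later send attempts
// fail immediately instead of blocking forever.
public class FailFastSender {
    private final AtomicBoolean sendThreadAlive = new AtomicBoolean(true);

    private final Thread sendThread = new Thread("SendThread-sketch") {
        @Override
        public void run() {
            try {
                // ... normal send loop would go here ...
            } catch (Throwable t) {          // Error included on purpose
                sendThreadAlive.set(false);  // flip the state before dying
            }
        }
    };

    public void queuePacket(byte[] packet) throws IOException {
        if (!sendThreadAlive.get()) {
            // fail fast (or trigger a reconnect) rather than hang the caller
            throw new IOException("send thread is dead, connection must be re-established");
        }
        // ... enqueue the packet for the send loop ...
    }
}
{noformat}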

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-10-31 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139970#comment-13139970
 ] 

Mahadev konar commented on ZOOKEEPER-1264:
--

Ben/Flavio,
 Any comments?

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, 
 tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 Saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1273) Copy'n'pasted unit test

2011-10-31 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139971#comment-13139971
 ] 

Mahadev konar commented on ZOOKEEPER-1273:
--

@Thomas, 
  Might be better to do that to make sure hudson agrees with the deletion.

 Copy'n'pasted unit test
 ---

 Key: ZOOKEEPER-1273
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1273
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Thomas Koch
Assignee: Thomas Koch
Priority: Trivial

 Probably caused by the usage of a legacy VCS, a code duplication happened when 
 you moved from Sourceforge to Apache (ZOOKEEPER-38). The following file can 
 be deleted:
 src/java/test/org/apache/zookeeper/server/DataTreeUnitTest.java
 src/java/test/org/apache/zookeeper/test/DataTreeTest.java was an exact copy 
 of the above until ZOOKEEPER-1046 added an additional test case only to the 
 latter.
 Do I need to upload a patch file for this?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-10-28 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138116#comment-13138116
 ] 

Mahadev konar commented on ZOOKEEPER-1264:
--

+1 looks good to me. Might want to check on the hudson tests. Looks like 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/709//testReport/ has an 
observer test failing? Doesn't seem related, but no harm in running the trunk 
patch through hudson again.

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 Saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1259) central mapping from type to txn record class

2011-10-27 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13137264#comment-13137264
 ] 

Mahadev konar commented on ZOOKEEPER-1259:
--

@Thomas,
 You can check the console output for C test failures:

 https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/702//console

{noformat}
  [exec]  [exec] 
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/src/c/tests/TestMulti.cc:574:
 Assertion: equality assertion failed [Expected: 0, Actual  : 709395008]
 [exec]  [exec] Failures !!!
 [exec]  [exec] Run: 57   Failure total: 1   Failures: 1   Errors: 0
 [exec]  [exec] make: *** [run-check] Error 1
 [exec] 
{noformat}


 central mapping from type to txn record class
 -

 Key: ZOOKEEPER-1259
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1259
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Thomas Koch
Assignee: Thomas Koch
 Attachments: ZOOKEEPER-1259.patch


 There are two places where large switch statements do nothing else than get the 
 correct Record class according to a txn type. Provided a static map in 
 SerializeUtils from type to Class<? extends Record> and a method to retrieve 
 a new txn Record instance for a type.
 Code size reduced by 28 lines.
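
A minimal sketch of the kind of mapping described above; the class name, placement, and exact set of entries are assumptions here and the actual patch may differ:

{noformat}
import java.util.HashMap;
import java.util.Map;
import org.apache.jute.Record;
import org.apache.zookeeper.ZooDefs.OpCode;
import org.apache.zookeeper.txn.CreateTxn;
import org.apache.zookeeper.txn.DeleteTxn;
import org.apache.zookeeper.txn.ErrorTxn;
import org.apache.zookeeper.txn.SetACLTxn;
import org.apache.zookeeper.txn.SetDataTxn;

// Hedged sketch of a central "op type -> txn Record class" mapping plus a
// factory method, replacing the duplicated switch statements.
public final class TxnRecordFactory {
    private static final Map<Integer, Class<? extends Record>> TXN_CLASSES =
            new HashMap<Integer, Class<? extends Record>>();
    static {
        TXN_CLASSES.put(OpCode.create, CreateTxn.class);
        TXN_CLASSES.put(OpCode.delete, DeleteTxn.class);
        TXN_CLASSES.put(OpCode.setData, SetDataTxn.class);
        TXN_CLASSES.put(OpCode.setACL, SetACLTxn.class);
        TXN_CLASSES.put(OpCode.error, ErrorTxn.class);
    }

    /** Returns a fresh, empty txn Record for the given op type, or null if unknown. */
    public static Record newTxnRecord(int type) throws Exception {
        Class<? extends Record> clazz = TXN_CLASSES.get(type);
        return clazz == null ? null : clazz.newInstance();
    }
}
{noformat}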

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1242) Repeat add watcher, memory leak

2011-10-24 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13133842#comment-13133842
 ] 

Mahadev konar commented on ZOOKEEPER-1242:
--

@Peng, 
 The jira seems to be resolved? The patch doesn't seem to be committed, any 
reason you marked this resolved?

 Repeat add watcher, memory leak  
 -

 Key: ZOOKEEPER-1242
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1242
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.3.3
 Environment: Redhat linux
Reporter: Peng Futian
  Labels: patch
 Fix For: 3.3.4

 Attachments: ZOOKEEPER-1242.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

  When I repeatedly add a watcher, there is a memory leak. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1240) Compiler issue with redhat linux

2011-10-24 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13133847#comment-13133847
 ] 

Mahadev konar commented on ZOOKEEPER-1240:
--

Peng, 
 You seem to have closed the jira again? Take a look at how to contribute at 
https://cwiki.apache.org/confluence/display/ZOOKEEPER/HowToContribute for 
guidance on how to upload/review/get it committed.

 Compiler issue with redhat linux
 

 Key: ZOOKEEPER-1240
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1240
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.3.3
 Environment: Linux phy 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 
 2007 x86_64 x86_64 x86_64 GNU/Linux
 gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)
Reporter: Peng Futian
Priority: Minor
  Labels: patch
 Fix For: 3.3.4

 Attachments: ZOOKEEPER-1240.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 When I compile the zookeeper c client in my project, there are some errors:
 ../../../include/zookeeper/recordio.h:70: error: expected unqualified-id 
 before '__extension__'
 ../../../include/zookeeper/recordio.h:70: error: expected `)' before 
 '__extension__'
 ../../../include/zookeeper/recordio.h:70: error: expected unqualified-id 
 before ')' token

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1197) Incorrect socket handling of 4 letter words for NIO

2011-10-11 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125524#comment-13125524
 ] 

Mahadev konar commented on ZOOKEEPER-1197:
--

Camille,
 What do we want to do then? Closing the connection from the client is probably not 
feasible. Should we just check in what we have? I am not a big fan of letting 
the connections linger on the server and then closing them later. 

 Incorrect socket handling of 4 letter words for NIO
 ---

 Key: ZOOKEEPER-1197
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1197
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Camille Fournier
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1197.patch


 When transferring a large amount of information from a 4 letter word, 
 especially in interactive mode (telnet or nc) over a slower network link, the 
 connection can be closed before all of the data has reached the client. This 
 is due to the way we handle nc non-interactive mode, by cancelling the 
 selector key. 
 Instead of cancelling the selector key for 4-letter-words, we should 
 flag the NIOServerCnxn to ignore detection of a close condition on that 
 socket (CancelledKeyException, EndOfStreamException). Since the 4lw will 
 close the connection immediately upon completion, this should be safe to do. 
 See ZOOKEEPER-737 for more details
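
As a rough illustration of the flagging idea only; the field and method names below are hypothetical, not the real NIOServerCnxn members, and the real patch may be structured differently:

{noformat}
// Hedged sketch only: once a four-letter-word response is being written, ignore
// close-detection exceptions instead of tearing the connection down early.
class FourLetterWordCnxnSketch {
    private volatile boolean fourLetterWordInProgress = false;

    void startFourLetterWord() {
        fourLetterWordInProgress = true;   // suppress close detection for this cnxn
        // ... stream the full 4lw response, then close the connection explicitly ...
    }

    void handleIoFailure(Exception e) {
        if (fourLetterWordInProgress) {
            // CancelledKeyException / EndOfStreamException are expected while the
            // 4lw response is still being flushed; the 4lw path closes the socket itself
            return;
        }
        // ... normal close/cleanup path for real client connections ...
    }
}
{noformat}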

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1210) Can't build ZooKeeper RPM with RPM >= 4.6.0 (i.e. on RHEL 6 and Fedora >= 10)

2011-10-07 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13122593#comment-13122593
 ] 

Mahadev konar commented on ZOOKEEPER-1210:
--

Tadeusz, 
 You might want to use --no-prefix for generating the patch.

 Can't build ZooKeeper RPM with RPM >= 4.6.0 (i.e. on RHEL 6 and Fedora >= 10)
 -

 Key: ZOOKEEPER-1210
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1210
 Project: ZooKeeper
  Issue Type: Bug
  Components: build
Affects Versions: 3.4.0
 Environment: Tested to fail on both Centos 6.0 and Fedora 14
Reporter: Tadeusz Andrzej Kadłubowski
Priority: Minor
  Labels: patch
 Attachments: rpm_buildroot_on_RHEL6.patch


 I was trying to build the zookeeper RPM (basically, `ant rpm 
 -Dskip.contrib=1`), using build scripts that were recently merged from the 
 work on the ZOOKEEPER-999 issue.
 The final stage, i.e. running rpmbuild failed. From what I understand it 
 mixed BUILD and BUILDROOT subdirectories in 
 /tmp/zookeeper_package_build_tkadlubo/, leaving BUILDROOT empty, and placing 
 everything in BUILD.
 The full build log is at http://pastebin.com/0ZvUAKJt (Caution: I cut out 
 long file listings from running tar -xvvf).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1215) C client persisted cache

2011-10-06 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13122503#comment-13122503
 ] 

Mahadev konar commented on ZOOKEEPER-1215:
--

@Marc,

 Can you elaborate on the use case for this? What are the issues that you are 
facing which are creating a need for client-side caching? Also, on a restart 
won't the client cache be invalid? Do you plan to persist the session and make 
sure you restart within the session expiry? 


 C client persisted cache
 

 Key: ZOOKEEPER-1215
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1215
 Project: ZooKeeper
  Issue Type: New Feature
  Components: c client
Reporter: Marc Celani
Assignee: Marc Celani

 Motivation:
 1.  Reduce the impact of client restarts on zookeeper by implementing a 
 persisted cache, and only fetching deltas on restart
 2.  Reduce unnecessary calls to zookeeper.
 3.  Improve performance of gets by caching on the client
 4.  Allow for larger caches than in memory caches.
 Behavior Change:
 Zookeeper clients will now have the option to specify a folder path where they 
 can cache zookeeper gets.  If they do choose to cache results, the zookeeper 
 library will check the persisted cache before actually sending a request to 
 zookeeper.  Watches will automatically be placed on all gets in order to 
 invalidate the cache.  Alternatively, we can add a cache flag to the get API 
 - thoughts?  On reconnect or restart, zookeeper clients will check the 
 version number of each entry in its persisted cache, and will invalidate 
 any old entries.  In checking version number, zookeeper clients will also 
 place a watch on those files.  In regards to watches, client watch handlers 
 will not fire until the invalidation step is completed, which may slow down 
 client watch handling. Since setting up watches on all files is necessary on 
 initialization, initialization will likely slow down as well.
 API Change:
 The zookeeper library will expose a new init interface that specifies a 
 folder path to the cache.  A new get API will specify whether or not to use 
 cache, and whether or not stale data is safe to return if the connection is 
 down.
 Design:
 The zookeeper handler structure will now include a cache_root_path (possibly 
 null) string to cache all gets, as well as a bool for whether or not it is 
 okay to serve stale data.  Old API calls will default to a null path (which 
 signifies no cache), and signify that it is not okay to serve stale data.
 The cache will be located at a cache_root_path.  All files will be placed at 
 cache_root_path/file_path.  The cache will be an incomplete copy of 
 everything that is in zookeeper, but everything in the cache will have the 
 same relative path from the cache_root_path that it has as a path in 
 zookeeper.  Each file in the cache will include the Stat structure and the 
 file contents.
 zoo_get will check the zookeeper handler to determine whether or not it has a 
 cache.  If it does, it will first go to the path to the persisted cache and 
 append the get path.  If the file exists and it is not invalidated, the 
 zookeeper client will read it and return its value.  If the file does not 
 exist or is invalidated, the zookeeper library will perform the same get as 
 is currently designed.  After getting the results, the library will place the 
 value in the persisted cache for subsequent reads.  zoo_set will 
 automatically invalidate the path in the cache.
 If caching is requested, then on each zoo_get that goes through to zookeeper, 
 a watch will be placed on the path. A cache watch handler will handle all 
 watch events by invalidating the cache, and placing another watch on it.  
 Client watch handlers will handle the watch event after the cache watch 
 handler.  The cache watch handler will not call zoo_get, because it is 
 assumed that the client watch handlers will call zoo_get if they need the 
 fresh data as soon as it is invalidated (which is why the cache watch handler 
 must be executed first).
 All updates to the cache will be done on a separate thread, but will be 
 queued in order to maintain consistency in the cache.  In addition, all 
 client watch handlers will not be fired until the cache watch handler 
 completes its invalidation write in order to ensure that client calls to 
 zoo_get in the watch event handler are done after the invalidation step.  
 This means that a client watch handler could be waiting on SEVERAL writes 
 before it can be fired off, since all writes are queued.
 When a new connection is made, if a zookeeper handler has a cache, then that 
 cache will be scanned in order to find all leaf nodes.  Calls will be made to 
 zookeeper to check if all of these nodes still exist, and if they do, what 
 their version number is.  

[jira] [Commented] (ZOOKEEPER-1112) Add support for C client for SASL authentication

2011-10-06 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13122505#comment-13122505
 ] 

Mahadev konar commented on ZOOKEEPER-1112:
--

Very glad to see this! Will take a look at the patch sometime this week!

 Add support for C client for SASL authentication
 

 Key: ZOOKEEPER-1112
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1112
 Project: ZooKeeper
  Issue Type: New Feature
Reporter: Eugene Koontz
 Attachments: ZOOKEEPER-1112.patch, zookeeper-c-client-sasl.patch


 Hopefully this would leverage the SASL server-side support provided by 
 ZOOKEEPER-938. It would be similar to the Java SASL client support also 
 provided in ZOOKEEPER-938.
 Java has built-in SASL support, but I'm not sure what C libraries are 
  available for SASL and, if so, whether they are compatible with the Apache license.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1189) For an invalid snapshot file(less than 10bytes size) RandomAccessFile stream is leaking.

2011-09-26 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115069#comment-13115069
 ] 

Mahadev konar commented on ZOOKEEPER-1189:
--

Thanks Rakesh, will go ahead and commit.

 For an invalid snapshot file(less than 10bytes size) RandomAccessFile stream 
 is leaking.
 

 Key: ZOOKEEPER-1189
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1189
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.3
Reporter: Rakesh R
Assignee: Rakesh R
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1189-branch-3.3.patch, ZOOKEEPER-1189.1.patch, 
 ZOOKEEPER-1189.patch


  When loading the snapshot, ZooKeeper will consider only snapshots that are at 
  least 10 bytes in size. Otherwise it will ignore the file and just return without 
  closing the RandomAccessFile.
 {noformat}
  Util.isValidSnapshot() has the following logic:
 // Check for a valid snapshot
 RandomAccessFile raf = new RandomAccessFile(f, "r");
 // including the header and the last / bytes
 // the snapshot should be atleast 10 bytes
 if (raf.length() < 10) {
     return false;
 }
 {noformat}
  Since the snapshot file validation logic is outside the try block, it won't reach 
  the finally block and the stream will be leaked.
  Suggestion: move the validation logic into the try/catch block.
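
For illustration, a hedged sketch of the suggested shape (not necessarily identical to the committed patch):

{noformat}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hedged sketch: keep the length check inside try/finally so the stream is
// closed on every path, including the "too short to be a snapshot" early return.
public final class SnapshotCheckSketch {
    public static boolean isValidSnapshot(File f) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        try {
            // including the header and trailing bytes, a usable snapshot
            // should be at least 10 bytes long
            if (raf.length() < 10) {
                return false;
            }
            // ... the rest of the existing validation stays inside the try ...
            return true;
        } finally {
            raf.close();   // previously skipped for short files, leaking the stream
        }
    }
}
{noformat}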

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1195) SASL authorizedID being incorrectly set: should use getHostName() rather than getServiceName()

2011-09-26 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115136#comment-13115136
 ] 

Mahadev konar commented on ZOOKEEPER-1195:
--

Eugene, 
 Should we just incorporate ZOOKEEPER-1201 into 3.4? What do you think?

 SASL authorizedID being incorrectly set: should use getHostName() rather than 
 getServiceName()
 --

 Key: ZOOKEEPER-1195
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1195
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.0
Reporter: Eugene Koontz
Assignee: Eugene Koontz
 Fix For: 3.4.0

 Attachments: SaslAuthNamingTest.java, ZOOKEEPER-1195.patch


 Tom Klonikowski writes:
 Hello developers,
 the SaslServerCallbackHandler in trunk changes the principal name
 service/host@REALM to service/service@REALM (I guess unintentionally).
 lines 131-133:
 if (!removeHost() && (kerberosName.getHostName() != null)) {
   userName += "/" + kerberosName.getServiceName();
 }
 Server Log:
 SaslServerCallbackHandler@115] - Successfully authenticated client:
 authenticationID=fetcher/ubook@QUINZOO;
 authorizationID=fetcher/ubook@QUINZOO.
 SaslServerCallbackHandler@137] - Setting authorizedID:
 fetcher/fetcher@QUINZOO
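
The fix implied by the summary is to append the host part rather than the service part; a sketch of the corrected lines (the same snippet as above, with getHostName() substituted):

{noformat}
// Hedged sketch of the corrected lines 131-133: use the host component of the
// Kerberos name, so service/host@REALM is preserved instead of becoming
// service/service@REALM.
if (!removeHost() && (kerberosName.getHostName() != null)) {
    userName += "/" + kerberosName.getHostName();
}
{noformat}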

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1181) Fix problems with Kerberos TGT renewal

2011-09-26 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115144#comment-13115144
 ] 

Mahadev konar commented on ZOOKEEPER-1181:
--

Eugene,
 We should write some unit tests for this. I am fine checking this into 3.4 for 
now. Can you please create a ticket to add a unit test for this? Mockito would 
be very helpful here.

I might make some changes to the patch to get this in ASAP.

 Fix problems with Kerberos TGT renewal
 --

 Key: ZOOKEEPER-1181
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1181
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client, server
Affects Versions: 3.4.0
Reporter: Eugene Koontz
Assignee: Eugene Koontz
  Labels: kerberos, security
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1181.patch, ZOOKEEPER-1181.patch


 Currently, in Zookeeper trunk, there are two problems with Kerberos TGT 
 renewal:
 1. TGTs obtained from a keytab are not refreshed periodically. They should 
 be, just as those from the ticket cache are refreshed.
 2. Ticket renewal should be retried if it fails. Ticket renewal might fail if 
 two or more separate processes (different JVMs) running as the same user try 
 to renew Kerberos credentials at the same time. 
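
A generic, hedged sketch of the "retry renewal on failure" behaviour described in point 2; the class and helper names are illustrative and this is not the actual Login thread code:

{noformat}
// Hedged sketch: retry Kerberos ticket renewal with capped backoff, since a
// concurrent renewal by another JVM running as the same user can make a single
// attempt fail. renewTicket() is a stand-in for the real relogin/kinit -R call.
public final class TgtRenewalSketch {
    public static void renewUntilSuccess() throws InterruptedException {
        long backoffMs = 60 * 1000L;                  // first retry after one minute
        final long maxBackoffMs = 10 * 60 * 1000L;    // cap the wait at ten minutes
        while (!renewTicket()) {
            Thread.sleep(backoffMs);
            backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
        }
    }

    private static boolean renewTicket() {
        // hypothetical hook: perform the actual renewal and report success/failure
        return true;
    }
}
{noformat}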

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1174) FD leak when network unreachable

2011-09-26 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115159#comment-13115159
 ] 

Mahadev konar commented on ZOOKEEPER-1174:
--

Ted,
 Any update on this? Please let me know. I plan to cut a release soon and would 
like to get this in.

thanks

 FD leak when network unreachable
 

 Key: ZOOKEEPER-1174
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1174
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.3
Reporter: Ted Dunning
Assignee: Ted Dunning
Priority: Critical
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1174.patch, ZOOKEEPER-1174.patch, 
 ZOOKEEPER-1174.patch, ZOOKEEPER-1174.patch, ZOOKEEPER-1174.patch, 
 zk-fd-leak.tgz


 In the socket connection logic there are several errors that result in bad 
 behavior.  The basic problem is that a socket is registered with a selector 
 unconditionally when there are nuances that should be dealt with.  First, the 
 socket may connect immediately.  Secondly, the connect may throw an 
 exception.  In either of these two cases, I don't think that the socket 
 should be registered.
 I will attach a test case that demonstrates the problem.  I have been unable 
 to create a unit test that exhibits the problem because I would have to mock 
 the low level socket libraries to do so.  It would still be good to do so if 
 somebody can figure out a good way.
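
A hedged sketch, in plain java.nio, of the register-after-connect nuance the description points at; this is not the actual ClientCnxn socket code:

{noformat}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hedged sketch: only register for OP_CONNECT when the connect is actually
// pending; close the channel if connect() throws, so no FD is left behind.
public final class ConnectSketch {
    public static SocketChannel connectAndRegister(Selector selector, InetSocketAddress addr)
            throws IOException {
        SocketChannel sock = SocketChannel.open();
        sock.configureBlocking(false);
        boolean immediatelyConnected;
        try {
            immediatelyConnected = sock.connect(addr);
        } catch (IOException e) {
            sock.close();   // connect failed outright: do not register, do not leak the FD
            throw e;
        }
        if (immediatelyConnected) {
            // already connected: skip OP_CONNECT and go straight to read/write interest
            sock.register(selector, SelectionKey.OP_READ | SelectionKey.OP_WRITE);
        } else {
            sock.register(selector, SelectionKey.OP_CONNECT);
        }
        return sock;
    }
}
{noformat}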

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1174) FD leak when network unreachable

2011-09-26 Thread Mahadev konar (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115202#comment-13115202
 ] 

Mahadev konar commented on ZOOKEEPER-1174:
--

Wed night my time?


 FD leak when network unreachable
 

 Key: ZOOKEEPER-1174
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1174
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.3
Reporter: Ted Dunning
Assignee: Ted Dunning
Priority: Critical
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1174.patch, ZOOKEEPER-1174.patch, 
 ZOOKEEPER-1174.patch, ZOOKEEPER-1174.patch, ZOOKEEPER-1174.patch, 
 zk-fd-leak.tgz


 In the socket connection logic there are several errors that result in bad 
 behavior.  The basic problem is that a socket is registered with a selector 
 unconditionally when there are nuances that should be dealt with.  First, the 
 socket may connect immediately.  Secondly, the connect may throw an 
 exception.  In either of these two cases, I don't think that the socket 
 should be registered.
 I will attach a test case that demonstrates the problem.  I have been unable 
 to create a unit test that exhibits the problem because I would have to mock 
 the low level socket libraries to do so.  It would still be good to do so if 
 somebody can figure out a good way.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira