[jira] Resolved: (ZOOKEEPER-889) pyzoo_aget_children crashes due to incorrect watcher context
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Austin Shoemaker resolved ZOOKEEPER-889.

    Resolution: Fixed

Just noticed that the fix is already in trunk, closing the issue.

> pyzoo_aget_children crashes due to incorrect watcher context
>
> Key: ZOOKEEPER-889
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-889
> Project: Zookeeper
> Issue Type: Bug
> Components: contrib-bindings
> Affects Versions: 3.3.1
> Environment: OS X 10.6.5, Python 2.6.1
> Reporter: Austin Shoemaker
> Priority: Critical
> Attachments: repro.py
>
> The pyzoo_aget_children function passes the completion callback ("pyw") in
> place of the watcher callback ("get_pyw"). Since it is a one-shot callback,
> it is deallocated after the completion callback fires, causing a crash when
> the watcher callback should be invoked.
[jira] Updated: (ZOOKEEPER-889) pyzoo_aget_children crashes due to incorrect watcher context
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Austin Shoemaker updated ZOOKEEPER-889:
---------------------------------------

    Attachment: repro.py

Minimal repro script

> pyzoo_aget_children crashes due to incorrect watcher context
>
> Key: ZOOKEEPER-889
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-889
> Project: Zookeeper
> Issue Type: Bug
> Components: contrib-bindings
> Affects Versions: 3.3.1
> Environment: OS X 10.6.5, Python 2.6.1
> Reporter: Austin Shoemaker
> Priority: Critical
> Attachments: repro.py
>
> The pyzoo_aget_children function passes the completion callback ("pyw") in
> place of the watcher callback ("get_pyw"). Since it is a one-shot callback,
> it is deallocated after the completion callback fires, causing a crash when
> the watcher callback should be invoked.
[jira] Created: (ZOOKEEPER-889) pyzoo_aget_children crashes due to incorrect watcher context
pyzoo_aget_children crashes due to incorrect watcher context

    Key: ZOOKEEPER-889
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-889
    Project: Zookeeper
    Issue Type: Bug
    Components: contrib-bindings
    Affects Versions: 3.3.1
    Environment: OS X 10.6.5, Python 2.6.1
    Reporter: Austin Shoemaker
    Priority: Critical

The pyzoo_aget_children function passes the completion callback ("pyw") in place
of the watcher callback ("get_pyw"). Since it is a one-shot callback, it is
deallocated after the completion callback fires, causing a crash when the
watcher callback should be invoked.
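For illustration, here is a minimal Python 2 / zkpython sketch of the code path described above. This is not the attached repro.py; the node names, server address, and sleeps are assumptions, and it expects a ZooKeeper server on localhost:2181.

# Minimal sketch (assumed names, not the attached repro.py) of the ZOOKEEPER-889
# code path: an async get_children registered with both a completion callback
# and a watcher.
import time
import zookeeper

OPEN_ACL_UNSAFE = [{"perms": 0x1f, "scheme": "world", "id": "anyone"}]

def child_watcher(handle, event_type, state, path):
    # With the bug, the context registered for this watcher was the one-shot
    # completion object, already freed by the time the watch fires.
    print "child watch fired for", path

def children_done(handle, rc, children):
    print "completion: rc=%d children=%s" % (rc, children)

zh = zookeeper.init("localhost:2181")
time.sleep(1)  # crude wait for the session to establish

zookeeper.create(zh, "/zk889", "", OPEN_ACL_UNSAFE, 0)
zookeeper.aget_children(zh, "/zk889", child_watcher, children_done)
time.sleep(1)  # the completion fires and its one-shot wrapper is deallocated

# Creating a child triggers the watcher; with the bug this dereferenced freed memory.
zookeeper.create(zh, "/zk889/child", "", OPEN_ACL_UNSAFE, 0)
time.sleep(1)
zookeeper.close(zh)

With the fix on trunk, the watcher gets its own context instead of the freed completion context, so the last create() should simply invoke child_watcher.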
Re: znode inconsistencies across ZooKeeper servers
Hi Patrick,

You are correct, the test restarts both the ZooKeeper server and the client. The
client opens a new connection after restarting, so we would expect the ephemeral
znode (/foo) to expire after the session timeout. However, the client with the
new session creates the ephemeral znode (/foo) again after it reboots (it sets a
watch for /foo and recreates /foo if it is deleted or doesn't exist). The client
is not reusing the session ID. What I expect to see is that the older /foo should
expire, after which a new /foo should get created. Is my expectation correct?

What confuses me is the following output of 3 successive getstat /foo requests on
A (the zxid, time and owner fields). Notice that the older znode reappeared. At
the same time, when I do getstat at B and C, I see the newer /foo.

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x105ef
ctime = Tue Oct 05 15:00:50 UTC 2010
mZxid = 0x105ef
mtime = Tue Oct 05 15:00:50 UTC 2010
pZxid = 0x105ef
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce57ce4
dataLength = 54
numChildren = 0

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x10607
ctime = Tue Oct 05 15:01:07 UTC 2010
mZxid = 0x10607
mtime = Tue Oct 05 15:01:07 UTC 2010
pZxid = 0x10607
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce5bda4
dataLength = 54
numChildren = 0

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x105ef
ctime = Tue Oct 05 15:00:50 UTC 2010
mZxid = 0x105ef
mtime = Tue Oct 05 15:00:50 UTC 2010
pZxid = 0x105ef
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce57ce4
dataLength = 54
numChildren = 0

Thanks for your help.

-Vishal

On Wed, Oct 6, 2010 at 4:45 PM, Patrick Hunt wrote:

> Vishal the attachment seems to be getting removed by the list daemon (I
> don't have it), can you create a JIRA and attach? Also this is a good
> question for the ppl on zookeeper-user. (ccing)
>
> You are aware that ephemeral znodes are tied to the session? And that
> sessions only expire after the session timeout period? At which time any
> znodes created during that session are then deleted. The fact that you are
> "kill"ing your client process leads me to believe that you are not closing
> the session cleanly (meaning that it will eventually expire after the
> session timeout period), in which case the ephemeral znodes _should_
> reappear when A is restarted and successfully rejoins the cluster (at least
> until the session timeout is exceeded).
>
> Patrick
>
> On Tue, Oct 5, 2010 at 11:04 AM, Vishal K wrote:
>
> > Hi,
> >
> > I have a 3 node ZK cluster (A, B, C). On one of the nodes (node A), I
> > have a ZK client running that connects to the local server and creates an
> > ephemeral znode to indicate to clients on other nodes that it is online.
> >
> > I have a test script that reboots the zookeeper server as well as the
> > client on A. The test does a getstat on the ephemeral znode created by
> > the client on A. I am seeing that the view of znodes on A is different
> > from the other 2 nodes. I can tell this from the session ID that the
> > client gets after reconnecting to the local ZK server.
> >
> > So the test is simple:
> > - kill zookeeper server and client process
> > - wait for a few seconds
> > - do zkCli.sh stat ... > test.out
> >
> > What I am seeing is that the ephemeral znode with the old zxid, time, and
> > session ID is reappearing on node A. I have attached the output of 3
> > consecutive getstat requests of the test (see client_getstat.out). Notice
> > that the third output is the same as the first one. That is, the old
> > ephemeral znode reappeared at A. However, both B and C are showing the
> > latest znode with the correct time, zxid and session ID (output not attached).
> >
> > After this point, all following getstat requests on A are showing the old
> > znode, whereas B and C show the correct znode every time the client on A
> > comes online. This is something very perplexing. Earlier I thought this
> > was a bug in my client implementation, but the test shows that the ZK
> > server on A after reboot is out of sync with the rest of the servers.
> >
> > The stat command to each server shows that the servers are in sync as far
> > as zxids are concerned (see stat.out). So there is something wrong with
> > A's local database that is causing this problem.
> >
> > Has anyone seen this before? I will be doing more debugging in the next
> > few days. Comments/suggestions for further debugging are welcome.
> >
> > -Vishal
Re: znode inconsistencies across ZooKeeper servers
Vishal the attachment seems to be getting removed by the list daemon (I don't
have it), can you create a JIRA and attach? Also this is a good question for
the ppl on zookeeper-user. (ccing)

You are aware that ephemeral znodes are tied to the session? And that sessions
only expire after the session timeout period? At which time any znodes created
during that session are then deleted. The fact that you are "kill"ing your
client process leads me to believe that you are not closing the session cleanly
(meaning that it will eventually expire after the session timeout period), in
which case the ephemeral znodes _should_ reappear when A is restarted and
successfully rejoins the cluster (at least until the session timeout is exceeded).

Patrick

On Tue, Oct 5, 2010 at 11:04 AM, Vishal K wrote:

> Hi,
>
> I have a 3 node ZK cluster (A, B, C). On one of the nodes (node A), I
> have a ZK client running that connects to the local server and creates an
> ephemeral znode to indicate to clients on other nodes that it is online.
>
> I have a test script that reboots the zookeeper server as well as the
> client on A. The test does a getstat on the ephemeral znode created by
> the client on A. I am seeing that the view of znodes on A is different
> from the other 2 nodes. I can tell this from the session ID that the
> client gets after reconnecting to the local ZK server.
>
> So the test is simple:
> - kill zookeeper server and client process
> - wait for a few seconds
> - do zkCli.sh stat ... > test.out
>
> What I am seeing is that the ephemeral znode with the old zxid, time, and
> session ID is reappearing on node A. I have attached the output of 3
> consecutive getstat requests of the test (see client_getstat.out). Notice
> that the third output is the same as the first one. That is, the old
> ephemeral znode reappeared at A. However, both B and C are showing the
> latest znode with the correct time, zxid and session ID (output not attached).
>
> After this point, all following getstat requests on A are showing the old
> znode, whereas B and C show the correct znode every time the client on A
> comes online. This is something very perplexing. Earlier I thought this
> was a bug in my client implementation, but the test shows that the ZK
> server on A after reboot is out of sync with the rest of the servers.
>
> The stat command to each server shows that the servers are in sync as far
> as zxids are concerned (see stat.out). So there is something wrong with
> A's local database that is causing this problem.
>
> Has anyone seen this before? I will be doing more debugging in the next
> few days. Comments/suggestions for further debugging are welcome.
>
> -Vishal
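To make the session/ephemeral relationship concrete, here is an illustrative zkpython sketch (Python 2; the session timeout, data, and the use of /foo are assumptions, not taken from Vishal's actual client): the ephemeral znode belongs to the session, so a hard-killed client leaves /foo visible on all servers until the session timeout expires.

# Illustrative sketch: ephemeral znode lifetime is tied to the session, not the process.
import os
import time
import zookeeper

OPEN_ACL_UNSAFE = [{"perms": 0x1f, "scheme": "world", "id": "anyone"}]

zh = zookeeper.init("localhost:2181", None, 30000)  # assumed 30s session timeout
time.sleep(1)  # crude wait for the session to establish

zookeeper.create(zh, "/foo", "online", OPEN_ACL_UNSAFE, zookeeper.EPHEMERAL)
print zookeeper.exists(zh, "/foo")  # stat: ephemeralOwner is this session's id

# Exit without zookeeper.close(), like "kill"ing the client: the session stays
# alive on the servers, so /foo (with the old ephemeralOwner) remains visible
# for up to ~30 more seconds before the servers delete it.
os._exit(0)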
[jira] Updated: (ZOOKEEPER-820) update c unit tests to ensure "zombie" java server processes don't cause failure
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-820:
-----------------------------------

    Status: Open  (was: Patch Available)

Cancelling patch - needs to be updated for Mahadev's most recent comment.

> update c unit tests to ensure "zombie" java server processes don't cause failure
>
> Key: ZOOKEEPER-820
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-820
> Project: Zookeeper
> Issue Type: Bug
> Affects Versions: 3.3.1
> Reporter: Patrick Hunt
> Assignee: Michi Mutsuzaki
> Priority: Critical
> Fix For: 3.3.2, 3.4.0
>
> Attachments: ZOOKEEPER-820-1.patch, ZOOKEEPER-820.patch, ZOOKEEPER-820.patch
>
> When the c unit tests are run, sometimes the server doesn't shut down at the
> end of the test, which causes subsequent tests (hudson esp) to fail.
> 1) we should try harder to make the server shut down at the end of the test;
>    I suspect this is related to test failure/cleanup
> 2) before the tests are run we should see if the old server is still running
>    and try to shut it down
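A rough Python sketch of idea (2) from the description above (this is not the actual ZOOKEEPER-820 patch, which lives in the C test harness; the port and process pattern here are assumptions): probe the client port with ZooKeeper's "ruok" four-letter command and kill any leftover server that still answers before the tests start.

# Pre-test check (sketch): detect and kill a leftover "zombie" server.
import socket
import subprocess

def server_alive(host="127.0.0.1", port=2181, timeout=2.0):
    # Returns True if something answers ZooKeeper's "ruok" probe with "imok".
    try:
        s = socket.create_connection((host, port), timeout)
        s.sendall("ruok")
        reply = s.recv(4)
        s.close()
        return reply == "imok"
    except socket.error:
        return False

if server_alive():
    # Assumed to match how the test harness launches the server.
    subprocess.call(["pkill", "-f", "org.apache.zookeeper.server.quorum.QuorumPeerMain"])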
[jira] Updated: (ZOOKEEPER-822) Leader election taking a long time to complete
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-822:
-----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to trunk/33 - thanks Vishal and everyone who pushed this through!

> Leader election taking a long time to complete
> -----------------------------------------------
>
> Key: ZOOKEEPER-822
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
> Project: Zookeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.3.0
> Reporter: Vishal K
> Assignee: Vishal K
> Priority: Blocker
> Fix For: 3.3.2, 3.4.0
>
> Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log,
> test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz,
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch,
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822.patch,
> ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch,
> ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch_v1
>
> Created a 3 node cluster.
> 1. Fail the ZK leader
> 2. Let leader election finish. Restart the leader and let it join the
> 3. Repeat
> After a few rounds leader election takes anywhere from 25-60 seconds to finish.
> Note - we didn't have any ZK clients and no new znodes were created.
>
> zoo.cfg is shown below:
>
> #Mon Jul 19 12:15:10 UTC 2010
> server.1=192.168.4.12\:2888\:3888
> server.0=192.168.4.11\:2888\:3888
> clientPort=2181
> dataDir=/var/zookeeper
> syncLimit=2
> server.2=192.168.4.13\:2888\:3888
> initLimit=5
> tickTime=2000
>
> I have attached logs from two nodes that took a long time to form the cluster
> after failing the leader. The leader was down anyway, so logs from that node
> shouldn't matter.
> Look for "START HERE". Logs after that point should be of our interest.
[jira] Updated: (ZOOKEEPER-844) handle auth failure in java client
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-844:
-----------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed to trunk/3.3. Thanks Camille!

> handle auth failure in java client
> -----------------------------------
>
> Key: ZOOKEEPER-844
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-844
> Project: Zookeeper
> Issue Type: Bug
> Components: java client
> Affects Versions: 3.3.1
> Reporter: Camille Fournier
> Assignee: Camille Fournier
> Fix For: 3.3.2, 3.4.0
>
> Attachments: ZOOKEEPER-844.patch, ZOOKEEPER332-844
>
> ClientCnxn.java currently has the following code:
>
>     if (replyHdr.getXid() == -4) {
>         // -2 is the xid for AuthPacket
>         // TODO: process AuthPacket here
>         if (LOG.isDebugEnabled()) {
>             LOG.debug("Got auth sessionid:0x"
>                     + Long.toHexString(sessionId));
>         }
>         return;
>     }
>
> Auth failures appear to cause the server to disconnect but the client never
> gets a proper state change or notification that auth has failed, which makes
> handling this scenario very difficult as it causes the client to go into a
> loop of sending bad auth, getting disconnected, trying to reconnect, sending
> bad auth again, over and over.
[jira] Commented: (ZOOKEEPER-844) handle auth failure in java client
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918566#action_12918566 ]

Patrick Hunt commented on ZOOKEEPER-844:

+1 looks good to me. Thanks Camille!

> handle auth failure in java client
> -----------------------------------
>
> Key: ZOOKEEPER-844
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-844
> Project: Zookeeper
> Issue Type: Bug
> Components: java client
> Affects Versions: 3.3.1
> Reporter: Camille Fournier
> Assignee: Camille Fournier
> Fix For: 3.3.2, 3.4.0
>
> Attachments: ZOOKEEPER-844.patch, ZOOKEEPER332-844
>
> ClientCnxn.java currently has the following code:
>
>     if (replyHdr.getXid() == -4) {
>         // -2 is the xid for AuthPacket
>         // TODO: process AuthPacket here
>         if (LOG.isDebugEnabled()) {
>             LOG.debug("Got auth sessionid:0x"
>                     + Long.toHexString(sessionId));
>         }
>         return;
>     }
>
> Auth failures appear to cause the server to disconnect but the client never
> gets a proper state change or notification that auth has failed, which makes
> handling this scenario very difficult as it causes the client to go into a
> loop of sending bad auth, getting disconnected, trying to reconnect, sending
> bad auth again, over and over.
[jira] Updated: (ZOOKEEPER-888) c-client / zkpython: Double free corruption on node watcher
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-888:
-----------------------------------

    Fix Version/s: 3.4.0
                   3.3.2

Borderline blocker, Henry any insight on this? Something that can be addressed for 3.3.2?

> c-client / zkpython: Double free corruption on node watcher
> ------------------------------------------------------------
>
> Key: ZOOKEEPER-888
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-888
> Project: Zookeeper
> Issue Type: Bug
> Components: c client, contrib-bindings
> Affects Versions: 3.3.1
> Reporter: Lukas
> Priority: Critical
> Fix For: 3.3.2, 3.4.0
>
> Attachments: resume-segfault.py
>
> the c-client / zkpython wrapper invokes an already-freed watcher callback
>
> steps to reproduce:
> 0. start a zookeeper server on your machine
> 1. run the attached python script
> 2. suspend the zookeeper server process (e.g. using `pkill -STOP -f org.apache.zookeeper.server.quorum.QuorumPeerMain`)
> 3. wait until the connection and the node observer fired with a session event
> 4. resume the zookeeper server process (e.g. using `pkill -CONT -f org.apache.zookeeper.server.quorum.QuorumPeerMain`)
>
> -> the client tries to dispatch the node observer function again, but it was
> already freed -> double free corruption
[jira] Updated: (ZOOKEEPER-888) c-client / zkpython: Double free corruption on node watcher
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukas updated ZOOKEEPER-888:

    Attachment: resume-segfault.py

Example code for triggering the bug

> c-client / zkpython: Double free corruption on node watcher
> ------------------------------------------------------------
>
> Key: ZOOKEEPER-888
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-888
> Project: Zookeeper
> Issue Type: Bug
> Components: c client, contrib-bindings
> Affects Versions: 3.3.1
> Reporter: Lukas
> Priority: Critical
> Attachments: resume-segfault.py
>
> the c-client / zkpython wrapper invokes an already-freed watcher callback
>
> steps to reproduce:
> 0. start a zookeeper server on your machine
> 1. run the attached python script
> 2. suspend the zookeeper server process (e.g. using `pkill -STOP -f org.apache.zookeeper.server.quorum.QuorumPeerMain`)
> 3. wait until the connection and the node observer fired with a session event
> 4. resume the zookeeper server process (e.g. using `pkill -CONT -f org.apache.zookeeper.server.quorum.QuorumPeerMain`)
>
> -> the client tries to dispatch the node observer function again, but it was
> already freed -> double free corruption
[jira] Created: (ZOOKEEPER-888) c-client / zkpython: Double free corruption on node watcher
c-client / zkpython: Double free corruption on node watcher
------------------------------------------------------------

    Key: ZOOKEEPER-888
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-888
    Project: Zookeeper
    Issue Type: Bug
    Components: c client, contrib-bindings
    Affects Versions: 3.3.1
    Reporter: Lukas
    Priority: Critical
    Attachments: resume-segfault.py

the c-client / zkpython wrapper invokes an already-freed watcher callback

steps to reproduce:
0. start a zookeeper server on your machine
1. run the attached python script
2. suspend the zookeeper server process (e.g. using `pkill -STOP -f org.apache.zookeeper.server.quorum.QuorumPeerMain`)
3. wait until the connection and the node observer fired with a session event
4. resume the zookeeper server process (e.g. using `pkill -CONT -f org.apache.zookeeper.server.quorum.QuorumPeerMain`)

-> the client tries to dispatch the node observer function again, but it was already freed -> double free corruption
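For illustration, a minimal zkpython sketch of the scenario (this is not the attached resume-segfault.py; the watched path and timings are assumptions). The node observer below is the watcher whose C-side context is freed after the first session event, so a later dispatch corrupts the heap:

# Sketch of the ZOOKEEPER-888 setup: a node watcher that receives repeated
# session events while the server is suspended and resumed.
import time
import zookeeper

def conn_watcher(handle, event_type, state, path):
    print "connection event: type=%d state=%d" % (event_type, state)

def node_watcher(handle, event_type, state, path):
    print "node event: type=%d state=%d path=%s" % (event_type, state, path)

zh = zookeeper.init("localhost:2181", conn_watcher)
time.sleep(1)  # crude wait for the session to establish

# Register the node observer, then SIGSTOP/SIGCONT the server as described in
# the reproduction steps above; each suspend/resume delivers a session event
# to node_watcher.
zookeeper.exists(zh, "/", node_watcher)
time.sleep(600)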
[jira] Commented: (ZOOKEEPER-804) c unit tests failing due to "assertion cptr failed"
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918466#action_12918466 ]

Hudson commented on ZOOKEEPER-804:
----------------------------------

Integrated in ZooKeeper-trunk #958 (See [https://hudson.apache.org/hudson/job/ZooKeeper-trunk/958/])
    ZOOKEEPER-804. c unit tests failing due to "assertion cptr failed" (michi mutsuzaki via mahadev)

> c unit tests failing due to "assertion cptr failed"
> ----------------------------------------------------
>
> Key: ZOOKEEPER-804
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804
> Project: Zookeeper
> Issue Type: Bug
> Components: c client
> Affects Versions: 3.4.0
> Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel)
> Reporter: Patrick Hunt
> Assignee: Michi Mutsuzaki
> Priority: Critical
> Fix For: 3.3.2, 3.4.0
>
> Attachments: ZOOKEEPER-804.patch
>
> I'm seeing this frequently:
>
>     [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK
>     [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK
>     [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK
>     [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : elapsed 25687 : OK
>     [exec] zktest-mt: /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: zookeeper_process: Assertion `cptr' failed.
>     [exec] make: *** [run-check] Aborted
>     [exec] Zookeeper_simpleSystem::testHangingClient
>
> Mahadev can you take a look?