[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020441#comment-13020441
 ] 

Camille Fournier commented on ZOOKEEPER-1046:
---------------------------------------------

This one is a bear. I believe what is happening is the following:

Your code is creating and deleting large numbers of sequential nodes. At time 
T, it is in the process of deleting a bunch of nodes when ZK decides to take a 
snapshot of the state.

When we take a snapshot, we spawn a separate thread and serialize the nodes of 
the tree in that thread. We get into your /zkrsm node in 
DataTree.serializeNode, get that node from the tree, synchronize on it, and 
write out the record of that node including its current cversion (used to 
generate sequential node names) and the list of children. However, we 
then release the sync on that node, and attempt to iterate through the children 
to serialize them out. In the meantime, the other thread is merrily deleting 
children of this node, increasing the cversion of /zkrsm all the while. So the 
list of children that we got while serializing the parent is defunct. When we 
try to serialize these now-deleted children, we see that they are null and 
continue on.
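
Roughly, the shape of that serialization path is sketched below. This is a 
simplified standalone model, not the actual DataTree source; the Node class, 
the tree map, and the method names are just illustrative stand-ins for the 
real structures:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of the snapshot/delete race described above.
public class SnapshotRaceSketch {

    static class Node {
        int cversion;                      // bumped on every child create/delete
        final List<String> children = new ArrayList<>();
    }

    final Map<String, Node> tree = new ConcurrentHashMap<>();

    // Snapshot thread: walks the tree and writes each node out.
    void serializeNode(String path, StringBuilder snapshot) {
        Node node = tree.get(path);
        if (node == null) {
            return;                        // child deleted after its parent was written; skip it
        }
        List<String> childCopy;
        synchronized (node) {
            // cversion and the child list are captured atomically here...
            snapshot.append(path).append(" cversion=").append(node.cversion).append('\n');
            childCopy = new ArrayList<>(node.children);
        }
        // ...but the lock is released before the children are written, so the
        // request thread can keep deleting children (and bumping the parent's
        // cversion) after the now-stale value is already in the snapshot.
        for (String child : childCopy) {
            serializeNode(path + "/" + child, snapshot);
        }
    }

    // Request thread: deletes a child and bumps the parent's cversion.
    void deleteChild(String parentPath, String child) {
        Node parent = tree.get(parentPath);
        synchronized (parent) {
            parent.children.remove(child);
            parent.cversion++;             // this increment never makes it into the snapshot above
        }
        tree.remove(parentPath + "/" + child);
    }
}
{code}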

Now, you finish this snapshot, delete some more nodes under /zkrsm, create some 
more sequential nodes under /zkrsm, and crash. When you start back up again, 
you read that snapshot and start playing through the log transactions after the 
snapshot zxid. Unfortunately, the first N transactions in your log after the 
snapshot zxid are deletions of nodes that didn't make it into the snapshot 
because you deleted them before they could be serialized to the snapshot. We 
will try to process the delete transaction and get a NoNodeException, but 
ignore it because we know that can happen due to what I wrote above. But what 
we don't do is increment the cversion of the parent node after this failed 
deletion. So our parent's cversion ends up lower than it would be if you had 
played just the transaction log through, and lower than it was in the system 
before the crash. Now you want to continue creating sequential nodes where you 
left off, but your cversion is wrong, so you try to create a node that already 
exists. Whoops.
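
To make the failure concrete, here is a toy illustration (not server code) of 
why the stale counter bites, assuming the sequence suffix is derived from the 
parent's cversion and zero-padded to ten digits as in the reporter's node 
names; the numbers are borrowed from the stat output in the description:

{code:java}
import java.util.Locale;

// Toy illustration of how a stale parent cversion leads to NodeExists.
public class StaleCversionSketch {

    // Assumed scheme: the sequence suffix is the parent's cversion,
    // zero-padded to ten digits.
    static String sequentialName(String prefix, int parentCversion) {
        return prefix + String.format(Locale.ENGLISH, "%010d", parentCversion);
    }

    public static void main(String[] args) {
        String prefix = "/zkrsm/000000000000002d_record";

        // Before the crash the parent's cversion had advanced past 120804,
        // so a node with that suffix already exists on disk.
        System.out.println("already exists: " + sequentialName(prefix, 120804));

        // After restoring the snapshot, replayed deletes that hit NoNode never
        // bump the parent's cversion, which is left behind at 120710.
        int staleCversion = 120710;

        // New sequential creates therefore re-issue suffixes that were already
        // handed out before the crash, and the create fails with NodeExists.
        System.out.println("next create  : " + sequentialName(prefix, staleCversion));
    }
}
{code}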

So, now we just need to fix it. Should we be incrementing the cversion of the 
parent even on a NoNode exception during txn log replay? I suspect that is the 
right thing to do. Thoughts?
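
If we go that route, the replay-side change could look roughly like the 
following, reusing the illustrative Node/tree types from the first sketch 
rather than the real DataTree, with replayDelete standing in for wherever the 
delete txn actually gets applied:

{code:java}
// Sketch of the proposed fix, as a method on the SnapshotRaceSketch class
// above: a replayed delete bumps the surviving parent's cversion even when
// the node itself is already gone (NoNode).
void replayDelete(String path) {
    int slash = path.lastIndexOf('/');
    String parentPath = path.substring(0, slash);
    String childName = path.substring(slash + 1);

    tree.remove(path);                     // may already be absent: node never made the snapshot
    Node parent = tree.get(parentPath);
    if (parent == null) {
        return;                            // whole subtree is gone, nothing to adjust
    }
    synchronized (parent) {
        parent.children.remove(childName);
        // Increment unconditionally, so the counter ends up where it would be
        // if the entire transaction log had been applied to a full tree.
        parent.cversion++;
    }
}
{code}

The trade-off is that the counter stays consistent with a full replay, at the 
cost of bumping cversion for deletes whose targets (from the snapshot's point 
of view) never existed.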

> Creating a new sequential node results in a ZNODEEXISTS error
> -------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1046
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1046
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.2, 3.3.3
>         Environment: A 3 node-cluster running Debian squeeze.
>            Reporter: Jeremy Stribling
>              Labels: sequence
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1046.tgz
>
>
> On several occasions, I've seen a create() with the sequential flag set fail 
> with a ZNODEEXISTS error, and I don't think that should ever be possible.  In 
> past runs, I've been able to closely inspect the state of the system with the 
> command line client, and saw that the parent znode's cversion is smaller than 
> the sequence numbers of existing child znodes under that parent.  In one 
> example:
> {noformat}
> [zk:<ip:port>(CONNECTED) 3] stat /zkrsm
> cZxid = 0x5
> ctime = Mon Jan 17 18:28:19 PST 2011
> mZxid = 0x5
> mtime = Mon Jan 17 18:28:19 PST 2011
> pZxid = 0x1d819
> cversion = 120710
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 0
> numChildren = 2955
> {noformat}
> However, the znode /zkrsm/000000000000002d_record0000120804 existed on disk.
> In a recent run, I was able to capture the ZooKeeper logs, and I will attach 
> them to this JIRA.  The logs are named as nodeX.<zxid_prefixes>.log, and each 
> new log represents an application process restart.
> Here's the scenario:
> # There's a cluster with nodes 1,2,3 using zxid 0x3.
> # All three nodes restart, forming a cluster of zxid 0x4.
> # Node 3 restarts, leading to a cluster of 0x5.
> At this point, it seems like node 1 is the leader of the 0x5 epoch.  In its 
> log (node1.0x4-0x5.log) you can see the first (of many) instances of the 
> following message:
> {noformat}
> 2011-04-11 21:16:12,607 16649 [ProcessThread:-1] INFO 
> org.apache.zookeeper.server.PrepRequestProcessor  - Got user-level 
> KeeperException when processing sessionid:0x512f466bd44e0002 type:create 
> cxid:0x4da376ab zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error 
> Path:/zkrsm/00000000000000b2_record0001761440 Error:KeeperErrorCode = 
> NodeExists for /zkrsm/00000000000000b2_record0001761440
> {noformat}
> This then repeats forever as my application isn't expecting to ever get this 
> error message on a sequential node create, and just continually retries.  The 
> message even transfers over to node3.0x5-0x6.log once the 0x6 epoch comes 
> into play.
> I don't see anything terribly fishy in the transition between the epochs; the 
> correct snapshots seem to be getting transferred, etc.  Unfortunately I don't 
> have a ZK snapshot/log that exhibits the problem when starting with a fresh 
> system.
> Some oddities you might notice in these logs:
> * Between epochs 0x3 and 0x4, the zookeeper IDs of the nodes changed due to a 
> bug in our application code.  (They are assigned randomly, but are supposed 
> to be consistent across restarts.)
> * We manage node membership dynamically, and our application restarts the 
> ZooKeeperServer classes whenever a new node wants to join (without restarting 
> the entire application process).  This is why you'll see messages like the 
> following in node1.0x4-0x5.log before a new election begins:
> {noformat}
> 2011-04-11 21:16:00,762 4804 [QuorumPeer:/0.0.0.0:2888] INFO 
> org.apache.zookeeper.server.quorum.Learner  - shutdown called
> {noformat}
> * There is in fact one of these dynamic membership changes in 
> node1.0x4-0x5.log, just before the 0x4 epoch is formed.  I'm not sure how 
> this would be related though, as no transactions are done during this period.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
