[ 
https://issues.apache.org/jira/browse/COUCHDB-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965358#action_12965358
 ] 

Paul Joseph Davis commented on COUCHDB-968:
-------------------------------------------

@Bob

Responding to #2 first:

Consider these two orderings of events:

1. Create db1/foo and edit it more than rev_limit times. It now has history A-B-C
2. foo is replicated db1 -> db2 History: A-B-C
3. foo is replicated db2 -> db1 History: A-B-C
4. wait 3 seconds then repeat.

Here, all is hunky dory. Writing foo with an identical revision history results 
in a no-op more or less. The issue is from this progression:

1. Same as before, history is A-B-C
2. foo is replicated db1 -> db2 History: A-B-C
3. write to db1/foo, History: B-C-D
4. foo is replicated db2 -> db1 History: A-B-C

Here, step four is attempting to merge A-B-C and B-C-D, which results in a 
history of B-C'-D. C' is actually the same revision, but with a new doc pointer 
and high_seq in the doc_info record. Once this happens, it looks like a write 
(because NewRevTree == OldTree is false). This confusion is where the second 
update_seq is added, and then things start going downhill as described before.
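
To make the B-C'-D outcome concrete, here is a toy model of step four. It is 
not CouchDB's key tree code: the (position, rev_id, doc_pointer) layout, the 
pointer values, and the rule that an incoming entry with a body overwrites the 
stored entry are all assumptions chosen only to reproduce the behavior 
described above.

```python
# Toy model, not CouchDB's couch_key_tree. A history is a list of
# (position, rev_id, doc_pointer) entries, oldest first. The pointer values
# and the overwrite rule below are assumptions made for illustration.

def merge_linear(old, incoming, rev_limit):
    """Merge two overlapping linear histories and stem to rev_limit."""
    by_pos = {pos: (rev, ptr) for pos, rev, ptr in old}
    for pos, rev, ptr in incoming:
        # Assumed rule: an incoming entry that carries a body (ptr is not
        # None) replaces the stored entry, even when the rev_id is the same.
        if pos not in by_pos or ptr is not None:
            by_pos[pos] = (rev, ptr)
    merged = [(pos, rev, ptr) for pos, (rev, ptr) in sorted(by_pos.items())]
    return merged[-rev_limit:]  # stemming keeps the newest rev_limit entries


def rev_ids(history):
    """Return only the revision ids, ignoring doc pointers."""
    return [rev for _pos, rev, _ptr in history]


# db1 after step 3: history B-C-D with made-up doc pointers.
old_tree = [(2, "B", 20), (3, "C", 30), (4, "D", 40)]
# Step 4: db2 sends A-B-C back; only the leaf C carries a body.
incoming = [(1, "A", None), (2, "B", None), (3, "C", 99)]

new_tree = merge_linear(old_tree, incoming, rev_limit=3)
same_revisions = rev_ids(new_tree) == rev_ids(old_tree)
looks_like_write = new_tree != old_tree
```

In this model the revision ids are unchanged, yet the full tree comparison 
fails because C now carries a different pointer, which is the shape of the 
confusion described above.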

Tonight I plan on writing a specific test for this behavior without requiring 
replication (_bulk_docs interactive_edits=false) to demonstrate that I've got 
it figured out (or to show that I've got no idea what's going on).
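
A rough sketch of what such a test could send, assuming the HTTP spelling of 
the option, which is "new_edits": false in the _bulk_docs request body. The 
helper name, the doc id, and the short hashes a/b/c are placeholders, not 
values from this issue.

```python
import json


def replicated_write(doc_id, first_pos, rev_hashes, body=None):
    """Build a _bulk_docs payload that writes doc_id with an explicit history.

    rev_hashes lists the revision hashes oldest first; first_pos is the
    position of the oldest one. Hypothetical helper for illustration.
    """
    leaf_pos = first_pos + len(rev_hashes) - 1
    doc = dict(body or {})
    doc["_id"] = doc_id
    doc["_rev"] = "%d-%s" % (leaf_pos, rev_hashes[-1])
    # _revisions lists the hashes newest first, starting at the leaf position.
    doc["_revisions"] = {"start": leaf_pos, "ids": list(reversed(rev_hashes))}
    # new_edits=false asks CouchDB to store the given revisions as-is,
    # the way the replicator does, instead of generating a new revision.
    return {"new_edits": False, "docs": [doc]}


# History A-B-C for foo, as db2 would send it back in step 4.
payload = replicated_write("foo", 1, ["a", "b", "c"])
request_body = json.dumps(payload)
```

The resulting request_body would be POSTed to db1/_bulk_docs after db1/foo 
has already moved on to B-C-D.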

You'll notice the timing issue is in how the progression of edits is made with 
respect to the replication coming back.

Now, as to number 1, what you should see and what I was seeing is that db2 has 
the correct update_seq that you'd expect: N writes means update_seq=N. But db1 
has update_seq = N + some_random_number. That randomness is just in how these 
actual writes are ordered, but it's greater because of the 
history-merge-that-causes-spurious-writes (I'm pretty sure).


Your last point about reversing the order makes perfect sense because what's 
happening in that case is that CouchDB is just doing a normal edit more or 
less. I.e., a doc with history A-B-C that gets an edit with history B-C-D gets 
merged and stemmed correctly to B-C-D and all is hunky dory. It's of interest to 
note that your run-of-the-mill everyday PUT with the previous revision is the 
equivalent of doing A-B-C + C-D, which results in B-C-D.
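
The two merges mentioned here can be sketched on revision ids alone. This is a 
simplified model of a linear history with rev_limit 3, not the real tree 
merge; merge_and_stem is a made-up name.

```python
def merge_and_stem(stored, edit, rev_limit):
    """Append an overlapping edit path to a linear history, then stem.

    stored and edit are lists of revision ids, oldest first. The edit must
    start at a revision that is already present in stored.
    """
    overlap_at = stored.index(edit[0])
    shared = len(stored) - overlap_at
    if stored[overlap_at:] != edit[:shared]:
        raise ValueError("edit does not extend the stored history")
    merged = stored[:overlap_at] + edit
    return merged[-rev_limit:]  # drop the oldest revisions beyond rev_limit


stored = ["A", "B", "C"]
# An edit that arrives with history B-C-D.
from_history = merge_and_stem(stored, ["B", "C", "D"], rev_limit=3)
# An ordinary PUT against the previous revision: C-D.
from_put = merge_and_stem(stored, ["C", "D"], rev_limit=3)
```

Both calls end at the same stemmed history, B-C-D.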


I've not yet decided who the real culprit is. I can't point at any of the 
various places and say that it's exactly the bug. Only that the bug is the 
interaction of these two bits under these circumstances. Fixing it could go a 
number of directions and I haven't managed to calibrate my compass for the new 
timezone just yet.

> Duplicated IDs in _all_docs
> ---------------------------
>
>                 Key: COUCHDB-968
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-968
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.10.1, 0.10.2, 0.11.1, 0.11.2, 1.0, 1.0.1, 1.0.2
>         Environment: Ubuntu 10.04.
>            Reporter: Sebastian Cohnen
>            Priority: Blocker
>
> We have a database, which is causing serious trouble with compaction and 
> replication (huge memory and cpu usage, often causing couchdb to crash b/c 
> all system memory is exhausted). Yesterday we discovered that db/_all_docs is 
> reporting duplicated IDs (see [1]). Until a few minutes ago we thought that 
> there are only few duplicates but today I took a closer look and I found 10 
> IDs which sum up to a total of 922 duplicates. Some of them have only 1 
> duplicate, others have hundreds.
> Some facts about the database in question:
> * ~13k documents, with 3-5k revs each
> * all duplicated documents are in conflict (with 1 up to 14 conflicts)
> * compaction is run on a daily basis
> * several thousands updates per hour
> * multi-master setup with pull replication from each other
> * delayed_commits=false on all nodes
> * used couchdb versions 1.0.0 and 1.0.x (*)
> Unfortunately the database's contents are confidential and I'm not allowed to 
> publish it.
> [1]: Part of http://localhost:5984/DBNAME/_all_docs
> ...
> {"id":"9997","key":"9997","value":{"rev":"6096-603c68c1fa90ac3f56cf53771337ac9f"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> ...
> [*]
> There were two (old) servers (1.0.0) in production (already having the 
> replication and compaction issues). Then two servers (1.0.x) were added and 
> replication was set up to bring them in sync with the old production servers 
> since the two new servers were meant to replace the old ones (to update 
> node.js application code among other things).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
