[ 
https://issues.apache.org/jira/browse/COUCHDB-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964632#action_12964632
 ] 

Paul Joseph Davis commented on COUCHDB-968:
-------------------------------------------

<brain_dump> 
So I spent some time today tracking this down. Here are some notes. 

The multiple entries in _all_docs are a bit of a red herring. Yes, it's 
something we should investigate preventing in the future, but it's just an 
expression of the underlying cause. 

What happens is that somehow multiple update_seq entries are getting inserted 
into the database's update_seq btree for a single document id. When compaction 
runs, it just iterates over this btree and writes the docs to disk. This means 
that it'll just write multiple docs to that tree. If we write multiple rows in 
a single btree query_modify call, it's possible that we end up with multiple 
rows with identical keys (which is bad). 
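As a rough illustration (plain Python rather than CouchDB's Erlang, with all 
names invented), here is a sketch of how compaction copying the by-seq tree 
verbatim turns duplicate update_seq entries into duplicate _all_docs rows:

```python
# Toy model of compaction: fold over the by-seq btree and write each
# entry into the rebuilt by-id index. Nothing deduplicates on doc id,
# so two update_seq entries for one id yield two rows in _all_docs.
def compact(by_seq_entries):
    new_by_id = []  # stand-in for the rebuilt by-id btree
    for seq, doc_id in sorted(by_seq_entries):
        new_by_id.append((doc_id, seq))  # identical keys are not merged
    return new_by_id

# Corrupted state: doc "9999" has two entries in the by-seq tree.
rows = compact([(6096, "9997"), (6090, "9999"), (6097, "9999")])
dup = [r for r in rows if r[0] == "9999"]
print(len(dup))  # 2 -> "9999" shows up twice in _all_docs
```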

The real issue is how we end up with multiple update_seq entries for a given 
doc id. This is where the replication and rev_stemming come in. Once a 
document's revision history length has been exceeded, there's apparently a way 
for two update_seqs to get inserted. After some digging, I've found out that 
what happens is that couch_db_updater:update_docs_int ends up trying to remove 
an update_seq that doesn't exist. Once this happens we have two update_seqs 
for a single doc id. 
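A minimal sketch of that failure mode (illustrative Python with invented 
names; the real code lives in couch_db_updater): a btree delete of a key that 
isn't there is a silent no-op, so removing the wrong old update_seq leaves the 
stale entry behind:

```python
def update_by_seq(by_seq, old_seq, new_seq, doc_id):
    """by_seq maps update_seq -> doc_id. Mirrors a btree batch update:
    deleting a missing key does nothing rather than raising."""
    by_seq.pop(old_seq, None)  # no-op if old_seq isn't actually present
    by_seq[new_seq] = doc_id
    return by_seq

# Healthy update: the doc's current seq (7) is removed, 12 inserted.
print(update_by_seq({7: "doc_a"}, old_seq=7, new_seq=12, doc_id="doc_a"))
# {12: 'doc_a'}

# Buggy update: we computed the wrong old seq (5), so 7 survives.
print(update_by_seq({7: "doc_a"}, old_seq=5, new_seq=12, doc_id="doc_a"))
# {7: 'doc_a', 12: 'doc_a'} -> two update_seqs for one doc id
```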

So, the next question is how we screw up figuring out which update_seq to 
delete. 

The code in question would appear to be trying to delete the previous 
update_seq which gets taken from the full_doc_info record. At this point, my 
exact understanding of the events gets a bit hazy, so bear with me. 

What I think is happening is that a document with a full revision history gets 
written out due to an interactive edit (i.e., one that would fail with a 
conflict). Then when the replicator attempts to write (in a manner that merges 
key trees, i.e., no conflicts are possible), what happens is that it gets a 
bit confused. For instance: 

Given that the interactive edit resulted in a revision history of B-C-D, when 
the replicator then attempts to write a doc with history A-B-C, it gets 
confused about whether it's writing a new doc or not. At this point I get a 
bit lost. Somehow a second edit comes in, and the update_seq on the 
full_doc_info record that gets looked up is newer than it should be, whereas 
the entry in the update_seq btree is older; hence duplicate rows, hence 
compaction gives multiple docs in _all_docs. 

Etc etc. 

I'm flying tomorrow so I'll have more time to investigate the exact 
consequences of these various bits if no one beats me to it. If someone wants 
to take a crack at this, the next place to start digging is at the bottom of 
couch_db_updater:merge_rev_trees, where it attempts to compare the new and old 
revision trees to decide whether it should update the update_seq in the 
full_doc_info record. Specifically, I think we need to reevaluate the 
NewRevTree == OldTree comparison in the last if-statement, as it appears the 
absolute root cause of this bug is that comparison evaluating to false when it 
should be true. 
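To make that suspicion concrete, here is a hedged sketch (plain Python with 
invented names, linear paths standing in for full rev trees) of how a 
replicated path that reaches below the stemmed root can make the merged tree 
compare unequal to the old one even though the leaf revision is unchanged:

```python
def merge_linear(stored, incoming):
    """Merge two linear rev paths that overlap. If the incoming path
    reaches below the stored tree's (stemmed) root, the extra ancestors
    get grafted back on."""
    if stored[0] in incoming:
        i = incoming.index(stored[0])
        return incoming[:i] + list(stored)
    return list(stored)

stored = ["B", "C", "D"]    # on-disk history, stemmed; leaf rev is D
incoming = ["A", "B", "C"]  # replicated path carrying stemmed-out rev A
merged = merge_linear(stored, incoming)
print(merged)               # ['A', 'B', 'C', 'D']
print(merged == stored)     # False, even though the leaf is still D
```

If the real comparison trips on a structural difference like this, the updater 
would treat an unchanged doc as edited, which fits the symptoms above.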
</brain_dump>

> Duplicated IDs in _all_docs
> ---------------------------
>
>                 Key: COUCHDB-968
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-968
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.10.1, 0.10.2, 0.11.1, 0.11.2, 1.0, 1.0.1, 1.0.2
>         Environment: Ubuntu 10.04.
>            Reporter: Sebastian Cohnen
>            Priority: Blocker
>
> We have a database which is causing serious trouble with compaction and 
> replication (huge memory and CPU usage, often causing CouchDB to crash b/c 
> all system memory is exhausted). Yesterday we discovered that db/_all_docs is 
> reporting duplicated IDs (see [1]). Until a few minutes ago we thought that 
> there were only a few duplicates, but today I took a closer look and found 10 
> IDs which sum up to a total of 922 duplicates. Some of them have only 1 
> duplicate, others have hundreds.
> Some facts about the database in question:
> * ~13k documents, with 3-5k revs each
> * all duplicated documents are in conflict (with 1 up to 14 conflicts)
> * compaction is run on a daily basis
> * several thousand updates per hour
> * multi-master setup with pull replication from each other
> * delayed_commits=false on all nodes
> * used CouchDB versions 1.0.0 and 1.0.x (*)
> Unfortunately the database's contents are confidential and I'm not allowed 
> to publish them.
> [1]: Part of http://localhost:5984/DBNAME/_all_docs
> ...
> {"id":"9997","key":"9997","value":{"rev":"6096-603c68c1fa90ac3f56cf53771337ac9f"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> ...
> [*]
> There were two (old) servers (1.0.0) in production (already having the 
> replication and compaction issues). Then two servers (1.0.x) were added and 
> replication was set up to bring them in sync with the old production servers 
> since the two new servers were meant to replace the old ones (to update 
> node.js application code among other things).

