[ https://issues.apache.org/jira/browse/COUCHDB-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964632#action_12964632 ]
Paul Joseph Davis commented on COUCHDB-968:
-------------------------------------------

<brain_dump>

So I spent some time today tracking this down. Here are some notes.

The multiple entries in _all_docs are a bit of a red herring. Yes, it's something we should investigate preventing in the future, but it's just an expression of the underlying cause. What happens is that somehow multiple update_seq entries are getting inserted into the database's update_seq btree for a single document id. When compaction runs it just iterates over this btree and writes the docs to disk. This means that it'll just write multiple docs to that tree. If we write multiple rows in a single btree query_modify call, it's possible that we end up with multiple rows with identical keys (which is bad).

The real issue is how we end up with multiple update_seq entries for a given doc id. This is where the replication and rev_stemming come in. Once a document's revision length has been exceeded, there's apparently a way for two update_seqs to get inserted. After some digging, I've found out that what happens is that couch_db_updater:update_docs_int ends up trying to remove an update_seq that doesn't exist. Once this happens we have two update_seqs for a single doc id.

So, the next question is how we screw up figuring out which update_seq to delete. The code in question would appear to be trying to delete the previous update_seq, which gets taken from the full_doc_info record.

At this point, my exact understanding of the events gets a bit hazy, so bear with me. What I think is happening is that a document with a full revision history gets written out due to an interactive edit (i.e., one that would fail with a conflict). Then when the replicator attempts to write (in a manner that merges key trees, i.e., no conflicts are possible), it gets a bit confused.
For instance: given an interactive edit that resulted in a revision history of B-C-D, if the replicator then attempts to write a doc with history A-B-C, it gets confused about whether it's writing a new doc or not.

At this point I get a bit lost. Somehow a second edit comes in and the update_seq on the full_doc_info object that gets looked up is newer than it should be, whereas the entry in the update_seq btree is older; hence duplicate rows, hence compaction gives multiple docs in _all_docs. Etc etc.

I'm flying tomorrow so I'll have more time to investigate the exact consequences of these various bits if no one beats me to it. If someone wants to take a crack at this, the next place to start digging is at the bottom of couch_db_updater:merge_rev_trees, where it attempts to compare the new and old revision trees to decide whether it should update the update_seq in the full_doc_info record. Specifically, I think we need to reevaluate the NewRevTree == OldTree comparison in the last if-statement, as it appears the absolute root cause of this bug is that comparison evaluating false when it should be true.

</brain_dump>

> Duplicated IDs in _all_docs
> ---------------------------
>
>                 Key: COUCHDB-968
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-968
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.10.1, 0.10.2, 0.11.1, 0.11.2, 1.0, 1.0.1, 1.0.2
>         Environment: Ubuntu 10.04.
>            Reporter: Sebastian Cohnen
>            Priority: Blocker
>
> We have a database, which is causing serious trouble with compaction and
> replication (huge memory and cpu usage, often causing couchdb to crash b/c
> all system memory is exhausted). Yesterday we discovered that db/_all_docs is
> reporting duplicated IDs (see [1]). Until a few minutes ago we thought that
> there are only few duplicates but today I took a closer look and I found 10
> IDs which sum up to a total of 922 duplicates. Some of them have only 1
> duplicate, others have hundreds.
>
> Some facts about the database in question:
> * ~13k documents, with 3-5k revs each
> * all duplicated documents are in conflict (with 1 up to 14 conflicts)
> * compaction is run on a daily basis
> * several thousand updates per hour
> * multi-master setup with pull replication from each other
> * delayed_commits=false on all nodes
> * used couchdb versions 1.0.0 and 1.0.x (*)
>
> Unfortunately the database's contents are confidential and I'm not allowed to
> publish it.
>
> [1]: Part of http://localhost:5984/DBNAME/_all_docs
> ...
> {"id":"9997","key":"9997","value":{"rev":"6096-603c68c1fa90ac3f56cf53771337ac9f"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> ...
>
> [*]
> There were two (old) servers (1.0.0) in production (already having the
> replication and compaction issues). Then two servers (1.0.x) were added and
> replication was set up to bring them in sync with the old production servers
> since the two new servers were meant to replace the old ones (to update
> node.js application code among other things).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
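
The failure mode described in the comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not CouchDB's code: a plain Python dict stands in for the update_seq btree, and the names `update_doc`, `compact`, and `recorded_old_seq` are hypothetical. It shows how removing an update_seq that is not actually in the index leaves the stale entry behind, so that a compaction pass walking the index emits the same doc id twice.

```python
# Toy model only: a dict stands in for the update_seq btree.
# All function and parameter names are hypothetical, not CouchDB's API.

def update_doc(by_seq, doc_id, recorded_old_seq, new_seq):
    """Write doc_id at new_seq after removing the seq we believe is current."""
    # Removing a seq that is not in the index is a silent no-op here,
    # which is what lets a stale entry survive.
    by_seq.pop(recorded_old_seq, None)
    by_seq[new_seq] = doc_id


def compact(by_seq):
    """Walk the by-seq index in order and emit one row per entry."""
    return [by_seq[seq] for seq in sorted(by_seq)]


by_seq = {}
update_doc(by_seq, "9999", None, 1)  # first write lands at seq 1

# The recorded previous seq (2) is newer than the entry the index
# actually holds (1), so the removal misses and seq 1 stays behind.
update_doc(by_seq, "9999", 2, 3)

rows = compact(by_seq)
print(rows)  # the same doc id appears once per surviving index entry
```

In this model the duplicate only goes away if the recorded previous seq matches the entry the index really holds, which is consistent with the comment's suggestion to look at how the update_seq in the full_doc_info record gets decided.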