[ https://issues.apache.org/jira/browse/COUCHDB-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bob Dionne updated COUCHDB-988: ------------------------------- Attachment: 988.patch more refactoring of couch_key_tree, new etaps, test for couchdb-902 > Rewrite couch_key_tree.erl > -------------------------- > > Key: COUCHDB-988 > URL: https://issues.apache.org/jira/browse/COUCHDB-988 > Project: CouchDB > Issue Type: Improvement > Components: Database Core > Reporter: Paul Joseph Davis > Priority: Minor > Attachments: 988.patch > > > The current key tree module is a fairly complicated piece of code with enough > subtlety to fill three romance novels. This ticket is a proposal to rewrite > the current module with an algorithm that will be more easy to reason about. > Here I'll write a brief explanation of the current algorithms, and then a > short proposal for a simpler algorithm. > A key tree is used to represent the edit history of a document. Each node of > the tree represents a particular version of the document. Relations between > nodes represent the order that these edits were applied. For instance, a > simple tree with a single linear path A->B->C means that the edit C was based > on the version B which was in turn based on A. In a world without replication > (and no ability to disable MVCC checks), all histories would be forced to be > linear lists of edits due to constraints imposed by MVCC (ie, new edits must > be based on the current version). However, we have replication, so we must > deal with not so easy cases. > Consider a document in state A. This doc is replicated to a second node. We > then edit the document one each node leaving it in two different states, B > and C. We now have two key trees, A->B and A->C. When we go to replicate a > second time, the key tree must combine these two trees which gives us > A->(B|C). For the astute reader, this is how conflicts are introduced. In > terms of the key tree, we say that we have two leaves (B and C) that are not > deleted. Hence, conflict. To remove a conflict, one of the edits (B or C) can > be deleted, which results in, A->(B|C->D) where D is an edit that is > specially marked with a deleted=true flag. > Also, of note which will help us down the line, remember that there's another > completely different possible mode of operation. Given two couchdb instances, > we could *create* two different docs with the same id. Now we have two > documents with edit histories E and F. Now after replication we have edit > history (E|D) but really, this isn't a tree. Its two roots to two different > trees. Our algorithms still work, we just have to consider multiple trees > when we apply them. Remember this after this next aside. > Hopefully from that brief description of simple operations its fairly > intuitive how we can end up with arbitrarily deep and branching trees due to > distributed edits. Of note here is that this operation parallels the nature > of Git's commit model. Each commit sha1 depends on the parent sha1 creating a > sort of inverted merkle tree. > At this point I'll mention a quick point on the implementation of these > trees. In Erlang, they are implemented as a list of nested tuples with the > roots being the most shallow node. For instance, A->B->C could be represented > as roughly, [{A, [{B, [{C}]}]}] which in english: "a single element list of a > two element tuple, where the first element is the key, and the second element > is a single element list of a two tuple....turtles." > The reason I make this note on implementation is that it should be apparent > that the depth of this data structure is going to be linearly dependent on > the edit history of a document. As we found out quickly, Erlang also > apparently has some core C functions that are written recursively. As > everyone knows about recursion in C, there is a point where one more turtle > is one too many. Or, in other words, we run out of room on the stack to > allocate more frames. > At this point we had a couple options. Patch Erlang (we did, it was rejected > IIRC). Change our representation (could work, but would break some of the > algorithmic assumptions, especially with how Erlang's SSA works). Or, add > pruning to the revision history. This is what revision stemming is. > Revision stemming in a nutshell means "Keep only the last N edits". But then > its a tree, so its actually, "keep the last N edits per leaf." The tradeoff > here is that stemming may introduce a conflict by accident. For instance, > Given A->B on one node, replicate to a second, edit it so that its > A->B->C->D. Then applying stemming with an N of 2, we end up with A->B on > node 1, and C->D on node 2. After replication, this turns into (A->B|C->D) > ie, we weren't able to see that C was a descendant of B. This is generally > considered an acceptable tradeoff. You just need to make sure you replicate > more often than the N rev_stemming edits. > The second important note about rev_stemming is that it changes our storage > of trees. The current implementation keeps track of how many edits have been > discarded. Ie, in revisions you see in documents of the format "Number-Hash", > that number is basically how deep that revision is in the path from root. We > then leverage this information when merging stemmed trees so that we know how > deep we need to look to start matching edits. > One final minor piece of implementation. Each key in the tree also carries a > value. This can either be a document body about to be written, or perhaps the > location in the file where that revision can be read. > The bug that was uncovered in COUCHDB-968 was that the start-of-tree position > information was used in such a way that we ended up choosing the wrong value. > Ie, we were choosing to re-write the same document instead of discarding it > for the version that was already on disk. This ended up cascading through > about 3 critical sections causing the end result of dupes in _changes and > _all_docs after compaction. > The fix that Adam has put together was to first fix the decision making > process by making a specific preference for which tree's value is used for > identical revisions. Due to more than a few subtleties this has take a bit to > get right with all the intervening key tree code. Hence, my desire to rewrite > it. > The last thing I'll write up is an off the cuff approach that I thought of > this morning. I don't necessarily think this is the only way to rewrite the > key tree, but it seems easier (before coding). If anyone has ideas for other > approaches that's just as well as the only goal here is to reduce the > complexity factor. > The current code was originally designed around the concept of rooted trees > of edits. Ie, Git. As we added features like stemming, the actual problem has > shifted towards considering a larger set of possibly unrelated trees. If we > add one constraint, it is possible for us to reconsider the algorithms in > terms of more general graph approaches. That one extra constraint: > 1. keys (ie, revisions) must be unique within the set of trees (ie, > single docid). > The general algorithm for merging N trees is something like: > 1. For each tree, create a list of parent/child edges. > 2. Find the union of all edges. > 3. Rebuild the tree. > There'd be more to work out of course. Depth would disappear, which would > affect revision patterns visible to the client. Though BigCouch has changed > revisions as well. I'm sure there are probably other things to work out. > So, that's all I've got right now. Its a bit of a twinkle in the eye, but I > figured I'd get my thoughts on the current code down on paper so I don't have > to re-think them in the future. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.