[ 
https://issues.apache.org/jira/browse/COUCHDB-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Dionne updated COUCHDB-988:
-------------------------------

    Attachment: 988.patch

more refactoring of couch_key_tree, new etaps, test for couchdb-902

> Rewrite couch_key_tree.erl
> --------------------------
>
>                 Key: COUCHDB-988
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-988
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Database Core
>            Reporter: Paul Joseph Davis
>            Priority: Minor
>         Attachments: 988.patch
>
>
> The current key tree module is a fairly complicated piece of code with enough 
> subtlety to fill three romance novels. This ticket is a proposal to rewrite 
> the current module with an algorithm that will be more easy to reason about. 
> Here I'll write a brief explanation of the current algorithms, and then a 
> short proposal for a simpler algorithm.
> A key tree is used to represent the edit history of a document. Each node of 
> the tree represents a particular version of the document. Relations between 
> nodes represent the order that these edits were applied. For instance, a 
> simple tree with a single linear path A->B->C means that the edit C was based 
> on the version B which was in turn based on A. In a world without replication 
> (and no ability to disable MVCC checks), all histories would be forced to be 
> linear lists of edits due to constraints imposed by MVCC (ie, new edits must 
> be based on the current version). However, we have replication, so we must 
> deal with not so easy cases.
> Consider a document in state A. This doc is replicated to a second node. We 
> then edit the document one each node leaving it in two different states, B 
> and C. We now have two key trees, A->B and A->C. When we go to replicate a 
> second time, the key tree must combine these two trees which gives us 
> A->(B|C). For the astute reader, this is how conflicts are introduced. In 
> terms of the key tree, we say that we have two leaves (B and C) that are not 
> deleted. Hence, conflict. To remove a conflict, one of the edits (B or C) can 
> be deleted, which results in, A->(B|C->D) where D is an edit that is 
> specially marked with a deleted=true flag.
> Also, of note which will help us down the line, remember that there's another 
> completely different possible mode of operation. Given two couchdb instances, 
> we could *create* two different docs with the same id. Now we have two 
> documents with edit histories E and F. Now after replication we have edit 
> history (E|D) but really, this isn't a tree. Its two roots to two different 
> trees. Our algorithms still work, we just have to consider multiple trees 
> when we apply them. Remember this after this next aside.
> Hopefully from that brief description of simple operations its fairly 
> intuitive how we can end up with arbitrarily deep and branching trees due to 
> distributed edits. Of note here is that this operation parallels the nature 
> of Git's commit model. Each commit sha1 depends on the parent sha1 creating a 
> sort of inverted merkle tree.
> At this point I'll mention a quick point on the implementation of these 
> trees. In Erlang, they are implemented as a list of nested tuples with the 
> roots being the most shallow node. For instance, A->B->C could be represented 
> as roughly, [{A, [{B, [{C}]}]}] which in english: "a single element list of a 
> two element tuple, where the first element is the key, and the second element 
> is a single element list of a two tuple....turtles."
> The reason I make this note on implementation is that it should be apparent 
> that the depth of this data structure is going to be linearly dependent on 
> the edit history of a document. As we found out quickly, Erlang also 
> apparently has some core C functions that are written recursively. As 
> everyone knows about recursion in C, there is a point where one more turtle 
> is one too many. Or, in other words, we run out of room on the stack to 
> allocate more frames.
> At this point we had a couple options. Patch Erlang (we did, it was rejected 
> IIRC). Change our representation (could work, but would break some of the 
> algorithmic assumptions, especially with how Erlang's SSA works). Or, add 
> pruning to the revision history. This is what revision stemming is.
> Revision stemming in a nutshell means "Keep only the last N edits". But then 
> its a tree, so its actually, "keep the last N edits per leaf." The tradeoff 
> here is that stemming may introduce a conflict by accident. For instance, 
> Given A->B on one node, replicate to a second, edit it so that its 
> A->B->C->D. Then applying stemming with an N of 2, we end up with A->B on 
> node 1, and C->D on node 2. After replication, this turns into (A->B|C->D) 
> ie, we weren't able to see that C was a descendant of B. This is generally 
> considered an acceptable tradeoff. You just need to make sure you replicate 
> more often than the N rev_stemming edits.
> The second important note about rev_stemming is that it changes our storage 
> of trees. The current implementation keeps track of how many edits have been 
> discarded. Ie, in revisions you see in documents of the format "Number-Hash", 
> that number is basically how deep that revision is in the path from root. We 
> then leverage this information when merging stemmed trees so that we know how 
> deep we need to look to start matching edits.
> One final minor piece of implementation. Each key in the tree also carries a 
> value. This can either be a document body about to be written, or perhaps the 
> location in the file where that revision can be read.
> The bug that was uncovered in COUCHDB-968 was that the start-of-tree position 
> information was used in such a way that we ended up choosing the wrong value. 
> Ie, we were choosing to re-write the same document instead of discarding it 
> for the version that was already on disk. This ended up cascading through 
> about 3 critical sections causing the end result of dupes in _changes and 
> _all_docs after compaction.
> The fix that Adam has put together was to first fix the decision making 
> process by making a specific preference for which tree's value is used for 
> identical revisions. Due to more than a few subtleties this has take a bit to 
> get right with all the intervening key tree code. Hence, my desire to rewrite 
> it.
> The last thing I'll write up is an off the cuff approach that I thought of 
> this morning. I don't necessarily think this is the only way to rewrite the 
> key tree, but it seems easier (before coding). If anyone has ideas for other 
> approaches that's just as well as the only goal here is to reduce the 
> complexity factor.
> The current code was originally designed around the concept of rooted trees 
> of edits. Ie, Git. As we added features like stemming, the actual problem has 
> shifted towards considering a larger set of possibly unrelated trees. If we 
> add one constraint, it is possible for us to reconsider the algorithms in 
> terms of more general graph approaches. That one extra constraint:
>     1. keys (ie, revisions) must be unique within the set of trees (ie, 
> single docid).
> The general algorithm for merging N trees is something like:
>     1. For each tree, create a list of parent/child edges.
>     2. Find the union of all edges.
>     3. Rebuild the tree.
> There'd be more to work out of course. Depth would disappear, which would 
> affect revision patterns visible to the client. Though BigCouch has changed 
> revisions as well. I'm sure there are probably other things to work out.
> So, that's all I've got right now. Its a bit of a twinkle in the eye, but I 
> figured I'd get my thoughts on the current code down on paper so I don't have 
> to re-think them in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to