Riyad,

You're welcome. At a quick glance your post has one error: internal nodes do contain values (from the reductions). The appendix in the CouchDB book also makes this error [1], which I've opened a ticket for.
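For anyone skimming the thread, here is a toy sketch of that distinction (illustrative Python, not CouchDB's actual Erlang records): internal nodes carry a reduction value computed over each subtree, while only the leaves carry the document pointers.

```python
# Illustrative only -- CouchDB's real node structures are Erlang terms;
# the shapes and names below are invented for this sketch.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Leaf:
    # key -> byte offset of the doc info in the file (hypothetical)
    entries: List[Tuple[str, int]]

@dataclass
class Internal:
    # Each child pointer is stored WITH the reduction over that subtree,
    # which is why internal nodes are not "pointers only".
    children: List[Tuple[object, int]]

def reduce_count(node) -> int:
    """A 'count' style reduction, recomputed bottom-up on every write."""
    if isinstance(node, Leaf):
        return len(node.entries)
    return sum(r for _, r in node.children)

leaf_a = Leaf(entries=[("a", 100), ("b", 340)])
leaf_b = Leaf(entries=[("c", 800)])
root = Internal(children=[(leaf_a, reduce_count(leaf_a)),
                          (leaf_b, reduce_count(leaf_b))])
print(reduce_count(root))  # 3 -- the root's reduction covers all docs
```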
Cheers,

Bob

[1] https://github.com/oreilly/couchdb-guide/issues/450

On Dec 21, 2011, at 3:28 PM, Riyad Kalla wrote:

> Bob,
>
> Really appreciate the link; Rick has a handful of articles that helped a lot.
>
> Alongside all the CouchDB reading, I've been looking at SSD-optimized data storage mechanisms and tried to coalesce all of this information into this post on Couch's file storage format:
> https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv
>
> It is uncanny how many things Couch seems to have gotten right with regard to existing storage systems and future flash-based storage systems. I'd appreciate any corrections, additions or feedback on the post from anyone interested.
>
> Best,
> R
>
> On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne <[email protected]> wrote:
>
>> I think this is largely correct, Riyad. I dug out an old article [1] by Rick Ho that you may also find helpful, though it might be slightly dated. Generally the best performance is had when the ids are sequential and updates are done in bulk. Write-heavy applications will eat up a lot of space and require compaction. At the leaf nodes, what are stored are either full_doc_info or doc_info records, which hold pointers to the data; so the main things that impact the branching at each level are the key size and, in the case of views, the sizes of the reductions, since these are stored with the intermediate nodes.
>>
>> All in all it works pretty well, but as always you need to test and evaluate it for your specific case to see what the limits are.
>>
>> Regards,
>>
>> Bob
>>
>> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>>
>> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
>>
>>> Adding to this conversation, I found this set of slides by Chris explaining the append-only index update format:
>>> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
>>>
>>> Specifically slides 16, 17 and 18.
>>> Using this example tree, rewriting the updated path (in reverse order) and appending it to the end of the file makes sense... you can see how index queries can simply read backwards from the end of the file and find not only the latest revisions of docs, but also every other doc that wasn't touched (the query will just seek into the existing inner nodes of the B+ tree while searching).
>>>
>>> What I am hoping for clarification on is the following pain point that I perceive with this approach:
>>>
>>> 1. In a sufficiently shallow B+ tree (like CouchDB's), the paths to elements are short (typically no more than 3 to 5 levels deep), as opposed to a trie or some other construct that would have much longer paths to elements.
>>>
>>> 2. Because the depth of the tree is so shallow, its breadth becomes large to compensate... more specifically, each internal node can have 100s, 1000s or more children. Using the example slides, consider the nodes [A...M] and [R...Z] -- in a good-sized CouchDB database, those internal index nodes would have 100s (or more) elements in them, pointing at deeper internal nodes that themselves have thousands of elements, instead of the 13 or so implied by [A...M].
>>>
>>> 3. Looking at slides 17 and 18, where the direct B+ tree path to the updated node gets appended to the end of the file after the revision is written (leaf-to-root ordering: [J' M] -> [A M] -> [A Z]), it implies that those internal nodes, with *all* their child elements, are getting rewritten as well.
>>>
>>> In this example tree it isn't such a big issue... but in a sufficiently large CouchDB database, these nodes denoted by [A...M] and [A...Z] could be quite large...
>>> I don't know the format of the node elements in the B+ tree, but it would be whatever the size of a node element is, times however many elements are contained at each level (1 for root, say 100 for level 2, 1,000 for level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on here; of course it depends on the size of the data store).
>>>
>>> Am I missing something, or is CouchDB really rewriting that much index information between document revisions on every update?
>>>
>>> What was previously confusing me is that I thought it was *only* rewriting a direct path to the updated revision, like [B] -> [E] -> [J'], and Couch was somehow patching that updated path info into the B+ index at runtime.
>>>
>>> If Couch is rewriting entire node paths with all their elements, then I am no longer confused about the B+ index updates, but am curious about the on-disk cost of this.
>>>
>>> In my own rough insertion testing, that would explain why I see my collections absolutely explode in size until they are compacted (not using bulk insert, but intentionally doing single inserts for millions of docs to see what the cost of the index path duplication would be like).
>>>
>>> Can anyone confirm/deny/correct this assessment? I want to make sure I am on the right track in understanding this.
>>>
>>> Best wishes,
>>> Riyad
>>>
>>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <[email protected]> wrote:
>>>
>>>> @Filipe - I was just not clear on how CouchDB operated; you and Robert cleared that up for me. Thank you.
>>>>
>>>> @Robert - The writeup is excellent so far (I am not familiar with Erlang, so there is a bit of stickiness there); thank you for taking the time to put this together!
>>>>
>>>> At this point I am curious how the _id and _seq indices are read, as their data is continually appended to the end of the data file in small diff-trees for every updated doc.
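To put rough numbers on the hand-waving in the thread: if each rewritten node on the path really carried all of its children, the per-update cost would be roughly the path depth times the fan-out times the entry size. A sketch, where every number is a made-up assumption in the spirit of the 1 / 100 / 1,000 / 10,000 figures above:

```python
# Hypothetical: an update rewrites ONE node per level, and (under the
# hypothesis being questioned) each rewritten node contains ALL of its
# children. Entry size and fan-outs are invented for illustration.
entry_size = 40                 # assumed bytes per (key, child-pointer) entry
fanout = [100, 100, 100, 100]   # children held by the one rewritten node per level

bytes_per_update = sum(f * entry_size for f in fanout)
print(bytes_per_update)  # 16000 -> ~16 KB of index rewritten per update
```

Even with these modest assumptions, millions of single-doc updates would copy a great deal of index data, which is exactly the file-growth-until-compaction effect described above.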
>>>> If CouchDB kept all the indices in memory and simply patched in the updated paths at runtime (maybe something akin to memory-mapped indices in MongoDB), I would be fairly clear on the operation... but as I understand it, CouchDB keeps such a small memory footprint by doing no in-memory caching and relying on the intelligence of the OS and filesystem (and/or drives) to cache frequently accessed data.
>>>>
>>>> I am trying to understand the logic used by CouchDB to answer a query using the index once updates to the tree have been appended to the data file... for example, consider a CouchDB datastore like the one Filipe has... 10 million documents, and let's say it is freshly compacted.
>>>>
>>>> If I send a request to that Couch instance, it hits the header of the data file along with the index and walks the B+ tree to the leaf node, where it finds the offset into the data file where the actual doc lives... let's say 1,000,000 bytes away.
>>>>
>>>> These B+ trees are shallow, so it might look something like this:
>>>>
>>>> Level 1: 1 node, the root node.
>>>> Level 2: 100 nodes, inner child nodes.
>>>> Level 3: 10,000 nodes, inner child nodes.
>>>> Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets in the data file.
>>>>
>>>> Now let's say I write 10 updates to documents in that file. There are 10 new revisions appended to the end of the data file, *each one* separated by a rewritten B+ path to a leaf node with its new location at the end of the file. Each of those paths written between the doc revisions (roughly 2k, like Filipe mentioned) is just a 4-item path... root (level 1) -> level 2 -> level 3 -> level 4 (leaf)... the discrete path from the root to that individual updated doc. The intermediary levels are not fully fleshed out with all the OTHER children from the original B+ tree index.
>>>> [Is this correct so far? If not, please point out my mistakes...]
>>>>
>>>> Now I issue a query for a document that WAS NOT updated...
>>>>
>>>> *** this is where I get confused on the logic ***
>>>>
>>>> This would mean I need to access the original B+ tree index at the root of the data file, because the revised B+ paths that are written between each of the updated doc revisions at the end of the file are not full indices.
>>>>
>>>> NOW consider that I want to query for one of the changed docs... now I suddenly need to scan backwards from the data file's end to find the updated path to the new revision of that document.
>>>>
>>>> (Obviously) this isn't what Couch is actually doing... it's doing something more elegant; I just can't figure out what or how, and that is what I was hoping for help with.
>>>>
>>>> Much thanks guys, I know this is a heavy question to ask.
>>>>
>>>> Best wishes,
>>>> R
>>>>
>>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <[email protected]> wrote:
>>>>
>>>>> Robert Dionne
>>>>> Computer Programmer
>>>>> [email protected]
>>>>> 203.231.9961
>>>>>
>>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>>>>>
>>>>>> Filipe,
>>>>>>
>>>>>> Thank you for the reply.
>>>>>>
>>>>>> Maybe I am misunderstanding exactly what Couch is writing out; the docs I've read say that it "rewrites the root node" -- I can't tell if the docs mean the parent node of the child doc that was changed (as one of the B+ leaves), or the direct path from the root to that individual doc... or the *entire* index...
>>>>>>
>>>>>> Even in the case of rewriting just the single parent, with such a shallow tree each internal node will have a huge fan of children; let's say 1-10k in a decent-sized data set.
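A minimal copy-on-write sketch may help anchor the question being asked here. This is a toy model of the append-only scheme from the slides, not CouchDB's implementation: each update appends a new leaf plus new copies of its ancestors, the new copies point back at the old offsets of untouched siblings, and reads always start from the most recently written root, so no backward scanning is needed for either updated or untouched docs.

```python
# Toy append-only "file": a list of nodes; a list index stands in for a
# byte offset. Node shapes and names are invented for this sketch.
log = []

def append(node):                  # "write" a node, return its offset
    log.append(node)
    return len(log) - 1

# A node is ("leaf", {key: doc}) or ("node", [(max_key, child_offset)]).
leaf1 = append(("leaf", {"a": "doc-a", "b": "doc-b"}))
leaf2 = append(("leaf", {"x": "doc-x", "y": "doc-y"}))
root = append(("node", [("m", leaf1), ("z", leaf2)]))   # first root

# Update doc "x": append a NEW right leaf and a NEW root. The new root
# reuses the OLD offset of the untouched left leaf -- only the changed
# path (leaf plus ancestors) is rewritten, in leaf-to-root order.
leaf2b = append(("leaf", {"x": "doc-x-rev2", "y": "doc-y"}))
root = append(("node", [("m", leaf1), ("z", leaf2b)]))  # latest root

def lookup(root_off, key):
    """Every read starts from the most recently written root; old and new
    nodes alike are reached by following offsets down from that root."""
    kind, body = log[root_off]
    while kind == "node":
        # descend into the first child whose max key covers `key`
        child_off = next(off for max_key, off in body if key <= max_key)
        kind, body = log[child_off]
    return body[key]

print(lookup(root, "x"))  # doc-x-rev2  (new path)
print(lookup(root, "a"))  # doc-a       (old leaf, reached via the new root)
```

The design choice this illustrates: because the newest root carries offsets to every subtree, old or new, there is no "patching" of the original index at runtime; the original root simply becomes garbage to be reclaimed at compaction.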
>>>>>> If you are seeing only a few K of extra data written out after each changed doc, then that cannot be right... I almost assumed my understanding was wrong, because the sheer volume of data would make performance abysmal if it were true.
>>>>>>
>>>>>> Given that... is it just the changed path, from the root to the new leaf, that is rewritten?
>>>>>
>>>>> Hi Riyad,
>>>>>
>>>>> You are correct, it's only the changed path. Interestingly, I've just started to document all these internals [1], along with links to the code and other references available.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Bob
>>>>>
>>>>> [1] http://bdionne.github.com/couchdb/
>>>>>
>>>>>> That makes me all sorts of curious as to how Couch updates/searches the newly modified index with the small diff that is written out.
>>>>>>
>>>>>> Any pointers to reading that will help me dig into this (even early bugs in JIRA?) would be appreciated. I've tried skimming back to 2007/08 on Damien's blog to see if he wrote about it in depth, and so far haven't found anything as detailed as I am hoping for on this architecture.
>>>>>>
>>>>>> Best,
>>>>>> Riyad
>>>>>>
>>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <[email protected]> wrote:
>>>>>>
>>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <[email protected]> wrote:
>>>>>>>> I've been reading everything I can find on the CouchDB file format [1] and am getting bits and pieces here and there, but not a great, concrete, step-by-step explanation of the process.
>>>>>>>>
>>>>>>>> I'm clear on the use of B+ trees, and after reading a few papers on the benefits of log-structured file formats, I understand the benefits of inlining the B+ tree indices directly into the data file as well (locality + sequential I/O)...
>>>>>>>> what I'm flummoxed about is how much of the B+ tree's index is rewritten after every modified document.
>>>>>>>>
>>>>>>>> Consider a CouchDB file that looks more or less like this:
>>>>>>>>
>>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]...
>>>>>>>>
>>>>>>>> After each revised doc is written and the "b-tree root" is rewritten after it, is that just a modified root node of the B+ tree, or the entire B+ tree?
>>>>>>>>
>>>>>>>> The reason I ask is because, regardless of the answer to my previous question, for a *huge* database with millions of records that seems like an enormous amount of data to rewrite after every modification. Say the root node had a fanning factor of 133; that would still be a lot of data to rewrite.
>>>>>>>
>>>>>>> Hi Riyad,
>>>>>>>
>>>>>>> Have you observed that in practice?
>>>>>>>
>>>>>>> Typically the depth of database btrees is not that high, even for millions of documents. For example, I have one around with about 10 million documents which doesn't have more than 5 or 6 levels, if I recall correctly.
>>>>>>>
>>>>>>> So updating a doc, for that particular case, means rewriting 5 or 6 new nodes plus the document itself. Each node is normally not much bigger than 1.2 KB.
>>>>>>>
>>>>>>> I once wrote a tool to analyze database files which reports btree depths; however, it's not updated to work with recent changes on master/1.2.x, such as snappy compression and btree sizes:
>>>>>>>
>>>>>>> https://github.com/fdmanana/couchfoo
>>>>>>>
>>>>>>> It should work with CouchDB 1.1 (and older) database files.
>>>>>>>
>>>>>>>> I am certain I am missing the boat on this; if anyone can pull me out of the water and point me to dry land, I'd appreciate it.
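Taking Filipe's figures at face value, the arithmetic per update is small, but it compounds across many single-doc updates, which matches the "collections explode until compacted" observation earlier in the thread. The update count and the exact depth below are assumptions for illustration:

```python
# Filipe's reported numbers, restated as a back-of-the-envelope estimate.
depth = 6            # upper end of his 5-6 btree levels for ~10M docs
node_size = 1200     # each rewritten node ~1.2 KB
updates = 1_000_000  # hypothetical run of single-doc updates

per_update = depth * node_size            # index bytes rewritten per update
total_gb = per_update * updates / 1e9     # growth before compaction, docs excluded
print(per_update)          # 7200 -> ~7 KB of index per update
print(round(total_gb, 1))  # 7.2  -> ~7.2 GB of stale index after 1M updates
```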
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> R
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> -- http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
>>>>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>>>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
>>>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
>>>>>>>> -- http://ayende.com/blog/* (over my head)
>>>>>>>
>>>>>>> --
>>>>>>> Filipe David Manana,
>>>>>>>
>>>>>>> "Reasonable men adapt themselves to the world.
>>>>>>> Unreasonable men adapt the world to themselves.
>>>>>>> That's why all progress depends on unreasonable men."
