It reads well as an article but needs some polish before it could be a great wiki page. I suggest that if it does go up, it be clearly marked as a draft and we all chip in to sculpt it into shape.
In particular, the author is very enthusiastic, but this mars the article (the all-caps, the ad-hoc methods of emphasis). There are a few parts of the article that are inaccurate (like the assertion that we have good locality for the id and seq trees. If this were true, we wouldn't have seen such a huge improvement in compaction by temporarily separating them). The 'COMPLETE recreation' paragraph also strikes me as factually awry. B. On 22 December 2011 18:52, Riyad Kalla <rka...@gmail.com> wrote: > Jan, > > Thank you and yes, I'd be happy to contribute this to the wiki. > > I made some edits early this morning after some feedback. If a few folks > in-the-know could give it a quick read-through to make sure I didn't get > anything wrong then I'm happy to work it up on the wiki as well (or send it > along to the wiki's editor for inclusion). > > Just point the way. > > Best, > Riyad > > On Thu, Dec 22, 2011 at 2:21 AM, Jan Lehnardt <j...@apache.org> wrote: > >> Good writeup! Would you consider contributing it to the CouchDB Wiki? >> >> http://wiki.apache.org/couchdb/ >> >> Cheers >> Jan >> -- >> >> On Dec 21, 2011, at 21:28 , Riyad Kalla wrote: >> >> > Bob, >> > >> > Really appreciate the link; Rick has a handful of articles that helped a >> > lot. >> > >> > Alongside all the CouchDB reading I've been looking at SSD-optimized >> data >> > storage mechanisms and tried to coalesce all of this information into >> this >> > post on Couch's file storage format: >> > https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv >> > >> > It is uncanny how many things Couch seems to have gotten right with >> regard >> > to existing storage systems and future flash-based storage systems. I'd >> > appreciate any corrections, additions or feedback to the post for anyone >> > interested. 
>> > >> > Best, >> > R >> > >> > On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne < >> > dio...@dionne-associates.com> wrote: >> > >> >> I think this is largely correct, Riyad. I dug out an old article[1] by >> Rick >> >> Ho that you may also find helpful, though it might be slightly dated. >> >> Generally the best performance will be had if the ids are sequential and >> >> updates are done in bulk. Write-heavy applications will eat up a lot of >> >> space and require compaction. At the leaf nodes what are stored are >> either >> >> full_doc_info records or doc_info records, which store pointers to the data, >> >> so the main things that impact the branching at each level are the key >> size >> >> and, in the case of views, the sizes of the reductions, as these are stored >> >> with the intermediate nodes. >> >> >> >> All in all it works pretty well, but as always you need to test and >> >> evaluate it for your specific case to see what the limits are. >> >> >> >> Regards, >> >> >> >> Bob >> >> >> >> >> >> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html >> >> >> >> >> >> >> >> >> >> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote: >> >> >> >>> Adding to this conversation, I found this set of slides by Chris >> >> explaining >> >>> the append-only index update format: >> >>> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed >> >>> >> >>> Specifically slides 16, 17 and 18. >> >>> >> >>> Using this example tree, rewriting the updated path (in reverse order) >> >>> appended to the end of the file makes sense... you see how index >> queries >> >>> can simply read backwards from the end of the file and not only find >> the >> >>> latest revisions of docs, but also every other doc that wasn't touched >> >> (it >> >>> will just seek into the existing inner nodes of the b+ tree for >> >> searching). >> >>> >> >>> What I am hoping for clarification on is the following pain-point that >> I >> >>> perceive with this approach: >> >>> >> >>> 1. 
In a sufficiently shallow B+ tree (like CouchDB), the paths >> themselves >> >>> to elements are short (typically no more than 3 to 5 levels deep) as >> >>> opposed to a trie or some other construct that would have much longer >> >> paths >> >>> to elements. >> >>> >> >>> 2. Because the depth of the tree is so shallow, the breadth of it >> becomes >> >>> large to compensate... more specifically, each internal node can have >> >> 100s, >> >>> 1000s or more children. Using the example slides, consider the nodes >> >>> [A...M] and [R...Z] -- in a good-sized CouchDB database, those internal >> >>> index nodes would have 100s (or more) elements in them pointing at >> deeper >> >>> internal nodes that themselves had thousands of elements; instead of >> the >> >> 13 >> >>> or so as implied by [A...M]. >> >>> >> >>> 3. Looking at slides 17 and 18, where you see the direct B+ tree path to >> >> the >> >>> update node getting appended to the end of the file after the revision >> is >> >>> written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies >> that >> >>> those internal nodes with *all* their child elements are getting >> >> rewritten >> >>> as well. >> >>> >> >>> In this example tree, it isn't such a big issue... but in a >> >> sufficiently >> >>> large CouchDB database, these nodes denoted by [A...M] and [A...Z] >> could >> >> be >> >>> quite large... I don't know the format of the node elements in the B+ >> >> tree, >> >>> but it would be whatever the size of a node is times however many >> >> elements >> >>> are contained at each level (1 for root, say 100 for level 2, 1000 for >> >>> level 3 and 10,000 for level 4 -- there is a lot of hand-waving going >> on >> >>> here, of course it depends on the size of the data store). >> >>> >> >>> Am I missing something or is CouchDB really rewriting that much index >> >>> information between document revisions on every update? 
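The worry in point 3 can be put into rough numbers. Here is a back-of-envelope sketch of what rewriting *fully populated* internal nodes on every update would cost; every figure (fan-outs, entry size) is hypothetical, mirroring the hand-waving in the text:

```python
# If each rewritten node really carried ALL of its children, the cost of
# one update would be the sum over the path of (fan-out x entry size).
# Hypothetical numbers: 1 root, 100 children at level 2, 1000 at level 3,
# 10,000 at level 4, with an assumed 40 bytes per key+offset entry.

KEY_POINTER_BYTES = 40                 # assumed size of one key+offset pair
fanouts = [100, 1000, 10000]           # entries in each rewritten node, root first

bytes_per_update = sum(f * KEY_POINTER_BYTES for f in fanouts)
print(f"~{bytes_per_update / 1024:.0f} KiB of index rewritten per single-doc update")
```

Hundreds of KiB per single-document update would indeed be abysmal, which is the intuition behind the question; the later replies show the real cost is far smaller because only the path, not the full child lists' subtrees, is copied.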
>> >>> >> >>> What was previously confusing me is I thought it was *only* rewriting a >> >>> direct path to the updated revision, like [B]>[E]>[J'] and Couch was >> >>> somehow patching in that updated path info to the B+ index at runtime. >> >>> >> >>> If Couch is rewriting entire node paths with all their elements then I am >> >>> no longer confused about the B+ index updates, but am curious about the >> >>> on-disk cost of this. >> >>> >> >>> In my own rough insertion testing, that would explain why I see my >> >>> collections absolutely explode in size until they are compacted (not >> >> using >> >>> bulk insert, but intentionally doing single inserts for millions of >> >>> docs to see what kind of cost the index path duplication would be like). >> >>> >> >>> Can anyone confirm/deny/correct this assessment? I want to make sure I am >> >>> on the right track understanding this. >> >>> >> >>> Best wishes, >> >>> Riyad >> >>> >> >>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rka...@gmail.com> wrote: >> >>> >> >>>> @Filipe - I was just not clear on how CouchDB operated; you and Robert >> >>>> cleared that up for me. Thank you. >> >>>> >> >>>> @Robert - The writeup is excellent so far (I am not familiar with >> >> erlang, >> >>>> so there is a bit of stickiness there), thank you for taking the time >> to >> >>>> put this together! >> >>>> >> >>>> At this point I am curious how the _id and _seq indices are read as >> >> their >> >>>> data is continually appended to the end of the data file in small >> >>>> diff-trees for every updated doc. >> >>>> >> >>>> If CouchDB kept all the indices in-memory and simply patched-in the >> >>>> updated paths at runtime (maybe something akin to memory-mapped >> indices >> >> in >> >>>> MongoDB) I would be fairly clear on the operation... 
but as I >> understand >> >>>> it, CouchDB keeps such a small memory footprint by doing no in-memory >> >>>> caching and relying on the intelligence of the OS and filesystem >> (and/or >> >>>> drives) to cache frequently accessed data. >> >>>> >> >>>> I am trying to understand the logic used by CouchDB to answer a query >> >>>> using the index once updates to the tree have been appended to the >> data >> >>>> file... for example, consider a CouchDB datastore like the one Filipe >> >>>> has... 10 million documents and let's say it is freshly compacted. >> >>>> >> >>>> If I send in a request to that Couch instance, it hits the header of >> the >> >>>> data file along with the index and walks the B+ tree to the leaf node, >> >>>> where it finds the offset into the data file where the actual doc >> >> lives... >> >>>> let's say 1,000,000 bytes away. >> >>>> >> >>>> These B+ trees are shallow, so it might look something like this: >> >>>> >> >>>> Level 1: 1 node, root node. >> >>>> Level 2: 100 nodes, inner child nodes >> >>>> Level 3: 10,000 nodes, inner child nodes >> >>>> Level 4: 1,000,000 leaf nodes... all with pointers to the data >> offsets >> >> in >> >>>> the data file. >> >>>> >> >>>> Now let's say I write 10 updates to documents in that file. There are >> 10 >> >>>> new revisions appended to the end of the data file, *each one* >> separated >> >> by >> >>>> a rewritten B+ path to a leaf node with its new location at the end >> of >> >> the >> >>>> file. Each of those paths written between each doc revision (say >> >> roughly 2k >> >>>> like Filipe mentioned) are just 4-item paths... root -> level1 -> >> >> level2 -> >> >>>> level3 --> level4... showing the discrete path from the root to that >> >>>> individual updated doc. The intermediary levels (l1, 2, 3) are not >> fully >> >>>> flushed out with all the OTHER children from the original b+ tree >> index. >> >>>> 
>> >>>> [[ is this correct so far? If not, please point out my mistakes... ]] 
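The scheme being described here can be sketched in a few lines. This is a toy model, not CouchDB's actual Erlang internals: the record layout is invented, and a Python list stands in for the append-only file with list indexes playing the role of byte offsets. The point it illustrates is that each rewritten node on the path stores *offsets* to unchanged nodes, so untouched siblings are referenced rather than copied:

```python
# Toy model of an append-only B+ tree update. Rewritten nodes reference
# unchanged nodes by offset, so an update appends only the root-to-leaf
# path (plus the new doc and header), never whole subtrees.

storage = []                          # the append-only "file"

def append(record):
    storage.append(record)
    return len(storage) - 1           # the record's "offset"

# freshly "compacted" tree: two leaves, one root, one header
leaf_ab = append({"type": "leaf", "docs": {"a": "doc_a_v1", "b": "doc_b_v1"}})
leaf_cd = append({"type": "leaf", "docs": {"c": "doc_c_v1", "d": "doc_d_v1"}})
root = append({"type": "node", "children": {"b": leaf_ab, "d": leaf_cd}})
header = append({"type": "header", "root": root})

# update doc "c": append a new leaf, a new root, and a new header;
# the new root still points at the OLD offset of the untouched leaf_ab
new_cd = append({"type": "leaf", "docs": {"c": "doc_c_v2", "d": "doc_d_v1"}})
new_root = append({"type": "node", "children": {"b": leaf_ab, "d": new_cd}})
header = append({"type": "header", "root": new_root})

def lookup(key):
    """Walk the tree from the latest header; old and new nodes mix freely."""
    node = storage[storage[header]["root"]]
    while node["type"] != "leaf":
        # descend into the first child whose separator key covers `key`
        offset = next(off for sep, off in sorted(node["children"].items()) if key <= sep)
        node = storage[offset]
    return node["docs"][key]

print(lookup("c"))   # updated doc, found via the newly appended path
print(lookup("a"))   # untouched doc, found via the old leaf
```

Because every query starts from the most recent header/root, there is nothing to "patch in" at runtime: the latest root transparently mixes new node offsets with old ones.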
>> >>>> >> >>>> Now I issue a query for a document that WAS NOT updated... >> >>>> >> >>>> **** this is where I get confused on the logic *** >> >>>> >> >>>> this would mean I need to access the original B+ tree index at the >> root >> >> of >> >>>> the data file, because the revised B+ paths that are written between >> >> each >> >>>> of the updated doc revisions at the end of the file are not full >> >> indices. >> >>>> >> >>>> NOW consider I want to query for one of the changed docs... now I >> >> suddenly >> >>>> need to scan backwards from the data file's end to find the updated >> >> path to >> >>>> the new revision of that document. >> >>>> >> >>>> (obviously) this isn't what Couch is actually doing... it's doing >> >>>> something more elegant, I just can't figure out what or how and that >> is >> >>>> what I was hoping for help with. >> >>>> >> >>>> Much thanks guys, I know this is a heavy question to ask. >> >>>> >> >>>> Best wishes, >> >>>> R >> >>>> >> >>>> >> >>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne < >> >>>> dio...@dionne-associates.com> wrote: >> >>>> >> >>>>> >> >>>>> Robert Dionne >> >>>>> Computer Programmer >> >>>>> dio...@dionne-associates.com >> >>>>> 203.231.9961 >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote: >> >>>>> >> >>>>>> Filipe, >> >>>>>> >> >>>>>> Thank you for the reply. >> >>>>>> >> >>>>>> Maybe I am misunderstanding exactly what couch is writing out; the >> >> docs >> >>>>>> I've read say that it "rewrites the root node" -- I can't tell if >> the >> >>>>> docs >> >>>>>> mean the parent node of the child doc that was changed (as one of >> the >> >> b+ >> >>>>>> leaves) or if it means the direct path, from the root, to that >> >>>>> individual >> >>>>>> doc... or if it means the *entire* index... 
>> >>>>>> >> >>>>>> In the case of even rewriting the single parent, with such a shallow >> >>>>> tree, >> >>>>>> each internal node will have a huge fan-out of children; let's say 1-10k >> in a >> >>>>>> decent sized data set. >> >>>>>> >> >>>>>> If you are seeing a few K of extra written out after each changed >> doc >> >>>>> then >> >>>>>> that cannot be right... I almost assumed my understanding was wrong >> >>>>> because >> >>>>>> the sheer volume of data would make performance abysmal if it were >> true. >> >>>>>> >> >>>>>> Given that... is it just the changed path, from the root to the new >> leaf >> >>>>>> that is rewritten? >> >>>>> >> >>>>> Hi Riyad, >> >>>>> >> >>>>> You are correct, it's only the changed path. Interestingly I've just >> >>>>> started to document all these internals[1] along with links to the >> >> code and >> >>>>> other references available. >> >>>>> >> >>>>> Cheers, >> >>>>> >> >>>>> Bob >> >>>>> >> >>>>> >> >>>>> [1] http://bdionne.github.com/couchdb/ >> >>>>> >> >>>>>> That makes me all sorts of curious as to how Couch >> >>>>>> updates/searches the new modified index with the small diff that is >> >>>>> written >> >>>>>> out. >> >>>>>> >> >>>>>> Any pointers to reading that will help me dig down on this (even >> early >> >>>>> bugs >> >>>>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08 >> on >> >>>>>> Damien's blog to see if he wrote about it in depth and so far >> haven't >> >>>>> found >> >>>>>> anything as detailed as I am hoping for on this architecture. 
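One consequence of "only the changed path" worth noting: since the newest header/root is always near the end of the file, opening the database (or recovering after a crash) reduces to scanning backwards for the last intact header. A hedged sketch of that recovery scan follows; the record layout and `valid` flag are invented for illustration (CouchDB does something in this spirit, looking for headers at block boundaries):

```python
# Sketch of crash recovery in an append-only file: scan backwards from
# the end until the most recent intact header is found. A torn write at
# the tail simply means falling back to the previous good header; no
# data already referenced by that header is ever lost.

def find_latest_header(records):
    """records: the file's contents as a list of dicts, oldest first."""
    for rec in reversed(records):
        if rec.get("type") == "header" and rec.get("valid", True):
            return rec
    raise IOError("no intact header found")

recs = [
    {"type": "header", "root": 2},
    {"type": "doc", "id": "a"},
    {"type": "header", "root": 3},
    {"type": "doc", "id": "b"},
    {"type": "header", "root": 9, "valid": False},   # torn write at crash
]
print(find_latest_header(recs)["root"])              # skips the torn header
```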
>> >>>>>> Best, >> >>>>>> Riyad >> >>>>>> >> >>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana < >> >>>>> fdman...@apache.org> wrote: >> >>>>>> >> >>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rka...@gmail.com> >> >> wrote: >> >>>>>>>> I've been reading everything I can find on the CouchDB file >> >> format[1] >> >>>>> and >> >>>>>>>> am getting bits and pieces here and there, but not a great, >> >> concrete, >> >>>>>>>> step-by-step explanation of the process. >> >>>>>>>> >> >>>>>>>> I'm clear on the use of B+ trees and after reading a few papers on >> >> the >> >>>>>>>> benefits of log-structured file formats, I understand the benefits >> >> of >> >>>>>>>> inlining the B+ tree indices directly into the data file as well >> >>>>>>> (locality >> >>>>>>>> + sequential I/O)... what I'm flummoxed about is how much of the >> B+ >> >>>>>>> tree's >> >>>>>>>> index is rewritten after every modified document. >> >>>>>>>> >> >>>>>>>> Consider a CouchDB file that looks more or less like this: >> >>>>>>>> >> >>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2].... >> >>>>>>>> >> >>>>>>>> After each revised doc is written and the "b-tree root" is >> rewritten >> >>>>>>> after >> >>>>>>>> that, is that just a modified root node of the B+ tree or the >> entire >> >>>>> B+ >> >>>>>>>> tree? >> >>>>>>>> >> >>>>>>>> The reason I ask is because regardless of the answer to my >> previous >> >>>>>>>> question, for a *huge* database with millions of records, that >> seems >> >>>>> like >> >>>>>>>> an enormous amount of data to rewrite after every modification. >> Say >> >>>>> the >> >>>>>>>> root node had a fanning factor of 133; that would still be a lot of >> >>>>> data >> >>>>>>> to >> >>>>>>>> rewrite. >> >>>>>>> >> >>>>>>> Hi Riyad, >> >>>>>>> >> >>>>>>> Have you observed that in practice? >> >>>>>>> >> >>>>>>> Typically the depth of database btrees is not that high even for >> >>>>>>> millions of documents. 
For example I have one around with about 10 >> >>>>>>> million documents which doesn't have more than 5 or 6 levels if I >> >>>>>>> recall correctly. >> >>>>>>> >> >>>>>>> So updating a doc, for that particular case, means rewriting 5 or 6 >> >>>>>>> new nodes plus the document itself. Each node is normally not much >> >>>>>>> bigger than 1.2Kb. >> >>>>>>> >> >>>>>>> I once wrote a tool to analyze database files which reports >> btree >> >>>>>>> depths; however, it's not updated to work with recent changes on >> >>>>>>> master/1.2.x such as snappy compression and btree sizes: >> >>>>>>> >> >>>>>>> https://github.com/fdmanana/couchfoo >> >>>>>>> >> >>>>>>> It should work with CouchDB 1.1 (and older) database files. >> >>>>>>> >> >>>>>>>> >> >>>>>>>> I am certain I am missing the boat on this; if anyone can pull me >> out >> >>>>> of >> >>>>>>>> the water and point me to dry land I'd appreciate it. >> >>>>>>>> >> >>>>>>>> Best, >> >>>>>>>> R >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> [1] >> >>>>>>>> -- >> >>>>>>>> >> >>>>>>> >> >>>>> >> >> >> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D >> >>>>>>>> -- >> http://horicky.blogspot.com/2008/10/couchdb-implementation.html >> >>>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked >> >>>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html >> >>>>>>>> -- http://ayende.com/blog/* (Over my head) >> >>>>>>> >> >>>>>>> >> >>>>>>> -- >> >>>>>>> Filipe David Manana, >> >>>>>>> >> >>>>>>> "Reasonable men adapt themselves to the world. >> >>>>>>> Unreasonable men adapt the world to themselves. >> >>>>>>> That's why all progress depends on unreasonable men." >> >>>>> >> >>>>> >> >>>> >> >> >> >> >> >>
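Filipe's figures make the earlier size question concrete. Below is a worked version of the arithmetic, treating the quoted ~1.2 Kb node size and 5-6 level depth as given; the million-update extrapolation at the end is a rough illustration (my own, not from the thread's measurements) of why files with many single-doc updates balloon until compaction:

```python
# Per-update index overhead = btree depth x node size: each update
# appends a fresh copy of every node on the root-to-leaf path.
# Numbers are the approximate ones quoted in the thread.

NODE_BYTES = 1.2 * 1024          # ~1.2 Kb per rewritten btree node

for depth in (5, 6):
    per_update = depth * NODE_BYTES
    print(f"depth {depth}: ~{per_update / 1024:.1f} KiB of index appended per update")

# rough extrapolation: why single-doc inserts "explode" before compaction
updates = 1_000_000
total = updates * 6 * NODE_BYTES
print(f"{updates:,} single-doc updates: ~{total / 2**30:.1f} GiB of mostly-stale index data")
```

A few KiB per update is cheap for any one write, but the superseded copies of path nodes accumulate in the file until compaction rewrites it with a single live tree, which matches Riyad's observation about collections exploding in size under single-insert workloads.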