Thank you Robert, fixed.

On Wed, Dec 21, 2011 at 1:42 PM, Robert Dionne <[email protected]> wrote:
> Riyad,
>
> You're welcome. At a quick glance your post has one error: internal
> nodes do contain values (from the reductions). The appendix in the
> CouchDB book also makes this error[1], which I've opened a ticket for.
>
> Cheers,
>
> Bob
>
> [1] https://github.com/oreilly/couchdb-guide/issues/450
>
> On Dec 21, 2011, at 3:28 PM, Riyad Kalla wrote:
>
> > Bob,
> >
> > Really appreciate the link; Rick has a handful of articles that
> > helped a lot.
> >
> > Alongside all the CouchDB reading I've been looking at SSD-optimized
> > data storage mechanisms and tried to coalesce all of this information
> > into this post on Couch's file storage format:
> > https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv
> >
> > It is uncanny how many things Couch seems to have gotten right with
> > regard to existing storage systems and future flash-based storage
> > systems. I'd appreciate any corrections, additions or feedback on the
> > post from anyone interested.
> >
> > Best,
> > R
> >
> > On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne
> > <[email protected]> wrote:
> >
> >> I think this is largely correct, Riyad. I dug out an old article[1]
> >> by Rick Ho that you may also find helpful, though it might be
> >> slightly dated. Generally the best performance will be had if the
> >> ids are sequential and updates are done in bulk. Write-heavy
> >> applications will eat up a lot of space and require compaction. At
> >> the leaf nodes what are stored are either full_doc_info records or
> >> doc_info records, which store pointers to the data, so the main
> >> things that impact the branching at each level are the key size and,
> >> in the case of views, the sizes of the reductions, as these are
> >> stored with the intermediate nodes.
> >>
> >> All in all it works pretty well, but as always you need to test and
> >> evaluate it for your specific case to see what the limits are.
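Bob's description of the node contents above (doc_info records at the leaves, reduction values stored alongside the child pointers in the inner nodes) can be sketched roughly as follows. This is a hypothetical Python model for illustration only, not CouchDB's actual Erlang record layout; the field names and the count-style reduction are assumptions:

```python
# Hypothetical model (NOT CouchDB's actual Erlang records) of the node
# shapes described above: leaves hold doc-info records pointing into
# the file; inner nodes carry a reduction value next to each child.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DocInfo:
    doc_id: str
    rev: int
    body_offset: int      # where the document body lives in the file

@dataclass
class LeafNode:
    entries: List[DocInfo] = field(default_factory=list)

@dataclass
class InnerNode:
    keys: List[str]       # separator keys; key size affects fan-out
    children: List[int]   # file offsets of child nodes
    reductions: List[int] # per-subtree reduction values, stored inline

# A count-style reduction over a whole subtree is answered straight
# from the inner node, without visiting any leaves:
inner = InnerNode(keys=["m"], children=[4096, 8192], reductions=[120, 85])
total_docs = sum(inner.reductions)
print(total_docs)  # 205
```

This is also why reduction size affects branching, as Bob notes: the bigger each stored reduction value, the fewer entries fit in an inner node.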
> >> Regards,
> >>
> >> Bob
> >>
> >> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>
> >> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
> >>
> >>> Adding to this conversation, I found this set of slides by Chris
> >>> explaining the append-only index update format:
> >>> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
> >>>
> >>> Specifically slides 16, 17 and 18.
> >>>
> >>> Using this example tree, rewriting the updated path (in reverse
> >>> order) appended to the end of the file makes sense... you see how
> >>> index queries can simply read backwards from the end of the file
> >>> and not only find the latest revisions of docs, but also every
> >>> other doc that wasn't touched (it will just seek into the existing
> >>> inner nodes of the b+ tree for searching).
> >>>
> >>> What I am hoping for clarification on is the following pain point
> >>> that I perceive with this approach:
> >>>
> >>> 1. In a sufficiently shallow B+ tree (like CouchDB's), the paths to
> >>> elements are short (typically no more than 3 to 5 levels deep), as
> >>> opposed to a trie or some other construct that would have much
> >>> longer paths to elements.
> >>>
> >>> 2. Because the depth of the tree is so shallow, its breadth becomes
> >>> large to compensate... more specifically, each internal node can
> >>> have 100s, 1000s or more children. Using the example slides,
> >>> consider the nodes [A...M] and [R...Z] -- in a good-sized CouchDB
> >>> database, those internal index nodes would have 100s (or more)
> >>> elements in them pointing at deeper internal nodes that themselves
> >>> had thousands of elements, instead of the 13 or so implied by
> >>> [A...M].
> >>>
> >>> 3. Looking at slides 17 and 18, where you see the direct B+ tree
> >>> path to the updated node getting appended to the end of the file
> >>> after the revision is written (leaf-to-root ordering:
> >>> [J' M] -> [A M] -> [A Z]), it implies that those internal nodes,
> >>> with *all* their child elements, are getting rewritten as well.
> >>>
> >>> In this example tree it isn't such a big issue... but in a
> >>> sufficiently large CouchDB database, the nodes denoted by [A...M]
> >>> and [A...Z] could be quite large... I don't know the format of the
> >>> node elements in the B+ tree, but it would be whatever the size of
> >>> a node is times however many elements are contained at each level
> >>> (1 for root, say 100 for level 2, 1000 for level 3 and 10,000 for
> >>> level 4 -- there is a lot of hand-waving going on here; of course
> >>> it depends on the size of the data store).
> >>>
> >>> Am I missing something, or is CouchDB really rewriting that much
> >>> index information between document revisions on every update?
> >>>
> >>> What was previously confusing me is that I thought it was *only*
> >>> rewriting a direct path to the updated revision, like [B]>[E]>[J'],
> >>> and Couch was somehow patching that updated path info into the B+
> >>> index at runtime.
> >>>
> >>> If Couch is rewriting entire node paths with all their elements
> >>> then I am no longer confused about the B+ index updates, but am
> >>> curious about the on-disk cost of this.
> >>>
> >>> In my own rough insertion testing, that would explain why I see my
> >>> collections absolutely explode in size until they are compacted
> >>> (not using bulk insert, but intentionally doing single inserts for
> >>> millions of docs to see what kind of cost the index path
> >>> duplication would be like).
> >>>
> >>> Can anyone confirm/deny/correct this assessment? I want to make
> >>> sure I am on the right track understanding this.
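The on-disk cost being estimated here can be put into rough numbers. The sketch below uses made-up figures (entry size and fan-outs are assumptions, not measured CouchDB values); it only illustrates the point that appending a fresh copy of every node on the root-to-leaf path costs the sum of those nodes' full sizes per update, not just the size of the one changed entry:

```python
# Back-of-envelope cost of rewriting a full root-to-leaf path
# (assumed figures, not measured CouchDB numbers): each appended node
# copy carries ALL of that node's entries, not just the changed one.
ENTRY_BYTES = 20                       # assumed size of one key+pointer pair

# Fan-out (entry count) of the node at each level, root first; a
# 4-level tree with fan-out 100 addresses ~100M leaf entries.
path_fanouts = [100, 100, 100, 100]

bytes_per_update = sum(fanout * ENTRY_BYTES for fanout in path_fanouts)
print(bytes_per_update)   # 8000 bytes of index appended per single-doc update
```

Under these assumptions a single-document update appends a few KB of index data, which is consistent with the growth-until-compaction behavior described above.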
> >>> Best wishes,
> >>> Riyad
> >>>
> >>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <[email protected]> wrote:
> >>>
> >>>> @Filipe - I was just not clear on how CouchDB operated; you and
> >>>> Robert cleared that up for me. Thank you.
> >>>>
> >>>> @Robert - The writeup is excellent so far (I am not familiar with
> >>>> Erlang, so there is a bit of stickiness there); thank you for
> >>>> taking the time to put this together!
> >>>>
> >>>> At this point I am curious how the _id and _seq indices are read,
> >>>> as their data is continually appended to the end of the data file
> >>>> in small diff-trees for every updated doc.
> >>>>
> >>>> If CouchDB kept all the indices in memory and simply patched in
> >>>> the updated paths at runtime (maybe something akin to
> >>>> memory-mapped indices in MongoDB) I would be fairly clear on the
> >>>> operation... but as I understand it, CouchDB keeps such a small
> >>>> memory footprint by doing no in-memory caching and relying on the
> >>>> intelligence of the OS and filesystem (and/or drives) to cache
> >>>> frequently accessed data.
> >>>>
> >>>> I am trying to understand the logic used by CouchDB to answer a
> >>>> query using the index once updates to the tree have been appended
> >>>> to the data file... for example, consider a CouchDB datastore like
> >>>> the one Filipe has... 10 million documents, and let's say it is
> >>>> freshly compacted.
> >>>>
> >>>> If I send in a request to that Couch instance, it hits the header
> >>>> of the data file along with the index and walks the B+ tree to the
> >>>> leaf node, where it finds the offset into the data file where the
> >>>> actual doc lives... let's say 1,000,000 bytes away.
> >>>>
> >>>> These B+ trees are shallow, so it might look something like this:
> >>>>
> >>>> Level 1: 1 node, root node.
> >>>> Level 2: 100 nodes, inner child nodes
> >>>> Level 3: 10,000 nodes, inner child nodes
> >>>> Level 4: 1,000,000 leaf nodes... all with pointers to the data
> >>>> offsets in the data file.
> >>>>
> >>>> Now let's say I write 10 updates to documents in that file. There
> >>>> are 10 new revisions appended to the end of the data file, *each
> >>>> one* separated by a rewritten B+ path to a leaf node with its new
> >>>> location at the end of the file. Each of those paths written
> >>>> between each doc revision (say roughly 2k, like Filipe mentioned)
> >>>> is just a 4-item path... root -> level 2 -> level 3 -> level 4...
> >>>> showing the discrete path from the root to that individual updated
> >>>> doc. The intermediary levels (1, 2, 3) are not fully fleshed out
> >>>> with all the OTHER children from the original B+ tree index.
> >>>>
> >>>> [is this correct so far? If not, please point out my mistakes...]
> >>>>
> >>>> Now I issue a query for a document that WAS NOT updated...
> >>>>
> >>>> **** this is where I get confused on the logic ****
> >>>>
> >>>> This would mean I need to access the original B+ tree index at the
> >>>> root of the data file, because the revised B+ paths that are
> >>>> written between each of the updated doc revisions at the end of
> >>>> the file are not full indices.
> >>>>
> >>>> NOW consider I want to query for one of the changed docs... now I
> >>>> suddenly need to scan backwards from the data file's end to find
> >>>> the updated path to the new revision of that document.
> >>>>
> >>>> (Obviously) this isn't what Couch is actually doing... it's doing
> >>>> something more elegant, I just can't figure out what or how, and
> >>>> that is what I was hoping for help with.
> >>>>
> >>>> Much thanks guys, I know this is a heavy question to ask.
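For what it's worth, the "more elegant" mechanism asked about here is copy-on-write: each commit appends the rewritten path and then a header pointing at the new root, and the rewritten nodes reference every untouched subtree at its old file offset. A lookup therefore walks one complete, consistent tree from the most recent header; it never scans backwards through per-update diffs. A toy in-memory sketch (list indices stand in for file offsets; this is an illustration of the idea, not CouchDB's actual on-disk format):

```python
# Toy in-memory model of copy-on-write B+ tree updates (list indices
# stand in for file offsets; not the real CouchDB on-disk format).
storage = []                               # append-only "file" of nodes

def append(node):
    storage.append(node)
    return len(storage) - 1                # the node's "offset"

# Initial tree: two leaves plus a root, then a header naming the root.
l0 = append({"leaf": {"a": "v1", "b": "v1"}})
l1 = append({"leaf": {"x": "v1", "y": "v1"}})
root0 = append({"inner": [("m", l0), ("~", l1)]})
header0 = append({"root": root0})

# Update doc "x": only the touched leaf and the root are rewritten;
# the untouched leaf l0 is reused by its old offset.
l1b = append({"leaf": {"x": "v2", "y": "v1"}})
root1 = append({"inner": [("m", l0), ("~", l1b)]})
header1 = append({"root": root1})

def lookup(header, key):
    """Walk one complete tree from whatever root the header names."""
    node = storage[storage[header]["root"]]
    while "inner" in node:
        for sep, off in node["inner"]:
            if key <= sep:
                node = storage[off]
                break
    return node["leaf"].get(key)

print(lookup(header1, "a"))   # v1 -- untouched doc, via the reused leaf
print(lookup(header1, "x"))   # v2 -- updated doc, via the new path
```

Note that `header0` remains valid after the update, which is also what makes the append-only design crash-safe: a reader can always fall back to the last fully written header.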
> >>>> Best wishes,
> >>>> R
> >>>>
> >>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne
> >>>> <[email protected]> wrote:
> >>>>
> >>>>> Robert Dionne
> >>>>> Computer Programmer
> >>>>> [email protected]
> >>>>> 203.231.9961
> >>>>>
> >>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
> >>>>>
> >>>>>> Filipe,
> >>>>>>
> >>>>>> Thank you for the reply.
> >>>>>>
> >>>>>> Maybe I am misunderstanding exactly what Couch is writing out;
> >>>>>> the docs I've read say that it "rewrites the root node" -- I
> >>>>>> can't tell if the docs mean the parent node of the child doc
> >>>>>> that was changed (as one of the B+ leaves), or if it means the
> >>>>>> direct path, from the root, to that individual doc... or if it
> >>>>>> means the *entire* index...
> >>>>>>
> >>>>>> In the case of even rewriting the single parent, with such a
> >>>>>> shallow tree, each internal node will have a huge fan of
> >>>>>> children; let's say 1-10k in a decent-sized data set.
> >>>>>>
> >>>>>> If you are seeing a few K of extra data written out after each
> >>>>>> changed doc then that cannot be right... I almost assumed my
> >>>>>> understanding was wrong because the sheer volume of data would
> >>>>>> make performance abysmal if it were true.
> >>>>>>
> >>>>>> Given that... is it just the changed path, from the root to the
> >>>>>> new leaf, that is rewritten?
> >>>>>
> >>>>> Hi Riyad,
> >>>>>
> >>>>> You are correct, it's only the changed path. Interestingly, I've
> >>>>> just started to document all these internals[1], along with links
> >>>>> to the code and other references available.
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Bob
> >>>>>
> >>>>> [1] http://bdionne.github.com/couchdb/
> >>>>>
> >>>>>> That makes me all sorts of curious as to how Couch
> >>>>>> updates/searches the newly modified index with the small diff
> >>>>>> that is written out.
> >>>>>>
> >>>>>> Any pointers to reading that will help me dig down on this (even
> >>>>>> early bugs in JIRA?) would be appreciated. I've tried skimming
> >>>>>> back to 2007/08 on Damien's blog to see if he wrote about it in
> >>>>>> depth, and so far I haven't found anything as detailed as I am
> >>>>>> hoping for on this architecture.
> >>>>>>
> >>>>>> Best,
> >>>>>> Riyad
> >>>>>>
> >>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <[email protected]> wrote:
> >>>>>>>> I've been reading everything I can find on the CouchDB file
> >>>>>>>> format[1] and am getting bits and pieces here and there, but
> >>>>>>>> not a great, concrete, step-by-step explanation of the
> >>>>>>>> process.
> >>>>>>>>
> >>>>>>>> I'm clear on the use of B+ trees, and after reading a few
> >>>>>>>> papers on the benefits of log-structured file formats, I
> >>>>>>>> understand the benefits of inlining the B+ tree indices
> >>>>>>>> directly into the data file as well (locality + sequential
> >>>>>>>> I/O)... what I'm flummoxed about is how much of the B+ tree's
> >>>>>>>> index is rewritten after every modified document.
> >>>>>>>>
> >>>>>>>> Consider a CouchDB file that looks more or less like this:
> >>>>>>>>
> >>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
> >>>>>>>>
> >>>>>>>> After each revised doc is written and the "b-tree root" is
> >>>>>>>> rewritten after that, is that just a modified root node of the
> >>>>>>>> B+ tree, or the entire B+ tree?
> >>>>>>>>
> >>>>>>>> The reason I ask is because regardless of the answer to my
> >>>>>>>> previous question, for a *huge* database with millions of
> >>>>>>>> records, that seems like an enormous amount of data to rewrite
> >>>>>>>> after every modification.
> >>>>>>>> Say the root node had a fanning factor of 133; that would
> >>>>>>>> still be a lot of data to rewrite.
> >>>>>>>
> >>>>>>> Hi Riyad,
> >>>>>>>
> >>>>>>> Have you observed that in practice?
> >>>>>>>
> >>>>>>> Typically the depth of database btrees is not that high, even
> >>>>>>> for millions of documents. For example, I have one around with
> >>>>>>> about 10 million documents which doesn't have more than 5 or 6
> >>>>>>> levels, if I recall correctly.
> >>>>>>>
> >>>>>>> So updating a doc, for that particular case, means rewriting 5
> >>>>>>> or 6 new nodes plus the document itself. Each node is normally
> >>>>>>> not much bigger than 1.2Kb.
> >>>>>>>
> >>>>>>> I once wrote a tool to analyze database files which reports
> >>>>>>> btree depths; however, it's not updated to work with recent
> >>>>>>> changes on master/1.2.x such as snappy compression and btree
> >>>>>>> sizes:
> >>>>>>>
> >>>>>>> https://github.com/fdmanana/couchfoo
> >>>>>>>
> >>>>>>> It should work with CouchDB 1.1 (and older) database files.
> >>>>>>>
> >>>>>>>> I am certain I am missing the boat on this; if anyone can pull
> >>>>>>>> me out of the water and point me to dry land I'd appreciate
> >>>>>>>> it.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> R
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>> -- http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
> >>>>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
> >>>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
> >>>>>>>> -- http://ayende.com/blog/* (Over my head)
> >>>>>>>
> >>>>>>> --
> >>>>>>> Filipe David Manana,
> >>>>>>>
> >>>>>>> "Reasonable men adapt themselves to the world.
> >>>>>>> Unreasonable men adapt the world to themselves.
> >>>>>>> That's why all progress depends on unreasonable men."
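Putting Filipe's figures together (his numbers, back-of-envelope arithmetic): at 5-6 levels deep and roughly 1.2Kb per rewritten node, each single-document update appends a few Kb of index data, which explains why single-insert-heavy workloads balloon until compaction while bulk updates amortize the path rewrites across many documents:

```python
# Filipe's figures from the thread, back-of-envelope arithmetic:
# 5-6 levels deep, each rewritten node roughly 1.2 Kb.
depth = 6
node_kb = 1.2

overhead_kb = depth * node_kb
print(round(overhead_kb, 1))   # 7.2 Kb of index data appended per update

# After 1,000,000 single-doc updates, before compaction:
total_gb = overhead_kb * 1_000_000 / (1024 * 1024)
print(round(total_gb, 1))      # ~6.9 Gb of superseded index nodes
```

The superseded nodes are reclaimed by compaction, which copies only the live tree into a fresh file; this matches the "collections explode in size until compacted" observation earlier in the thread.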
