Re: Understanding the CouchDB file format

Riyad Kalla Tue, 20 Dec 2011 17:14:06 -0800

@Filipe - I was just not clear on how CouchDB operated; you and Robert
cleared that up for me. Thank you.

@Robert - The writeup is excellent so far (I am not familiar with erlang,
so there is a bit of stickiness there), thank you for taking the time to
put this together!

At this point I am curious how the _id and _seq indices are read as their
data is continually appended to the end of the data file in small
diff-trees for every updated doc.

If CouchDB kept all the indices in-memory and simply patched-in the updated
paths at runtime (maybe something akin to memory-mapped indices in MongoDB)
I would be fairly clear on the operation... but as I understand it, CouchDB
keeps such a small memory footprint by doing no in-memory caching and
relying on the intelligence of the OS and filesystem (and/or drives) to
cache frequently accessed data.

I am trying to understand the logic used by CouchDB to answer a query using
the index once updates to the tree have been appended to the data file...
for example, consider a CouchDB datastore like the one Filipe has... 10
million documents and let's say it is freshly compacted.

If I send in a request to that Couch instance, it hits the header of the
data file along with the index and walks the B+ tree to the leaf node,
where it finds the offset into the data file where the actual doc lives...
let's say 1,000,000 bytes away.

These B+ trees are shallow, so it might look something like this:

Level 1: 1 node, root node.
Level 2: 100 nodes, inner child nodes
Level 3: 10,000 nodes, inner child nodes
Level 4: 1,000,000, leaf nodes... all with pointers to the data offsets in
the data file.

Now let's say I write 10 updates to documents in that file. There are 10
new revisions appended to the end of the data file *each one* separated by
a rewritten B+ path to a leaf node with it's new location at the end of the
file. Each of those paths written between each doc revision (say roughly 2k
like Filipe mentioned) are just 4 item paths... root -> level1 -> level2 ->
level3 --> level4... showing the discrete path from the root to that
individual updated doc. The intermediary levels (l1, 2, 3) are not fully
flushed out with all the OTHER children from the original b+ tree index.

[[ is this correct so far? If not, please point out my mistakes...]

Now I issue a query for a document that WAS NOT updated...

**** this is where I get confused on the logic ***

this would mean I need to access the original B+ tree index at the root of
the data file, because the revised B+ paths that are written between each
of the updated doc revisions at the end of the file are not full indices.

NOW consider I want to query for one of the changed docs... now I suddenly
need to scan backwards from the data file's end to find the updated path to
the new revision of that document.

(obviously) this isn't what Couch is actually doing... it's doing something
more elegant, I just can't figure out what or how and that is what I was
hoping for help with.

Much thanks guys, I know this is a heavy question to ask.

Best wishes,
R

On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <dio...@dionne-associates.com
> wrote:

>
> Robert Dionne
> Computer Programmer
> dio...@dionne-associates.com
> 203.231.9961
>
>
>
>
> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>
> > Filipe,
> >
> > Thank you for the reply.
> >
> > Maybe I am misunderstanding exactly what couch is writing out; the docs
> > I've read say that it "rewrites the root node" -- I can't tell if the
> docs
> > mean the parent node of the child doc that was changed (as one of the b+
> > leaves) or if it means the direct path, from the root, to that individual
> > doc... or if it means the *entire* index...
> >
> > In the case of even rewriting the single parent, with such a shallow
> tree,
> > each internal leaf will have a huge fan of nodes; let's say 1-10k in a
> > decent sized data set.
> >
> > If you are seeing a few K of extra written out after each changed doc
> then
> > that cannot be write... I almost assumed my understanding was wrong
> because
> > the sheer volume of data would make performance abysmal if it was true.
> >
> > Given that... is it just the changed path, from the root to the new leaf
> > that is rewritten?
>
> Hi Riyad,
>
> You are correct, it's only the changed path. Interestingly I've just
> started to document all these internals[1] along with links to the code and
> other references available.
>
> Cheers,
>
> Bob
>
>
> [1] http://bdionne.github.com/couchdb/
>
> > That makes me all sorts of curious as to how Couch
> > updates/searches the new modified index with the small diff that is
> written
> > out.
> >
> > Any pointers to reading that will help me dig down on this (even early
> bugs
> > in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
> > Damien's blog to see if it wrote about it in depth and so far haven't
> found
> > anything as detailed as I am hoping for on this architecture.
> >
> > Best,
> > Riyad
> >
> > On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
> fdman...@apache.org>wrote:
> >
> >> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rka...@gmail.com> wrote:
> >>> I've been reading everything I can find on the CouchDB file format[1]
> and
> >>> am getting bits and pieces here and there, but not a great, concrete,
> >>> step-by-step explanation of the process.
> >>>
> >>> I'm clear on the use of B+ trees and after reading a few papers on the
> >>> benefits of log-structured file formats, I understand the benefits of
> >>> inlining the B+ tree indices directly into the data file as well
> >> (locality
> >>> + sequential I/O)... what I'm flummoxed about is how much of the B+
> >> tree's
> >>> index is rewritten after every modified document.
> >>>
> >>> Consider a CouchDB file that looks more or less like this:
> >>>
> >>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
> >>>
> >>> After each revised doc is written and the "b-tree root" is rewritten
> >> after
> >>> that, is that just a modified root node of the B+ tree or the entire B+
> >>> tree?
> >>>
> >>> The reason I ask is because regardless of the answer to my previous
> >>> question, for a *huge* database will millions of records, that seems
> like
> >>> an enormous amount of data to rewrite after every modification. Say the
> >>> root node had a fanning factor of 133; that would still be alot of data
> >> to
> >>> rewrite.
> >>
> >> Hi Riyad,
> >>
> >> Have you observed that in practice?
> >>
> >> Typically the depth of database btrees is not that high even for
> >> millions of documents. For example I have one around with about 10
> >> million documents which doesn't have more than 5 or 6 levels if I
> >> recall correctly.
> >>
> >> So updating a doc, for that particular case, means rewriting 5 or 6
> >> new nodes plus the document itself. Each node is normally not much
> >> bigger than 1.2Kb.
> >>
> >> I've written once a tool to analyze database files which reports btree
> >> depths, however it's not updated to work with recent changes on
> >> master/1.2.x such as snappy compression and btree sizes:
> >>
> >> https://github.com/fdmanana/couchfoo
> >>
> >> It should work with CouchDB 1.1 (and older) database files.
> >>
> >>>
> >>> I am certain I am missing the boat on this; if anyone can pull me out
> of
> >>> the water and point me to dry land I'd appreciate it.
> >>>
> >>> Best,
> >>> R
> >>>
> >>>
> >>>
> >>> [1]
> >>> --
> >>>
> >>
> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
> >>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
> >>> -- http://guide.couchdb.org/editions/1/en/btree.html
> >>> -- http://ayende.com/blog/* (Over my head)
> >>
> >>
> >>
> >> --
> >> Filipe David Manana,
> >>
> >> "Reasonable men adapt themselves to the world.
> >> Unreasonable men adapt the world to themselves.
> >> That's why all progress depends on unreasonable men."
> >>
>
>

Re: Understanding the CouchDB file format

Reply via email to