Riyad,

You're welcome. At a quick glance your post has one error: internal nodes do contain values (from the reductions). The appendix in the CouchDB book also makes this error [1], which I've opened a ticket for.
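For anyone skimming the thread, here is a toy sketch of that distinction (illustrative Python, not CouchDB's actual Erlang records): internal nodes carry a reduction value computed over each subtree, while only the leaves carry the document pointers.

```python
# Illustrative only -- CouchDB's real node structures are Erlang terms;
# the shapes and names below are invented for this sketch.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Leaf:
    # key -> byte offset of the doc info in the file (hypothetical)
    entries: List[Tuple[str, int]]

@dataclass
class Internal:
    # Each child pointer is stored WITH the reduction over that subtree,
    # which is why internal nodes are not "pointers only".
    children: List[Tuple[object, int]]

def reduce_count(node) -> int:
    """A 'count' style reduction, recomputed bottom-up on every write."""
    if isinstance(node, Leaf):
        return len(node.entries)
    return sum(r for _, r in node.children)

leaf_a = Leaf(entries=[("a", 100), ("b", 340)])
leaf_b = Leaf(entries=[("c", 800)])
root = Internal(children=[(leaf_a, reduce_count(leaf_a)),
                          (leaf_b, reduce_count(leaf_b))])
print(reduce_count(root))  # 3 -- the root's reduction covers all docs
```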
Cheers,

Bob

[1] https://github.com/oreilly/couchdb-guide/issues/450

On Dec 21, 2011, at 3:28 PM, Riyad Kalla wrote:

> Bob,
>
> Really appreciate the link; Rick has a handful of articles that helped a lot.
>
> Alongside all the CouchDB reading, I've been looking at SSD-optimized data storage mechanisms and tried to coalesce all of this information into this post on Couch's file storage format:
> https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv
>
> It is uncanny how many things Couch seems to have gotten right with regard to existing storage systems and future flash-based storage systems. I'd appreciate any corrections, additions or feedback on the post from anyone interested.
>
> Best,
> R
>
> On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne <[email protected]> wrote:
>
>> I think this is largely correct, Riyad. I dug out an old article [1] by Rick Ho that you may also find helpful, though it might be slightly dated. Generally the best performance is had when the ids are sequential and updates are done in bulk. Write-heavy applications will eat up a lot of space and require compaction. At the leaf nodes, what are stored are either full_doc_info or doc_info records, which hold pointers to the data; so the main things that impact the branching at each level are the key size and, in the case of views, the sizes of the reductions, since these are stored with the intermediate nodes.
>>
>> All in all it works pretty well, but as always you need to test and evaluate it for your specific case to see what the limits are.
>>
>> Regards,
>>
>> Bob
>>
>> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>>
>> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
>>
>>> Adding to this conversation, I found this set of slides by Chris explaining the append-only index update format:
>>> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
>>>
>>> Specifically slides 16, 17 and 18.
>>> Using this example tree, rewriting the updated path (in reverse order) and appending it to the end of the file makes sense... you can see how index queries can simply read backwards from the end of the file and find not only the latest revisions of docs, but also every other doc that wasn't touched (the query will just seek into the existing inner nodes of the B+ tree while searching).
>>>
>>> What I am hoping for clarification on is the following pain point that I perceive with this approach:
>>>
>>> 1. In a sufficiently shallow B+ tree (like CouchDB's), the paths to elements are short (typically no more than 3 to 5 levels deep), as opposed to a trie or some other construct that would have much longer paths to elements.
>>>
>>> 2. Because the depth of the tree is so shallow, its breadth becomes large to compensate... more specifically, each internal node can have 100s, 1000s or more children. Using the example slides, consider the nodes [A...M] and [R...Z] -- in a good-sized CouchDB database, those internal index nodes would have 100s (or more) elements in them, pointing at deeper internal nodes that themselves have thousands of elements, instead of the 13 or so implied by [A...M].
>>>
>>> 3. Looking at slides 17 and 18, where the direct B+ tree path to the updated node gets appended to the end of the file after the revision is written (leaf-to-root ordering: [J' M] -> [A M] -> [A Z]), it implies that those internal nodes, with *all* their child elements, are getting rewritten as well.
>>>
>>> In this example tree it isn't such a big issue... but in a sufficiently large CouchDB database, these nodes denoted by [A...M] and [A...Z] could be quite large...
>>> I don't know the format of the node elements in the B+ tree, but it would be whatever the size of a node element is, times however many elements are contained at each level (1 for root, say 100 for level 2, 1,000 for level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on here; of course it depends on the size of the data store).
>>>
>>> Am I missing something, or is CouchDB really rewriting that much index information between document revisions on every update?
>>>
>>> What was previously confusing me is that I thought it was *only* rewriting a direct path to the updated revision, like [B] -> [E] -> [J'], and Couch was somehow patching that updated path info into the B+ index at runtime.
>>>
>>> If Couch is rewriting entire node paths with all their elements, then I am no longer confused about the B+ index updates, but am curious about the on-disk cost of this.
>>>
>>> In my own rough insertion testing, that would explain why I see my collections absolutely explode in size until they are compacted (not using bulk insert, but intentionally doing single inserts for millions of docs to see what the cost of the index path duplication would be like).
>>>
>>> Can anyone confirm/deny/correct this assessment? I want to make sure I am on the right track in understanding this.
>>>
>>> Best wishes,
>>> Riyad
>>>
>>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <[email protected]> wrote:
>>>
>>>> @Filipe - I was just not clear on how CouchDB operated; you and Robert cleared that up for me. Thank you.
>>>>
>>>> @Robert - The writeup is excellent so far (I am not familiar with Erlang, so there is a bit of stickiness there); thank you for taking the time to put this together!
>>>>
>>>> At this point I am curious how the _id and _seq indices are read, as their data is continually appended to the end of the data file in small diff-trees for every updated doc.
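To put rough numbers on the hand-waving in the thread: if each rewritten node on the path really carried all of its children, the per-update cost would be roughly the path depth times the fan-out times the entry size. A sketch, where every number is a made-up assumption in the spirit of the 1 / 100 / 1,000 / 10,000 figures above:

```python
# Hypothetical: an update rewrites ONE node per level, and (under the
# hypothesis being questioned) each rewritten node contains ALL of its
# children. Entry size and fan-outs are invented for illustration.
entry_size = 40                 # assumed bytes per (key, child-pointer) entry
fanout = [100, 100, 100, 100]   # children held by the one rewritten node per level

bytes_per_update = sum(f * entry_size for f in fanout)
print(bytes_per_update)  # 16000 -> ~16 KB of index rewritten per update
```

Even with these modest assumptions, millions of single-doc updates would copy a great deal of index data, which is exactly the file-growth-until-compaction effect described above.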
>>>> If CouchDB kept all the indices in memory and simply patched in the updated paths at runtime (maybe something akin to memory-mapped indices in MongoDB), I would be fairly clear on the operation... but as I understand it, CouchDB keeps such a small memory footprint by doing no in-memory caching and relying on the intelligence of the OS and filesystem (and/or drives) to cache frequently accessed data.
>>>>
>>>> I am trying to understand the logic used by CouchDB to answer a query using the index once updates to the tree have been appended to the data file... for example, consider a CouchDB datastore like the one Filipe has... 10 million documents, and let's say it is freshly compacted.
>>>>
>>>> If I send a request to that Couch instance, it hits the header of the data file along with the index and walks the B+ tree to the leaf node, where it finds the offset into the data file where the actual doc lives... let's say 1,000,000 bytes away.
>>>>
>>>> These B+ trees are shallow, so it might look something like this:
>>>>
>>>> Level 1: 1 node, the root node.
>>>> Level 2: 100 nodes, inner child nodes.
>>>> Level 3: 10,000 nodes, inner child nodes.
>>>> Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets in the data file.
>>>>
>>>> Now let's say I write 10 updates to documents in that file. There are 10 new revisions appended to the end of the data file, *each one* separated by a rewritten B+ path to a leaf node with its new location at the end of the file. Each of those paths written between the doc revisions (roughly 2k, like Filipe mentioned) is just a 4-item path... root (level 1) -> level 2 -> level 3 -> level 4 (leaf)... the discrete path from the root to that individual updated doc. The intermediary levels are not fully fleshed out with all the OTHER children from the original B+ tree index.
>>>> [Is this correct so far? If not, please point out my mistakes...]
>>>>
>>>> Now I issue a query for a document that WAS NOT updated...
>>>>
>>>> *** this is where I get confused on the logic ***
>>>>
>>>> This would mean I need to access the original B+ tree index at the root of the data file, because the revised B+ paths that are written between each of the updated doc revisions at the end of the file are not full indices.
>>>>
>>>> NOW consider that I want to query for one of the changed docs... now I suddenly need to scan backwards from the data file's end to find the updated path to the new revision of that document.
>>>>
>>>> (Obviously) this isn't what Couch is actually doing... it's doing something more elegant; I just can't figure out what or how, and that is what I was hoping for help with.
>>>>
>>>> Much thanks guys, I know this is a heavy question to ask.
>>>>
>>>> Best wishes,
>>>> R
>>>>
>>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <[email protected]> wrote:
>>>>
>>>>> Robert Dionne
>>>>> Computer Programmer
>>>>> [email protected]
>>>>> 203.231.9961
>>>>>
>>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>>>>>
>>>>>> Filipe,
>>>>>>
>>>>>> Thank you for the reply.
>>>>>>
>>>>>> Maybe I am misunderstanding exactly what Couch is writing out; the docs I've read say that it "rewrites the root node" -- I can't tell if the docs mean the parent node of the child doc that was changed (as one of the B+ leaves), or the direct path from the root to that individual doc... or the *entire* index...
>>>>>>
>>>>>> Even in the case of rewriting just the single parent, with such a shallow tree each internal node will have a huge fan of children; let's say 1-10k in a decent-sized data set.
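A minimal copy-on-write sketch may help anchor the question being asked here. This is a toy model of the append-only scheme from the slides, not CouchDB's implementation: each update appends a new leaf plus new copies of its ancestors, the new copies point back at the old offsets of untouched siblings, and reads always start from the most recently written root, so no backward scanning is needed for either updated or untouched docs.

```python
# Toy append-only "file": a list of nodes; a list index stands in for a
# byte offset. Node shapes and names are invented for this sketch.
log = []

def append(node):                  # "write" a node, return its offset
    log.append(node)
    return len(log) - 1

# A node is ("leaf", {key: doc}) or ("node", [(max_key, child_offset)]).
leaf1 = append(("leaf", {"a": "doc-a", "b": "doc-b"}))
leaf2 = append(("leaf", {"x": "doc-x", "y": "doc-y"}))
root = append(("node", [("m", leaf1), ("z", leaf2)]))   # first root

# Update doc "x": append a NEW right leaf and a NEW root. The new root
# reuses the OLD offset of the untouched left leaf -- only the changed
# path (leaf plus ancestors) is rewritten, in leaf-to-root order.
leaf2b = append(("leaf", {"x": "doc-x-rev2", "y": "doc-y"}))
root = append(("node", [("m", leaf1), ("z", leaf2b)]))  # latest root

def lookup(root_off, key):
    """Every read starts from the most recently written root; old and new
    nodes alike are reached by following offsets down from that root."""
    kind, body = log[root_off]
    while kind == "node":
        # descend into the first child whose max key covers `key`
        child_off = next(off for max_key, off in body if key <= max_key)
        kind, body = log[child_off]
    return body[key]

print(lookup(root, "x"))  # doc-x-rev2  (new path)
print(lookup(root, "a"))  # doc-a       (old leaf, reached via the new root)
```

The design choice this illustrates: because the newest root carries offsets to every subtree, old or new, there is no "patching" of the original index at runtime; the original root simply becomes garbage to be reclaimed at compaction.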
>>>>>> If you are seeing only a few K of extra data written out after each changed doc, then that cannot be right... I almost assumed my understanding was wrong, because the sheer volume of data would make performance abysmal if it were true.
>>>>>>
>>>>>> Given that... is it just the changed path, from the root to the new leaf, that is rewritten?
>>>>>
>>>>> Hi Riyad,
>>>>>
>>>>> You are correct, it's only the changed path. Interestingly, I've just started to document all these internals [1], along with links to the code and other references available.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Bob
>>>>>
>>>>> [1] http://bdionne.github.com/couchdb/
>>>>>
>>>>>> That makes me all sorts of curious as to how Couch updates/searches the newly modified index with the small diff that is written out.
>>>>>>
>>>>>> Any pointers to reading that will help me dig into this (even early bugs in JIRA?) would be appreciated. I've tried skimming back to 2007/08 on Damien's blog to see if he wrote about it in depth, and so far haven't found anything as detailed as I am hoping for on this architecture.
>>>>>>
>>>>>> Best,
>>>>>> Riyad
>>>>>>
>>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <[email protected]> wrote:
>>>>>>
>>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <[email protected]> wrote:
>>>>>>>> I've been reading everything I can find on the CouchDB file format [1] and am getting bits and pieces here and there, but not a great, concrete, step-by-step explanation of the process.
>>>>>>>>
>>>>>>>> I'm clear on the use of B+ trees, and after reading a few papers on the benefits of log-structured file formats, I understand the benefits of inlining the B+ tree indices directly into the data file as well (locality + sequential I/O)...
>>>>>>>> what I'm flummoxed about is how much of the B+ tree's index is rewritten after every modified document.
>>>>>>>>
>>>>>>>> Consider a CouchDB file that looks more or less like this:
>>>>>>>>
>>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]...
>>>>>>>>
>>>>>>>> After each revised doc is written and the "b-tree root" is rewritten after it, is that just a modified root node of the B+ tree, or the entire B+ tree?
>>>>>>>>
>>>>>>>> The reason I ask is because, regardless of the answer to my previous question, for a *huge* database with millions of records that seems like an enormous amount of data to rewrite after every modification. Say the root node had a fanning factor of 133; that would still be a lot of data to rewrite.
>>>>>>>
>>>>>>> Hi Riyad,
>>>>>>>
>>>>>>> Have you observed that in practice?
>>>>>>>
>>>>>>> Typically the depth of database btrees is not that high, even for millions of documents. For example, I have one around with about 10 million documents which doesn't have more than 5 or 6 levels, if I recall correctly.
>>>>>>>
>>>>>>> So updating a doc, for that particular case, means rewriting 5 or 6 new nodes plus the document itself. Each node is normally not much bigger than 1.2 KB.
>>>>>>>
>>>>>>> I once wrote a tool to analyze database files which reports btree depths; however, it's not updated to work with recent changes on master/1.2.x, such as snappy compression and btree sizes:
>>>>>>>
>>>>>>> https://github.com/fdmanana/couchfoo
>>>>>>>
>>>>>>> It should work with CouchDB 1.1 (and older) database files.
>>>>>>>
>>>>>>>> I am certain I am missing the boat on this; if anyone can pull me out of the water and point me to dry land, I'd appreciate it.
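Taking Filipe's figures at face value, the arithmetic per update is small, but it compounds across many single-doc updates, which matches the "collections explode until compacted" observation earlier in the thread. The update count and the exact depth below are assumptions for illustration:

```python
# Filipe's reported numbers, restated as a back-of-the-envelope estimate.
depth = 6            # upper end of his 5-6 btree levels for ~10M docs
node_size = 1200     # each rewritten node ~1.2 KB
updates = 1_000_000  # hypothetical run of single-doc updates

per_update = depth * node_size            # index bytes rewritten per update
total_gb = per_update * updates / 1e9     # growth before compaction, docs excluded
print(per_update)          # 7200 -> ~7 KB of index per update
print(round(total_gb, 1))  # 7.2  -> ~7.2 GB of stale index after 1M updates
```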
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> R
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> -- http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
>>>>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>>>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
>>>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
>>>>>>>> -- http://ayende.com/blog/* (over my head)
>>>>>>>
>>>>>>> --
>>>>>>> Filipe David Manana,
>>>>>>>
>>>>>>> "Reasonable men adapt themselves to the world.
>>>>>>> Unreasonable men adapt the world to themselves.
>>>>>>> That's why all progress depends on unreasonable men."
