Bob, Really appreciate the link; Rick has a handful of articles that helped a lot.
Along side all the CouchDB reading I've been looking at SSD-optimized data storage mechanisms and tried to coalesce all of this information into this post on Couch's file storage format: https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv It is uncanny how many things Couch seems to have gotten right with regard to existing storage systems and future flash-based storage systems. I'd appreciate any corrections, additions or feedback to the post for anyone interested. Best, R On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne < dio...@dionne-associates.com> wrote: > I think this is largely correct Riyad, I dug out an old article[1] by Rick > Ho that you may also find helpful though it might be slightly dated. > Generally the best performance will be had if the ids are sequential and > updates are done in bulk. Write heavy applications will eat up a lot of > space and require compaction. At the leaf nodes what are stored are either > full_doc_info records or doc_info records which store pointers to the data > so the main thing that impacts the branching at each level are the key size > and in the case of views the sizes of the reductions as these are stored > with the intermediate nodes. > > All in all it works pretty well but as always you need to test and > evaluate it for you specific case to see what the limits are. > > Regards, > > Bob > > > [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html > > > > > On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote: > > > Adding to this conversation, I found this set of slides by Chris > explaining > > the append-only index update format: > > http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed > > > > Specifically slides 16, 17 and 18. > > > > Using this example tree, rewriting the updated path (in reverse order) > > appended to the end of the file makes sense... you see how index queries > > can simply read backwards from the end of the file and not only find the > > latest revisions of docs, but also every other doc that wasn't touched > (it > > will just seek into the existing inner nodes of the b+ tree for > searching). > > > > What I am hoping for clarification on is the following pain-point that I > > perceive with this approach: > > > > 1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves > > to elements are short (typically no more than 3 to 5 levels deep) as > > opposed to a trie or some other construct that would have much longer > paths > > to elements. > > > > 2. Because the depth of the tree is so shallow, the breadth of it becomes > > large to compensate... more specifically, each internal node can have > 100s, > > 1000s or more children. Using the example slides, consider the nodes > > [A...M] and [R...Z] -- in a good sized CouchDB database, those internal > > index nodes would have 100s (or more) elements in them pointing at deeper > > internal nodes that themselves had thousands of elements; instead of the > 13 > > or so as implied by [A...M]. > > > > 3. Looking at slide 17 and 18, where you see the direct B+ tree path to > the > > update node getting appended to the end of the file after the revision is > > written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies that > > those internal nodes with *all* their child elements are getting > rewritten > > as well. > > > > In this example tree, it is isn't such a big issue... but in a > sufficiently > > large CouchDB database, these nodes denoted by [A...M] and [A...Z] could > be > > quite large... I don't know the format of the node elements in the B+ > tree, > > but it would be whatever the size of a node is times however many > elements > > are contained at each level (1 for root, say 100 for level 2, 1000 for > > level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on > > here, of course it depends on the size of the data store). > > > > Am I missing something or is CouchDB really rewriting that much index > > information between document revisions on every update? > > > > What was previously confusing me is I thought it was *only* rewriting a > > direct path to the updated revision, like [B]>[E]>[J'] and Couch was > > some-how patching in that updated path info to the B+ index at runtime. > > > > If couch is rewriting entire node paths with all their elements then I am > > no longer confused about the B+ index updates, but am curious about the > > on-disk cost of this. > > > > In my own rough insertion testing, that would explain why I see my > > collections absolutely explode in size until they are compacted (not > using > > bulk insert, but intentionally doing single inserts for a million(s) of > > docs to see what kind of cost the index path duplication would be like). > > > > Can anyone confirm/deny/correct this assessment? I want to make sure I am > > on the right track understanding this. > > > > Best wishes, > > Riyad > > > > On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rka...@gmail.com> wrote: > > > >> @Filipe - I was just not clear on how CouchDB operated; you and Robert > >> cleared that up for me. Thank you. > >> > >> @Robert - The writeup is excellent so far (I am not familiar with > erlang, > >> so there is a bit of stickiness there), thank you for taking the time to > >> put this together! > >> > >> At this point I am curious how the _id and _seq indices are read as > their > >> data is continually appended to the end of the data file in small > >> diff-trees for every updated doc. > >> > >> If CouchDB kept all the indices in-memory and simply patched-in the > >> updated paths at runtime (maybe something akin to memory-mapped indices > in > >> MongoDB) I would be fairly clear on the operation... but as I understand > >> it, CouchDB keeps such a small memory footprint by doing no in-memory > >> caching and relying on the intelligence of the OS and filesystem (and/or > >> drives) to cache frequently accessed data. > >> > >> I am trying to understand the logic used by CouchDB to answer a query > >> using the index once updates to the tree have been appended to the data > >> file... for example, consider a CouchDB datastore like the one Filipe > >> has... 10 million documents and let's say it is freshly compacted. > >> > >> If I send in a request to that Couch instance, it hits the header of the > >> data file along with the index and walks the B+ tree to the leaf node, > >> where it finds the offset into the data file where the actual doc > lives... > >> let's say 1,000,000 bytes away. > >> > >> These B+ trees are shallow, so it might look something like this: > >> > >> Level 1: 1 node, root node. > >> Level 2: 100 nodes, inner child nodes > >> Level 3: 10,000 nodes, inner child nodes > >> Level 4: 1,000,000, leaf nodes... all with pointers to the data offsets > in > >> the data file. > >> > >> Now let's say I write 10 updates to documents in that file. There are 10 > >> new revisions appended to the end of the data file *each one* separated > by > >> a rewritten B+ path to a leaf node with it's new location at the end of > the > >> file. Each of those paths written between each doc revision (say > roughly 2k > >> like Filipe mentioned) are just 4 item paths... root -> level1 -> > level2 -> > >> level3 --> level4... showing the discrete path from the root to that > >> individual updated doc. The intermediary levels (l1, 2, 3) are not fully > >> flushed out with all the OTHER children from the original b+ tree index. > >> > >> [[ is this correct so far? If not, please point out my mistakes...] > >> > >> Now I issue a query for a document that WAS NOT updated... > >> > >> **** this is where I get confused on the logic *** > >> > >> this would mean I need to access the original B+ tree index at the root > of > >> the data file, because the revised B+ paths that are written between > each > >> of the updated doc revisions at the end of the file are not full > indices. > >> > >> NOW consider I want to query for one of the changed docs... now I > suddenly > >> need to scan backwards from the data file's end to find the updated > path to > >> the new revision of that document. > >> > >> (obviously) this isn't what Couch is actually doing... it's doing > >> something more elegant, I just can't figure out what or how and that is > >> what I was hoping for help with. > >> > >> Much thanks guys, I know this is a heavy question to ask. > >> > >> Best wishes, > >> R > >> > >> > >> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne < > >> dio...@dionne-associates.com> wrote: > >> > >>> > >>> Robert Dionne > >>> Computer Programmer > >>> dio...@dionne-associates.com > >>> 203.231.9961 > >>> > >>> > >>> > >>> > >>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote: > >>> > >>>> Filipe, > >>>> > >>>> Thank you for the reply. > >>>> > >>>> Maybe I am misunderstanding exactly what couch is writing out; the > docs > >>>> I've read say that it "rewrites the root node" -- I can't tell if the > >>> docs > >>>> mean the parent node of the child doc that was changed (as one of the > b+ > >>>> leaves) or if it means the direct path, from the root, to that > >>> individual > >>>> doc... or if it means the *entire* index... > >>>> > >>>> In the case of even rewriting the single parent, with such a shallow > >>> tree, > >>>> each internal leaf will have a huge fan of nodes; let's say 1-10k in a > >>>> decent sized data set. > >>>> > >>>> If you are seeing a few K of extra written out after each changed doc > >>> then > >>>> that cannot be write... I almost assumed my understanding was wrong > >>> because > >>>> the sheer volume of data would make performance abysmal if it was > true. > >>>> > >>>> Given that... is it just the changed path, from the root to the new > leaf > >>>> that is rewritten? > >>> > >>> Hi Riyad, > >>> > >>> You are correct, it's only the changed path. Interestingly I've just > >>> started to document all these internals[1] along with links to the > code and > >>> other references available. > >>> > >>> Cheers, > >>> > >>> Bob > >>> > >>> > >>> [1] http://bdionne.github.com/couchdb/ > >>> > >>>> That makes me all sorts of curious as to how Couch > >>>> updates/searches the new modified index with the small diff that is > >>> written > >>>> out. > >>>> > >>>> Any pointers to reading that will help me dig down on this (even early > >>> bugs > >>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on > >>>> Damien's blog to see if it wrote about it in depth and so far haven't > >>> found > >>>> anything as detailed as I am hoping for on this architecture. > >>>> > >>>> Best, > >>>> Riyad > >>>> > >>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana < > >>> fdman...@apache.org>wrote: > >>>> > >>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rka...@gmail.com> > wrote: > >>>>>> I've been reading everything I can find on the CouchDB file > format[1] > >>> and > >>>>>> am getting bits and pieces here and there, but not a great, > concrete, > >>>>>> step-by-step explanation of the process. > >>>>>> > >>>>>> I'm clear on the use of B+ trees and after reading a few papers on > the > >>>>>> benefits of log-structured file formats, I understand the benefits > of > >>>>>> inlining the B+ tree indices directly into the data file as well > >>>>> (locality > >>>>>> + sequential I/O)... what I'm flummoxed about is how much of the B+ > >>>>> tree's > >>>>>> index is rewritten after every modified document. > >>>>>> > >>>>>> Consider a CouchDB file that looks more or less like this: > >>>>>> > >>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2].... > >>>>>> > >>>>>> After each revised doc is written and the "b-tree root" is rewritten > >>>>> after > >>>>>> that, is that just a modified root node of the B+ tree or the entire > >>> B+ > >>>>>> tree? > >>>>>> > >>>>>> The reason I ask is because regardless of the answer to my previous > >>>>>> question, for a *huge* database will millions of records, that seems > >>> like > >>>>>> an enormous amount of data to rewrite after every modification. Say > >>> the > >>>>>> root node had a fanning factor of 133; that would still be alot of > >>> data > >>>>> to > >>>>>> rewrite. > >>>>> > >>>>> Hi Riyad, > >>>>> > >>>>> Have you observed that in practice? > >>>>> > >>>>> Typically the depth of database btrees is not that high even for > >>>>> millions of documents. For example I have one around with about 10 > >>>>> million documents which doesn't have more than 5 or 6 levels if I > >>>>> recall correctly. > >>>>> > >>>>> So updating a doc, for that particular case, means rewriting 5 or 6 > >>>>> new nodes plus the document itself. Each node is normally not much > >>>>> bigger than 1.2Kb. > >>>>> > >>>>> I've written once a tool to analyze database files which reports > btree > >>>>> depths, however it's not updated to work with recent changes on > >>>>> master/1.2.x such as snappy compression and btree sizes: > >>>>> > >>>>> https://github.com/fdmanana/couchfoo > >>>>> > >>>>> It should work with CouchDB 1.1 (and older) database files. > >>>>> > >>>>>> > >>>>>> I am certain I am missing the boat on this; if anyone can pull me > out > >>> of > >>>>>> the water and point me to dry land I'd appreciate it. > >>>>>> > >>>>>> Best, > >>>>>> R > >>>>>> > >>>>>> > >>>>>> > >>>>>> [1] > >>>>>> -- > >>>>>> > >>>>> > >>> > http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D > >>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html > >>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked > >>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html > >>>>>> -- http://ayende.com/blog/* (Over my head) > >>>>> > >>>>> > >>>>> > >>>>> -- > >>>>> Filipe David Manana, > >>>>> > >>>>> "Reasonable men adapt themselves to the world. > >>>>> Unreasonable men adapt the world to themselves. > >>>>> That's why all progress depends on unreasonable men." > >>>>> > >>> > >>> > >> > >