Thank you Robert, fixed.

On Wed, Dec 21, 2011 at 1:42 PM, Robert Dionne <[email protected]> wrote:
> Riyad,
>
> You're welcome. At a quick glance your post has one error: internal
> nodes do contain values (from the reductions). The appendix in the
> CouchDB book also makes this error[1], which I've opened a ticket for.
>
> Cheers,
>
> Bob
>
> [1] https://github.com/oreilly/couchdb-guide/issues/450
>
> On Dec 21, 2011, at 3:28 PM, Riyad Kalla wrote:
>
> > Bob,
> >
> > Really appreciate the link; Rick has a handful of articles that
> > helped a lot.
> >
> > Alongside all the CouchDB reading I've been looking at SSD-optimized
> > data storage mechanisms and tried to coalesce all of this information
> > into this post on Couch's file storage format:
> > https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv
> >
> > It is uncanny how many things Couch seems to have gotten right with
> > regard to existing storage systems and future flash-based storage
> > systems. I'd appreciate any corrections, additions or feedback on the
> > post from anyone interested.
> >
> > Best,
> > R
> >
> > On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne
> > <[email protected]> wrote:
> >
> >> I think this is largely correct, Riyad. I dug out an old article[1]
> >> by Rick Ho that you may also find helpful, though it might be
> >> slightly dated. Generally the best performance will be had if the
> >> ids are sequential and updates are done in bulk. Write-heavy
> >> applications will eat up a lot of space and require compaction. At
> >> the leaf nodes what are stored are either full_doc_info records or
> >> doc_info records, which store pointers to the data, so the main
> >> things that impact the branching at each level are the key size and,
> >> in the case of views, the sizes of the reductions, as these are
> >> stored with the intermediate nodes.
> >>
> >> All in all it works pretty well, but as always you need to test and
> >> evaluate it for your specific case to see what the limits are.
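Bob's description of the node contents above (doc_info records at the leaves, reduction values stored alongside the child pointers in the inner nodes) can be sketched roughly as follows. This is a hypothetical Python model for illustration only, not CouchDB's actual Erlang record layout; the field names and the count-style reduction are assumptions:

```python
# Hypothetical model (NOT CouchDB's actual Erlang records) of the node
# shapes described above: leaves hold doc-info records pointing into
# the file; inner nodes carry a reduction value next to each child.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DocInfo:
    doc_id: str
    rev: int
    body_offset: int      # where the document body lives in the file

@dataclass
class LeafNode:
    entries: List[DocInfo] = field(default_factory=list)

@dataclass
class InnerNode:
    keys: List[str]       # separator keys; key size affects fan-out
    children: List[int]   # file offsets of child nodes
    reductions: List[int] # per-subtree reduction values, stored inline

# A count-style reduction over a whole subtree is answered straight
# from the inner node, without visiting any leaves:
inner = InnerNode(keys=["m"], children=[4096, 8192], reductions=[120, 85])
total_docs = sum(inner.reductions)
print(total_docs)  # 205
```

This is also why reduction size affects branching, as Bob notes: the bigger each stored reduction value, the fewer entries fit in an inner node.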
> >> Regards,
> >>
> >> Bob
> >>
> >> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>
> >> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
> >>
> >>> Adding to this conversation, I found this set of slides by Chris
> >>> explaining the append-only index update format:
> >>> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
> >>>
> >>> Specifically slides 16, 17 and 18.
> >>>
> >>> Using this example tree, rewriting the updated path (in reverse
> >>> order) appended to the end of the file makes sense... you see how
> >>> index queries can simply read backwards from the end of the file
> >>> and not only find the latest revisions of docs, but also every
> >>> other doc that wasn't touched (it will just seek into the existing
> >>> inner nodes of the b+ tree for searching).
> >>>
> >>> What I am hoping for clarification on is the following pain point
> >>> that I perceive with this approach:
> >>>
> >>> 1. In a sufficiently shallow B+ tree (like CouchDB's), the paths to
> >>> elements are short (typically no more than 3 to 5 levels deep), as
> >>> opposed to a trie or some other construct that would have much
> >>> longer paths to elements.
> >>>
> >>> 2. Because the depth of the tree is so shallow, its breadth becomes
> >>> large to compensate... more specifically, each internal node can
> >>> have 100s, 1000s or more children. Using the example slides,
> >>> consider the nodes [A...M] and [R...Z] -- in a good-sized CouchDB
> >>> database, those internal index nodes would have 100s (or more)
> >>> elements in them pointing at deeper internal nodes that themselves
> >>> had thousands of elements, instead of the 13 or so implied by
> >>> [A...M].
> >>>
> >>> 3. Looking at slides 17 and 18, where you see the direct B+ tree
> >>> path to the updated node getting appended to the end of the file
> >>> after the revision is written (leaf-to-root ordering:
> >>> [J' M] -> [A M] -> [A Z]), it implies that those internal nodes,
> >>> with *all* their child elements, are getting rewritten as well.
> >>>
> >>> In this example tree it isn't such a big issue... but in a
> >>> sufficiently large CouchDB database, the nodes denoted by [A...M]
> >>> and [A...Z] could be quite large... I don't know the format of the
> >>> node elements in the B+ tree, but it would be whatever the size of
> >>> a node is times however many elements are contained at each level
> >>> (1 for root, say 100 for level 2, 1000 for level 3 and 10,000 for
> >>> level 4 -- there is a lot of hand-waving going on here; of course
> >>> it depends on the size of the data store).
> >>>
> >>> Am I missing something, or is CouchDB really rewriting that much
> >>> index information between document revisions on every update?
> >>>
> >>> What was previously confusing me is that I thought it was *only*
> >>> rewriting a direct path to the updated revision, like [B]>[E]>[J'],
> >>> and Couch was somehow patching that updated path info into the B+
> >>> index at runtime.
> >>>
> >>> If Couch is rewriting entire node paths with all their elements
> >>> then I am no longer confused about the B+ index updates, but am
> >>> curious about the on-disk cost of this.
> >>>
> >>> In my own rough insertion testing, that would explain why I see my
> >>> collections absolutely explode in size until they are compacted
> >>> (not using bulk insert, but intentionally doing single inserts for
> >>> millions of docs to see what kind of cost the index path
> >>> duplication would be like).
> >>>
> >>> Can anyone confirm/deny/correct this assessment? I want to make
> >>> sure I am on the right track understanding this.
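The on-disk cost being estimated here can be put into rough numbers. The sketch below uses made-up figures (entry size and fan-outs are assumptions, not measured CouchDB values); it only illustrates the point that appending a fresh copy of every node on the root-to-leaf path costs the sum of those nodes' full sizes per update, not just the size of the one changed entry:

```python
# Back-of-envelope cost of rewriting a full root-to-leaf path
# (assumed figures, not measured CouchDB numbers): each appended node
# copy carries ALL of that node's entries, not just the changed one.
ENTRY_BYTES = 20                       # assumed size of one key+pointer pair

# Fan-out (entry count) of the node at each level, root first; a
# 4-level tree with fan-out 100 addresses ~100M leaf entries.
path_fanouts = [100, 100, 100, 100]

bytes_per_update = sum(fanout * ENTRY_BYTES for fanout in path_fanouts)
print(bytes_per_update)   # 8000 bytes of index appended per single-doc update
```

Under these assumptions a single-document update appends a few KB of index data, which is consistent with the growth-until-compaction behavior described above.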
> >>> Best wishes,
> >>> Riyad
> >>>
> >>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <[email protected]> wrote:
> >>>
> >>>> @Filipe - I was just not clear on how CouchDB operated; you and
> >>>> Robert cleared that up for me. Thank you.
> >>>>
> >>>> @Robert - The writeup is excellent so far (I am not familiar with
> >>>> Erlang, so there is a bit of stickiness there); thank you for
> >>>> taking the time to put this together!
> >>>>
> >>>> At this point I am curious how the _id and _seq indices are read,
> >>>> as their data is continually appended to the end of the data file
> >>>> in small diff-trees for every updated doc.
> >>>>
> >>>> If CouchDB kept all the indices in memory and simply patched in
> >>>> the updated paths at runtime (maybe something akin to
> >>>> memory-mapped indices in MongoDB) I would be fairly clear on the
> >>>> operation... but as I understand it, CouchDB keeps such a small
> >>>> memory footprint by doing no in-memory caching and relying on the
> >>>> intelligence of the OS and filesystem (and/or drives) to cache
> >>>> frequently accessed data.
> >>>>
> >>>> I am trying to understand the logic used by CouchDB to answer a
> >>>> query using the index once updates to the tree have been appended
> >>>> to the data file... for example, consider a CouchDB datastore like
> >>>> the one Filipe has... 10 million documents, and let's say it is
> >>>> freshly compacted.
> >>>>
> >>>> If I send in a request to that Couch instance, it hits the header
> >>>> of the data file along with the index and walks the B+ tree to the
> >>>> leaf node, where it finds the offset into the data file where the
> >>>> actual doc lives... let's say 1,000,000 bytes away.
> >>>>
> >>>> These B+ trees are shallow, so it might look something like this:
> >>>>
> >>>> Level 1: 1 node, root node.
> >>>> Level 2: 100 nodes, inner child nodes
> >>>> Level 3: 10,000 nodes, inner child nodes
> >>>> Level 4: 1,000,000 leaf nodes... all with pointers to the data
> >>>> offsets in the data file.
> >>>>
> >>>> Now let's say I write 10 updates to documents in that file. There
> >>>> are 10 new revisions appended to the end of the data file, *each
> >>>> one* separated by a rewritten B+ path to a leaf node with its new
> >>>> location at the end of the file. Each of those paths written
> >>>> between each doc revision (say roughly 2k, like Filipe mentioned)
> >>>> is just a 4-item path... root -> level 2 -> level 3 -> level 4...
> >>>> showing the discrete path from the root to that individual updated
> >>>> doc. The intermediary levels (1, 2, 3) are not fully fleshed out
> >>>> with all the OTHER children from the original B+ tree index.
> >>>>
> >>>> [is this correct so far? If not, please point out my mistakes...]
> >>>>
> >>>> Now I issue a query for a document that WAS NOT updated...
> >>>>
> >>>> **** this is where I get confused on the logic ****
> >>>>
> >>>> This would mean I need to access the original B+ tree index at the
> >>>> root of the data file, because the revised B+ paths that are
> >>>> written between each of the updated doc revisions at the end of
> >>>> the file are not full indices.
> >>>>
> >>>> NOW consider I want to query for one of the changed docs... now I
> >>>> suddenly need to scan backwards from the data file's end to find
> >>>> the updated path to the new revision of that document.
> >>>>
> >>>> (Obviously) this isn't what Couch is actually doing... it's doing
> >>>> something more elegant, I just can't figure out what or how, and
> >>>> that is what I was hoping for help with.
> >>>>
> >>>> Much thanks guys, I know this is a heavy question to ask.
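For what it's worth, the "more elegant" mechanism asked about here is copy-on-write: each commit appends the rewritten path and then a header pointing at the new root, and the rewritten nodes reference every untouched subtree at its old file offset. A lookup therefore walks one complete, consistent tree from the most recent header; it never scans backwards through per-update diffs. A toy in-memory sketch (list indices stand in for file offsets; this is an illustration of the idea, not CouchDB's actual on-disk format):

```python
# Toy in-memory model of copy-on-write B+ tree updates (list indices
# stand in for file offsets; not the real CouchDB on-disk format).
storage = []                               # append-only "file" of nodes

def append(node):
    storage.append(node)
    return len(storage) - 1                # the node's "offset"

# Initial tree: two leaves plus a root, then a header naming the root.
l0 = append({"leaf": {"a": "v1", "b": "v1"}})
l1 = append({"leaf": {"x": "v1", "y": "v1"}})
root0 = append({"inner": [("m", l0), ("~", l1)]})
header0 = append({"root": root0})

# Update doc "x": only the touched leaf and the root are rewritten;
# the untouched leaf l0 is reused by its old offset.
l1b = append({"leaf": {"x": "v2", "y": "v1"}})
root1 = append({"inner": [("m", l0), ("~", l1b)]})
header1 = append({"root": root1})

def lookup(header, key):
    """Walk one complete tree from whatever root the header names."""
    node = storage[storage[header]["root"]]
    while "inner" in node:
        for sep, off in node["inner"]:
            if key <= sep:
                node = storage[off]
                break
    return node["leaf"].get(key)

print(lookup(header1, "a"))   # v1 -- untouched doc, via the reused leaf
print(lookup(header1, "x"))   # v2 -- updated doc, via the new path
```

Note that `header0` remains valid after the update, which is also what makes the append-only design crash-safe: a reader can always fall back to the last fully written header.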
> >>>> Best wishes,
> >>>> R
> >>>>
> >>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne
> >>>> <[email protected]> wrote:
> >>>>
> >>>>> Robert Dionne
> >>>>> Computer Programmer
> >>>>> [email protected]
> >>>>> 203.231.9961
> >>>>>
> >>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
> >>>>>
> >>>>>> Filipe,
> >>>>>>
> >>>>>> Thank you for the reply.
> >>>>>>
> >>>>>> Maybe I am misunderstanding exactly what Couch is writing out;
> >>>>>> the docs I've read say that it "rewrites the root node" -- I
> >>>>>> can't tell if the docs mean the parent node of the child doc
> >>>>>> that was changed (as one of the B+ leaves), or if it means the
> >>>>>> direct path, from the root, to that individual doc... or if it
> >>>>>> means the *entire* index...
> >>>>>>
> >>>>>> In the case of even rewriting the single parent, with such a
> >>>>>> shallow tree, each internal node will have a huge fan of
> >>>>>> children; let's say 1-10k in a decent-sized data set.
> >>>>>>
> >>>>>> If you are seeing a few K of extra data written out after each
> >>>>>> changed doc then that cannot be right... I almost assumed my
> >>>>>> understanding was wrong because the sheer volume of data would
> >>>>>> make performance abysmal if it were true.
> >>>>>>
> >>>>>> Given that... is it just the changed path, from the root to the
> >>>>>> new leaf, that is rewritten?
> >>>>>
> >>>>> Hi Riyad,
> >>>>>
> >>>>> You are correct, it's only the changed path. Interestingly, I've
> >>>>> just started to document all these internals[1], along with links
> >>>>> to the code and other references available.
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Bob
> >>>>>
> >>>>> [1] http://bdionne.github.com/couchdb/
> >>>>>
> >>>>>> That makes me all sorts of curious as to how Couch
> >>>>>> updates/searches the newly modified index with the small diff
> >>>>>> that is written out.
> >>>>>>
> >>>>>> Any pointers to reading that will help me dig down on this (even
> >>>>>> early bugs in JIRA?) would be appreciated. I've tried skimming
> >>>>>> back to 2007/08 on Damien's blog to see if he wrote about it in
> >>>>>> depth, and so far I haven't found anything as detailed as I am
> >>>>>> hoping for on this architecture.
> >>>>>>
> >>>>>> Best,
> >>>>>> Riyad
> >>>>>>
> >>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <[email protected]> wrote:
> >>>>>>>> I've been reading everything I can find on the CouchDB file
> >>>>>>>> format[1] and am getting bits and pieces here and there, but
> >>>>>>>> not a great, concrete, step-by-step explanation of the
> >>>>>>>> process.
> >>>>>>>>
> >>>>>>>> I'm clear on the use of B+ trees, and after reading a few
> >>>>>>>> papers on the benefits of log-structured file formats, I
> >>>>>>>> understand the benefits of inlining the B+ tree indices
> >>>>>>>> directly into the data file as well (locality + sequential
> >>>>>>>> I/O)... what I'm flummoxed about is how much of the B+ tree's
> >>>>>>>> index is rewritten after every modified document.
> >>>>>>>>
> >>>>>>>> Consider a CouchDB file that looks more or less like this:
> >>>>>>>>
> >>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
> >>>>>>>>
> >>>>>>>> After each revised doc is written and the "b-tree root" is
> >>>>>>>> rewritten after that, is that just a modified root node of the
> >>>>>>>> B+ tree, or the entire B+ tree?
> >>>>>>>>
> >>>>>>>> The reason I ask is because regardless of the answer to my
> >>>>>>>> previous question, for a *huge* database with millions of
> >>>>>>>> records, that seems like an enormous amount of data to rewrite
> >>>>>>>> after every modification.
> >>>>>>>> Say the root node had a fanning factor of 133; that would
> >>>>>>>> still be a lot of data to rewrite.
> >>>>>>>
> >>>>>>> Hi Riyad,
> >>>>>>>
> >>>>>>> Have you observed that in practice?
> >>>>>>>
> >>>>>>> Typically the depth of database btrees is not that high, even
> >>>>>>> for millions of documents. For example, I have one around with
> >>>>>>> about 10 million documents which doesn't have more than 5 or 6
> >>>>>>> levels, if I recall correctly.
> >>>>>>>
> >>>>>>> So updating a doc, for that particular case, means rewriting 5
> >>>>>>> or 6 new nodes plus the document itself. Each node is normally
> >>>>>>> not much bigger than 1.2Kb.
> >>>>>>>
> >>>>>>> I once wrote a tool to analyze database files which reports
> >>>>>>> btree depths; however, it's not updated to work with recent
> >>>>>>> changes on master/1.2.x such as snappy compression and btree
> >>>>>>> sizes:
> >>>>>>>
> >>>>>>> https://github.com/fdmanana/couchfoo
> >>>>>>>
> >>>>>>> It should work with CouchDB 1.1 (and older) database files.
> >>>>>>>
> >>>>>>>> I am certain I am missing the boat on this; if anyone can pull
> >>>>>>>> me out of the water and point me to dry land I'd appreciate
> >>>>>>>> it.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> R
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>> -- http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
> >>>>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
> >>>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
> >>>>>>>> -- http://ayende.com/blog/* (Over my head)
> >>>>>>>
> >>>>>>> --
> >>>>>>> Filipe David Manana,
> >>>>>>>
> >>>>>>> "Reasonable men adapt themselves to the world.
> >>>>>>> Unreasonable men adapt the world to themselves.
> >>>>>>> That's why all progress depends on unreasonable men."
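Putting Filipe's figures together (his numbers, back-of-envelope arithmetic): at 5-6 levels deep and roughly 1.2Kb per rewritten node, each single-document update appends a few Kb of index data, which explains why single-insert-heavy workloads balloon until compaction while bulk updates amortize the path rewrites across many documents:

```python
# Filipe's figures from the thread, back-of-envelope arithmetic:
# 5-6 levels deep, each rewritten node roughly 1.2 Kb.
depth = 6
node_kb = 1.2

overhead_kb = depth * node_kb
print(round(overhead_kb, 1))   # 7.2 Kb of index data appended per update

# After 1,000,000 single-doc updates, before compaction:
total_gb = overhead_kb * 1_000_000 / (1024 * 1024)
print(round(total_gb, 1))      # ~6.9 Gb of superseded index nodes
```

The superseded nodes are reclaimed by compaction, which copies only the live tree into a fresh file; this matches the "collections explode in size until compacted" observation earlier in the thread.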
