It reads well as an article but needs some polish before it could be a great wiki page. I suggest that if it does go up, it be clearly marked as a draft and we all chip in to sculpt it into shape.
In particular, the author is very enthusiastic, but this mars the article (the all-caps, the ad-hoc methods of emphasis). There are a few parts of the article that are inaccurate (like the assertion that we have good locality for the id and seq trees. If this were true, we wouldn't have seen such a huge improvement in compaction by temporarily separating them). The 'COMPLETE recreation' paragraph also strikes me as factually awry. B. On 22 December 2011 18:52, Riyad Kalla <rka...@gmail.com> wrote: > Jan, > > Thank you and yes, I'd be happy to contribute this to the wiki. > > I made some edits early this morning after some feedback. If a few folks > in-the-know could give it a quick read-through to make sure I didn't get > anything wrong then I'm happy to work it up on the wiki as well (or send it > along to the wiki's editor for inclusion). > > Just point the way. > > Best, > Riyad > > On Thu, Dec 22, 2011 at 2:21 AM, Jan Lehnardt <j...@apache.org> wrote: > >> Good writeup! Would you consider contributing it to the CouchDB Wiki? >> >> http://wiki.apache.org/couchdb/ >> >> Cheers >> Jan >> -- >> >> On Dec 21, 2011, at 21:28 , Riyad Kalla wrote: >> >> > Bob, >> > >> > Really appreciate the link; Rick has a handful of articles that helped a >> > lot. >> > >> > Alongside all the CouchDB reading I've been looking at SSD-optimized >> data >> > storage mechanisms and tried to coalesce all of this information into >> this >> > post on Couch's file storage format: >> > https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv >> > >> > It is uncanny how many things Couch seems to have gotten right with >> regard >> > to existing storage systems and future flash-based storage systems. I'd >> > appreciate any corrections, additions or feedback to the post for anyone >> > interested. 
>> > >> > Best, >> > R >> > >> > On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne < >> > dio...@dionne-associates.com> wrote: >> > >> >> I think this is largely correct, Riyad. I dug out an old article[1] by >> Rick >> >> Ho that you may also find helpful, though it might be slightly dated. >> >> Generally the best performance will be had if the ids are sequential and >> >> updates are done in bulk. Write-heavy applications will eat up a lot of >> >> space and require compaction. At the leaf nodes what are stored are >> either >> >> full_doc_info records or doc_info records, which store pointers to the data, >> >> so the main things that impact the branching at each level are the key >> size >> >> and, in the case of views, the sizes of the reductions, as these are stored >> >> with the intermediate nodes. >> >> >> >> All in all it works pretty well, but as always you need to test and >> >> evaluate it for your specific case to see what the limits are. >> >> >> >> Regards, >> >> >> >> Bob >> >> >> >> >> >> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html >> >> >> >> >> >> >> >> >> >> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote: >> >> >> >>> Adding to this conversation, I found this set of slides by Chris >> >> explaining >> >>> the append-only index update format: >> >>> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed >> >>> >> >>> Specifically slides 16, 17 and 18. >> >>> >> >>> Using this example tree, rewriting the updated path (in reverse order) >> >>> appended to the end of the file makes sense... you see how index >> queries >> >>> can simply read backwards from the end of the file and not only find >> the >> >>> latest revisions of docs, but also every other doc that wasn't touched >> >> (it >> >>> will just seek into the existing inner nodes of the b+ tree for >> >> searching). >> >>> >> >>> What I am hoping for clarification on is the following pain-point that >> I >> >>> perceive with this approach: >> >>> >> >>> 1. 
In a sufficiently shallow B+ tree (like CouchDB), the paths >> themselves >> >>> to elements are short (typically no more than 3 to 5 levels deep) as >> >>> opposed to a trie or some other construct that would have much longer >> >> paths >> >>> to elements. >> >>> >> >>> 2. Because the depth of the tree is so shallow, the breadth of it >> becomes >> >>> large to compensate... more specifically, each internal node can have >> >> 100s, >> >>> 1000s or more children. Using the example slides, consider the nodes >> >>> [A...M] and [R...Z] -- in a good-sized CouchDB database, those internal >> >>> index nodes would have 100s (or more) elements in them pointing at >> deeper >> >>> internal nodes that themselves had thousands of elements; instead of >> the >> >> 13 >> >>> or so as implied by [A...M]. >> >>> >> >>> 3. Looking at slides 17 and 18, where you see the direct B+ tree path to >> >> the >> >>> update node getting appended to the end of the file after the revision >> is >> >>> written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies >> that >> >>> those internal nodes with *all* their child elements are getting >> >> rewritten >> >>> as well. >> >>> >> >>> In this example tree, it isn't such a big issue... but in a >> >> sufficiently >> >>> large CouchDB database, these nodes denoted by [A...M] and [A...Z] >> could >> >> be >> >>> quite large... I don't know the format of the node elements in the B+ >> >> tree, >> >>> but it would be whatever the size of a node is times however many >> >> elements >> >>> are contained at each level (1 for root, say 100 for level 2, 1000 for >> >>> level 3 and 10,000 for level 4 -- there is a lot of hand-waving going >> on >> >>> here, of course it depends on the size of the data store). >> >>> >> >>> Am I missing something or is CouchDB really rewriting that much index >> >>> information between document revisions on every update? 
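The worry in point 3 can be put into rough numbers. Here is a back-of-envelope sketch of what rewriting *fully populated* internal nodes on every update would cost; every figure (fan-outs, entry size) is hypothetical, mirroring the hand-waving in the text:

```python
# If each rewritten node really carried ALL of its children, the cost of
# one update would be the sum over the path of (fan-out x entry size).
# Hypothetical numbers: 1 root, 100 children at level 2, 1000 at level 3,
# 10,000 at level 4, with an assumed 40 bytes per key+offset entry.

KEY_POINTER_BYTES = 40                 # assumed size of one key+offset pair
fanouts = [100, 1000, 10000]           # entries in each rewritten node, root first

bytes_per_update = sum(f * KEY_POINTER_BYTES for f in fanouts)
print(f"~{bytes_per_update / 1024:.0f} KiB of index rewritten per single-doc update")
```

Hundreds of KiB per single-document update would indeed be abysmal, which is the intuition behind the question; the later replies show the real cost is far smaller because only the path, not the full child lists' subtrees, is copied.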
>> >>> >> >>> What was previously confusing me is I thought it was *only* rewriting a >> >>> direct path to the updated revision, like [B]>[E]>[J'] and Couch was >> >>> somehow patching in that updated path info to the B+ index at runtime. >> >>> >> >>> If Couch is rewriting entire node paths with all their elements then I am >> >>> no longer confused about the B+ index updates, but am curious about the >> >>> on-disk cost of this. >> >>> >> >>> In my own rough insertion testing, that would explain why I see my >> >>> collections absolutely explode in size until they are compacted (not >> >> using >> >>> bulk insert, but intentionally doing single inserts for millions of >> >>> docs to see what kind of cost the index path duplication would be like). >> >>> >> >>> Can anyone confirm/deny/correct this assessment? I want to make sure I am >> >>> on the right track understanding this. >> >>> >> >>> Best wishes, >> >>> Riyad >> >>> >> >>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rka...@gmail.com> wrote: >> >>> >> >>>> @Filipe - I was just not clear on how CouchDB operated; you and Robert >> >>>> cleared that up for me. Thank you. >> >>>> >> >>>> @Robert - The writeup is excellent so far (I am not familiar with >> >> erlang, >> >>>> so there is a bit of stickiness there), thank you for taking the time >> to >> >>>> put this together! >> >>>> >> >>>> At this point I am curious how the _id and _seq indices are read as >> >> their >> >>>> data is continually appended to the end of the data file in small >> >>>> diff-trees for every updated doc. >> >>>> >> >>>> If CouchDB kept all the indices in-memory and simply patched-in the >> >>>> updated paths at runtime (maybe something akin to memory-mapped >> indices >> >> in >> >>>> MongoDB) I would be fairly clear on the operation... 
but as I >> understand >> >>>> it, CouchDB keeps such a small memory footprint by doing no in-memory >> >>>> caching and relying on the intelligence of the OS and filesystem >> (and/or >> >>>> drives) to cache frequently accessed data. >> >>>> >> >>>> I am trying to understand the logic used by CouchDB to answer a query >> >>>> using the index once updates to the tree have been appended to the >> data >> >>>> file... for example, consider a CouchDB datastore like the one Filipe >> >>>> has... 10 million documents and let's say it is freshly compacted. >> >>>> >> >>>> If I send in a request to that Couch instance, it hits the header of >> the >> >>>> data file along with the index and walks the B+ tree to the leaf node, >> >>>> where it finds the offset into the data file where the actual doc >> >> lives... >> >>>> let's say 1,000,000 bytes away. >> >>>> >> >>>> These B+ trees are shallow, so it might look something like this: >> >>>> >> >>>> Level 1: 1 node, root node. >> >>>> Level 2: 100 nodes, inner child nodes >> >>>> Level 3: 10,000 nodes, inner child nodes >> >>>> Level 4: 1,000,000 leaf nodes... all with pointers to the data >> offsets >> >> in >> >>>> the data file. >> >>>> >> >>>> Now let's say I write 10 updates to documents in that file. There are >> 10 >> >>>> new revisions appended to the end of the data file, *each one* >> separated >> >> by >> >>>> a rewritten B+ path to a leaf node with its new location at the end >> of >> >> the >> >>>> file. Each of those paths written between each doc revision (say >> >> roughly 2k >> >>>> like Filipe mentioned) are just 4-item paths... root -> level1 -> >> >> level2 -> >> >>>> level3 --> level4... showing the discrete path from the root to that >> >>>> individual updated doc. The intermediary levels (l1, 2, 3) are not >> fully >> >>>> flushed out with all the OTHER children from the original b+ tree >> index. >> >>>> 
>> >>>> [[ is this correct so far? If not, please point out my mistakes... ]] 
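The scheme being described here can be sketched in a few lines. This is a toy model, not CouchDB's actual Erlang internals: the record layout is invented, and a Python list stands in for the append-only file with list indexes playing the role of byte offsets. The point it illustrates is that each rewritten node on the path stores *offsets* to unchanged nodes, so untouched siblings are referenced rather than copied:

```python
# Toy model of an append-only B+ tree update. Rewritten nodes reference
# unchanged nodes by offset, so an update appends only the root-to-leaf
# path (plus the new doc and header), never whole subtrees.

storage = []                          # the append-only "file"

def append(record):
    storage.append(record)
    return len(storage) - 1           # the record's "offset"

# freshly "compacted" tree: two leaves, one root, one header
leaf_ab = append({"type": "leaf", "docs": {"a": "doc_a_v1", "b": "doc_b_v1"}})
leaf_cd = append({"type": "leaf", "docs": {"c": "doc_c_v1", "d": "doc_d_v1"}})
root = append({"type": "node", "children": {"b": leaf_ab, "d": leaf_cd}})
header = append({"type": "header", "root": root})

# update doc "c": append a new leaf, a new root, and a new header;
# the new root still points at the OLD offset of the untouched leaf_ab
new_cd = append({"type": "leaf", "docs": {"c": "doc_c_v2", "d": "doc_d_v1"}})
new_root = append({"type": "node", "children": {"b": leaf_ab, "d": new_cd}})
header = append({"type": "header", "root": new_root})

def lookup(key):
    """Walk the tree from the latest header; old and new nodes mix freely."""
    node = storage[storage[header]["root"]]
    while node["type"] != "leaf":
        # descend into the first child whose separator key covers `key`
        offset = next(off for sep, off in sorted(node["children"].items()) if key <= sep)
        node = storage[offset]
    return node["docs"][key]

print(lookup("c"))   # updated doc, found via the newly appended path
print(lookup("a"))   # untouched doc, found via the old leaf
```

Because every query starts from the most recent header/root, there is nothing to "patch in" at runtime: the latest root transparently mixes new node offsets with old ones.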
>> >>>> >> >>>> Now I issue a query for a document that WAS NOT updated... >> >>>> >> >>>> **** this is where I get confused on the logic *** >> >>>> >> >>>> this would mean I need to access the original B+ tree index at the >> root >> >> of >> >>>> the data file, because the revised B+ paths that are written between >> >> each >> >>>> of the updated doc revisions at the end of the file are not full >> >> indices. >> >>>> >> >>>> NOW consider I want to query for one of the changed docs... now I >> >> suddenly >> >>>> need to scan backwards from the data file's end to find the updated >> >> path to >> >>>> the new revision of that document. >> >>>> >> >>>> (obviously) this isn't what Couch is actually doing... it's doing >> >>>> something more elegant, I just can't figure out what or how and that >> is >> >>>> what I was hoping for help with. >> >>>> >> >>>> Much thanks guys, I know this is a heavy question to ask. >> >>>> >> >>>> Best wishes, >> >>>> R >> >>>> >> >>>> >> >>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne < >> >>>> dio...@dionne-associates.com> wrote: >> >>>> >> >>>>> >> >>>>> Robert Dionne >> >>>>> Computer Programmer >> >>>>> dio...@dionne-associates.com >> >>>>> 203.231.9961 >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote: >> >>>>> >> >>>>>> Filipe, >> >>>>>> >> >>>>>> Thank you for the reply. >> >>>>>> >> >>>>>> Maybe I am misunderstanding exactly what couch is writing out; the >> >> docs >> >>>>>> I've read say that it "rewrites the root node" -- I can't tell if >> the >> >>>>> docs >> >>>>>> mean the parent node of the child doc that was changed (as one of >> the >> >> b+ >> >>>>>> leaves) or if it means the direct path, from the root, to that >> >>>>> individual >> >>>>>> doc... or if it means the *entire* index... 
>> >>>>>> >> >>>>>> In the case of even rewriting the single parent, with such a shallow >> >>>>> tree, >> >>>>>> each internal node will have a huge fan-out of children; let's say 1-10k >> in a >> >>>>>> decent sized data set. >> >>>>>> >> >>>>>> If you are seeing a few K of extra written out after each changed >> doc >> >>>>> then >> >>>>>> that cannot be right... I almost assumed my understanding was wrong >> >>>>> because >> >>>>>> the sheer volume of data would make performance abysmal if it were >> true. >> >>>>>> >> >>>>>> Given that... is it just the changed path, from the root to the new >> leaf >> >>>>>> that is rewritten? >> >>>>> >> >>>>> Hi Riyad, >> >>>>> >> >>>>> You are correct, it's only the changed path. Interestingly I've just >> >>>>> started to document all these internals[1] along with links to the >> >> code and >> >>>>> other references available. >> >>>>> >> >>>>> Cheers, >> >>>>> >> >>>>> Bob >> >>>>> >> >>>>> >> >>>>> [1] http://bdionne.github.com/couchdb/ >> >>>>> >> >>>>>> That makes me all sorts of curious as to how Couch >> >>>>>> updates/searches the new modified index with the small diff that is >> >>>>> written >> >>>>>> out. >> >>>>>> >> >>>>>> Any pointers to reading that will help me dig down on this (even >> early >> >>>>> bugs >> >>>>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08 >> on >> >>>>>> Damien's blog to see if he wrote about it in depth and so far >> haven't >> >>>>> found >> >>>>>> anything as detailed as I am hoping for on this architecture. 
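One consequence of "only the changed path" worth noting: since the newest header/root is always near the end of the file, opening the database (or recovering after a crash) reduces to scanning backwards for the last intact header. A hedged sketch of that recovery scan follows; the record layout and `valid` flag are invented for illustration (CouchDB does something in this spirit, looking for headers at block boundaries):

```python
# Sketch of crash recovery in an append-only file: scan backwards from
# the end until the most recent intact header is found. A torn write at
# the tail simply means falling back to the previous good header; no
# data already referenced by that header is ever lost.

def find_latest_header(records):
    """records: the file's contents as a list of dicts, oldest first."""
    for rec in reversed(records):
        if rec.get("type") == "header" and rec.get("valid", True):
            return rec
    raise IOError("no intact header found")

recs = [
    {"type": "header", "root": 2},
    {"type": "doc", "id": "a"},
    {"type": "header", "root": 3},
    {"type": "doc", "id": "b"},
    {"type": "header", "root": 9, "valid": False},   # torn write at crash
]
print(find_latest_header(recs)["root"])              # skips the torn header
```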
>> >>>>>> Best, >> >>>>>> Riyad >> >>>>>> >> >>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana < >> >>>>> fdman...@apache.org> wrote: >> >>>>>> >> >>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rka...@gmail.com> >> >> wrote: >> >>>>>>>> I've been reading everything I can find on the CouchDB file >> >> format[1] >> >>>>> and >> >>>>>>>> am getting bits and pieces here and there, but not a great, >> >> concrete, >> >>>>>>>> step-by-step explanation of the process. >> >>>>>>>> >> >>>>>>>> I'm clear on the use of B+ trees and after reading a few papers on >> >> the >> >>>>>>>> benefits of log-structured file formats, I understand the benefits >> >> of >> >>>>>>>> inlining the B+ tree indices directly into the data file as well >> >>>>>>> (locality >> >>>>>>>> + sequential I/O)... what I'm flummoxed about is how much of the >> B+ >> >>>>>>> tree's >> >>>>>>>> index is rewritten after every modified document. >> >>>>>>>> >> >>>>>>>> Consider a CouchDB file that looks more or less like this: >> >>>>>>>> >> >>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2].... >> >>>>>>>> >> >>>>>>>> After each revised doc is written and the "b-tree root" is >> rewritten >> >>>>>>> after >> >>>>>>>> that, is that just a modified root node of the B+ tree or the >> entire >> >>>>> B+ >> >>>>>>>> tree? >> >>>>>>>> >> >>>>>>>> The reason I ask is because regardless of the answer to my >> previous >> >>>>>>>> question, for a *huge* database with millions of records, that >> seems >> >>>>> like >> >>>>>>>> an enormous amount of data to rewrite after every modification. >> Say >> >>>>> the >> >>>>>>>> root node had a fanning factor of 133; that would still be a lot of >> >>>>> data >> >>>>>>> to >> >>>>>>>> rewrite. >> >>>>>>> >> >>>>>>> Hi Riyad, >> >>>>>>> >> >>>>>>> Have you observed that in practice? >> >>>>>>> >> >>>>>>> Typically the depth of database btrees is not that high even for >> >>>>>>> millions of documents. 
For example I have one around with about 10 >> >>>>>>> million documents which doesn't have more than 5 or 6 levels if I >> >>>>>>> recall correctly. >> >>>>>>> >> >>>>>>> So updating a doc, for that particular case, means rewriting 5 or 6 >> >>>>>>> new nodes plus the document itself. Each node is normally not much >> >>>>>>> bigger than 1.2Kb. >> >>>>>>> >> >>>>>>> I once wrote a tool to analyze database files which reports >> btree >> >>>>>>> depths; however, it's not updated to work with recent changes on >> >>>>>>> master/1.2.x such as snappy compression and btree sizes: >> >>>>>>> >> >>>>>>> https://github.com/fdmanana/couchfoo >> >>>>>>> >> >>>>>>> It should work with CouchDB 1.1 (and older) database files. >> >>>>>>> >> >>>>>>>> >> >>>>>>>> I am certain I am missing the boat on this; if anyone can pull me >> out >> >>>>> of >> >>>>>>>> the water and point me to dry land I'd appreciate it. >> >>>>>>>> >> >>>>>>>> Best, >> >>>>>>>> R >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> [1] >> >>>>>>>> -- >> >>>>>>>> >> >>>>>>> >> >>>>> >> >> >> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D >> >>>>>>>> -- >> http://horicky.blogspot.com/2008/10/couchdb-implementation.html >> >>>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked >> >>>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html >> >>>>>>>> -- http://ayende.com/blog/* (Over my head) >> >>>>>>> >> >>>>>>> >> >>>>>>> -- >> >>>>>>> Filipe David Manana, >> >>>>>>> >> >>>>>>> "Reasonable men adapt themselves to the world. >> >>>>>>> Unreasonable men adapt the world to themselves. >> >>>>>>> That's why all progress depends on unreasonable men." >> >>>>> >> >>>>> >> >>>> >> >> >> >> >> >>
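Filipe's figures make the earlier size question concrete. Below is a worked version of the arithmetic, treating the quoted ~1.2 Kb node size and 5-6 level depth as given; the million-update extrapolation at the end is a rough illustration (my own, not from the thread's measurements) of why files with many single-doc updates balloon until compaction:

```python
# Per-update index overhead = btree depth x node size: each update
# appends a fresh copy of every node on the root-to-leaf path.
# Numbers are the approximate ones quoted in the thread.

NODE_BYTES = 1.2 * 1024          # ~1.2 Kb per rewritten btree node

for depth in (5, 6):
    per_update = depth * NODE_BYTES
    print(f"depth {depth}: ~{per_update / 1024:.1f} KiB of index appended per update")

# rough extrapolation: why single-doc inserts "explode" before compaction
updates = 1_000_000
total = updates * 6 * NODE_BYTES
print(f"{updates:,} single-doc updates: ~{total / 2**30:.1f} GiB of mostly-stale index data")
```

A few KiB per update is cheap for any one write, but the superseded copies of path nodes accumulate in the file until compaction rewrites it with a single live tree, which matches Riyad's observation about collections exploding in size under single-insert workloads.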