Randall, Spot on; we are on the same page now. I'll go through the post again tomorrow morning and reword it a bit so the thoughts aren't so fragmented.
Best, Riyad P.S.> Are there discussions anywhere on alternative file formats for the indices that the Couch community has considered in the past? (old JIRA or mailing list discussions) or has this file format been the only approach from the initial work started in 05/06? The reason I ask is because of the index fragmentation you mentioned that can occur over time... I'd be curious if an in-place-edited format for the indices as separate file(s) from the append-only document revisions would yield better performance over time. The splits would still be expensive, but you'd be able to pre-allocate your blocks for each node on each split to help a bit. Even leveraging an append-only, pre-allocated approach to the index might work where split nodes are appended to the end of the index file with the full node size (e.g. 133 elements) pre-allocated and the index marked ready for compaction or something. So it would be a hybrid approach... when splits aren't needed, you do an in-place edit to the index on-disk. If a split is needed, you use a technique similar to the one now where the effected node hierarchy is appended to the end of the file. You still run the risk of costly index rebuilds in the cast of a failed in-place edit, but you wouldn't run the risk of any data loss... I am just wondering for large data sets if this would yield some significant benefits putting Couch somewhere between a Mongo and Couch (performance wise). Just pie-in-the-sky thinking... I am sure the senior team here has talked this stuff to death years ago. My apologies if this is re-treading road covered in beaten horse bodies ;) I just find this category of problems interesting. On Thu, Dec 22, 2011 at 6:10 PM, Randall Leeds <randall.le...@gmail.com>wrote: > On Thu, Dec 22, 2011 at 12:11, Riyad Kalla <rka...@gmail.com> wrote: > > On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson <rnew...@apache.org> > wrote: > > > >> There are a > >> few parts of the article that are inaccurate (like the assertion we > >> have good locality for the id and seq trees. If this were true we > >> wouldn't have seen such a huge improvement in compaction by > >> temporarily separating them). > > > > > > I'd look forward to more detail on this... it was my understanding the > > updates were appended out in the [doc rev][_id idx diff][_seq idx diff] > > format at the end of the data file. Sounds like I may have misunderstood > > that? > > > > Riyad, as you pointed out earlier, all the inner nodes are rewritten > up to the root. The two btrees are not written in parallel, though, > which means that for deep trees all the updated nodes are written > before the other tree's nodes are written. Also remember that the > trees themselves end up pretty fragmented since older nodes that > haven't changed are back toward the beginning of the file. In general, > I'm not sure there's much that's useful to mention about locality in > the trees. Also, updating these trees requires reading the old values, > so there is still seeking that occurs (if the pages aren't cached by > the OS). > > > > >> The 'COMPETE recreation' paragraph also > >> strikes me as factually awry. > >> > > > > I'd appreciate a detailed correction on this if it is wrong; all the > > digging I've done (in this thread and other partial resources) suggests > > that the path from the changed doc ref back to the root (including a copy > > of all internal nodes and all of their child references) is written so as > > being able to read-back into the index from the tail of the data file > > quickly... specifically slides 17, 18 and 19 from this slidedeck ( > > http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed) -- note > that > > the interim nodes [A..M] and [A..Z] are rewritten (including any and all > > child pointers they contained). > > > > This is what I was referring to; I should either clean up my wording > > (confusing) or I got it wrong in which case I'd appreciate any and all > > corrections. > > Right. It mostly seems a bit confusing to me. > "it DOES NOT just rewrite the nodes pathing from the leaf to the node > and ONLY the connections for that single document" > That doesn't sound quite right, but I can tell what you're trying to > say is accurate. If I'm right, you mean that every changed inner node > is rewritten in its entirety rather than having a single pointer to > the new child updated in place. > > Cheers. Thanks for taking the time to write this up. > > -Randall >