Forgot to post a link to Benoit's big commit for that thing: https://github.com/refuge/couch_core/commit/14ad584c993e7a25c000d8db905fd0d0a88a24b4
On Fri, Mar 15, 2013 at 4:59 PM, Paul Davis <paul.joseph.da...@gmail.com> wrote:
> So the biggest thing here is that this won't work in a cluster. The issue
> being that each step in the view chain needs to run across the cluster, so
> putting this into every view updater is going to break things there.
>
> That said, Benoît has a feature in rcouch that we're going to be working on
> getting back into mainline CouchDB that will enable this exact feature. The
> basic idea is to give each view an update seq index so that you can more or
> less have a _changes feed based off a view.
>
> This way, writing something like Cloudant's dbcopy will be a relatively
> straightforward enhancement to stream views by update sequence, either
> through a new updater directly or into a second database like dbcopy.
>
> Some general comments on the branch you've posted, though. First, don't
> merge master into topic branches without a very specific reason. If you
> just want to make sure you're not going to introduce crazy conflicts, you
> should rebase your work frequently. This not only makes history easier to
> read and reason about but also avoids some crazy unintended interactions
> with various merge combinations.
>
> Second, I'm a big fan of refactoring and rewriting things to try and
> understand them or learn a new language, but you should remove those sorts
> of things when you want someone to review a patch. I spent a considerable
> amount of time reading diffs just to end up having to convince myself that
> they weren't a behavior change.
>
> There's also a lot of debug logging left in here that would need to be
> removed before consideration. My general workflow for features of this
> size is to sit down and write it as best I can, then make a commit before
> compiling, then I make it compile and add that as a commit, and then I'll
> add debugging as I suss out runtime issues and do a "git add -p" to only
> pull in the bits I mean to commit.
>
> All in all this looks like a pretty good stab at the intended feature. I'd
> definitely keep with the Erlang if you're interested. Definitely feel free
> to ask if you need any help with things.
>
> On Fri, Mar 15, 2013 at 3:47 PM, nicholas a. evans <n...@ekenosen.net> wrote:
>> Yes, there is *totally* stuff I need assistance with. :) This was my first
>> attempt at learning Erlang, and I worked myself into a couple of dead ends
>> and blindly coded myself into a few corners. It's been a few weeks since I
>> looked at this, so it would take a little while for me to get back up to
>> speed on it myself. I had at least two really big questions or bugs that
>> were leaving me stumped... but I don't remember what they were. ;) I'd
>> need to scan through to remember what was working and what was broken and
>> what was simply unimplemented when I left off.
>>
>> I *finally* convinced my bosses to give Cloudant a chance (which we have
>> happily switched over to during the last few weeks), and we're going to be
>> using their dbcopy implementation of incremental iterative map reduce.
>> That removed both the urgency and (more importantly) the ability to work
>> on it during business hours. But I *am* interested in getting this (or
>> something very like it) into core CouchDB.
>>
>> --
>> Nick
>>
>> On Fri, Mar 15, 2013 at 3:58 PM, Paul Davis <paul.joseph.da...@gmail.com> wrote:
>>> Definitely been meaning to provide a review for this. I'll try and get
>>> to it in a few hours.
>>>
>>> On Fri, Mar 15, 2013 at 2:36 PM, Noah Slater <nsla...@apache.org> wrote:
>>>> Hey Nick,
>>>>
>>>> This looks pretty great. Is it something you need assistance with?
>>>> Anything I can do to help?
>>>>
>>>> Thanks,
>>>>
>>>> On 22 January 2013 15:18, nicholas a. evans <n...@ekenosen.net> wrote:
>>>>
>>>>> I've started hacking on an implementation of iterative map reduce
>>>>> that I'm calling "nested views".
It's not nearly finished, but well overdue for feedback from people who
>>>>> know what they're doing with CouchDB and Erlang. :)
>>>>>
>>>>> ****************************
>>>>> My bosses are open to a short term contract with an Erlang hacking
>>>>> CouchDB committer to get this into a safely usable state ASAP.
>>>>> ****************************
>>>>>
>>>>> == Intended Usage
>>>>>
>>>>> To create a nested view, add "views" inside of a view like so:
>>>>>
>>>>> {
>>>>>   "_id": "_design/ddoc",
>>>>>   "language": "javascript",
>>>>>   "views": {
>>>>>     "foo_count": {
>>>>>       "map": "function (doc) { emit(doc.foo, 1); }",
>>>>>       "reduce": "_sum",
>>>>>       "views": {
>>>>>         "sorted_by_count": {
>>>>>           "map": "function (doc) { emit(doc.value, null); }"
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>> The parent row is reduced with group_level=exact, and each reduced row
>>>>> is sent to the nested view map functions as a document that looks like
>>>>> so:
>>>>>
>>>>> {
>>>>>   "_id": json_encode(parent_view_key),
>>>>>   "key": parent_view_key,
>>>>>   "value": reduced_value
>>>>> }
>>>>>
>>>>> Nested views would be queried using the standard view API like so:
>>>>>
>>>>> GET /db/_design/ddoc/_view/foo_count/_view/sorted_by_count?query=args
>>>>>
>>>>> Using include_docs=true would return the reduced parent row as the doc.
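[Editor's note: the intended data flow described above (parent map, group_level=exact reduce, then each reduced row re-mapped as a document) can be simulated outside CouchDB. This plain-JavaScript sketch uses made-up helper names (parentMap, nestedMap, sumReduceByKey, rowToDoc); it illustrates the proposal only and is not CouchDB internals.]

```javascript
// Parent map: emit(doc.foo, 1); nested map: emit(doc.value, null).
// Each returns an array of [key, value] pairs, like emitted view rows.
function parentMap(doc) { return [[doc.foo, 1]]; }
function nestedMap(doc) { return [[doc.value, null]]; }

// Stand-in for the "_sum" reduce at group_level=exact:
// group rows by key and sum their values.
function sumReduceByKey(rows) {
  var totals = {};
  rows.forEach(function (kv) {
    var k = JSON.stringify(kv[0]);
    totals[k] = (totals[k] || 0) + kv[1];
  });
  return Object.keys(totals).map(function (k) {
    return { key: JSON.parse(k), value: totals[k] };
  });
}

// Each reduced row becomes the "doc" seen by the nested map function,
// with _id set to the JSON-encoded parent key, as proposed above.
function rowToDoc(row) {
  return { _id: JSON.stringify(row.key), key: row.key, value: row.value };
}

var docs = [{ foo: "a" }, { foo: "b" }, { foo: "a" }];

var parentRows = [];
docs.forEach(function (d) { parentRows = parentRows.concat(parentMap(d)); });

var reduced = sumReduceByKey(parentRows);

var nestedRows = [];
reduced.map(rowToDoc).forEach(function (d) {
  nestedRows = nestedRows.concat(nestedMap(d));
});
// nestedRows now holds [count, null] pairs keyed by count,
// i.e. the "sorted_by_count" index.
```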
>>>>>
>>>>> == Comparison to Cloudant's "dbcopy"
>>>>>
>>>>> The trade-offs between this approach and chaining to a different design
>>>>> doc or a different database are similar to the trade-offs of storing
>>>>> all of your views in one design doc versus a view per design doc:
>>>>>
>>>>> * simplicity of one design doc vs multiple
>>>>> * consistency within design doc index vs eventual consistency
>>>>> * high vs low latency between doc insertion and view availability
>>>>> * entire index is invalidated when one function changes
>>>>> * some views may need to be compacted more frequently than others
>>>>>
>>>>> I have some application code that could take advantage of each
>>>>> approach, and ultimately I'd like to see both styles supported natively
>>>>> by Apache CouchDB.
>>>>>
>>>>> == Implementation notes
>>>>>
>>>>> New record fields: #mrview{path=[], nested_id_btree=nil}.
>>>>>
>>>>> Each nested view creates a new View#mrview object which is stored
>>>>> directly on the #mrst.views list. The view state saved in the #mrheader
>>>>> will include a second btree in parent views.
>>>>> View#mrview.nested_id_btree keeps track of ToRemove entries for
>>>>> incremental updates (analogous to #mrst.id_btree). View#mrview.path
>>>>> represents the "path" of parent views from the root as a list. It is
>>>>> used (by couch_mrview_util:get_view/4) for view lookup; it influences
>>>>> the sort order of the views in #mrst.views, simplifies include_docs,
>>>>> detects "safe" top-level views during the initial mapper index updater
>>>>> pass, and ensures uniqueness when comparing one view to another (e.g.
>>>>> views could share their map definitions, but have different data
>>>>> sources). If two different paths share identical function code, they
>>>>> will still result in two View indexes.
>>>>>
>>>>> In origin/master, couch_mrview_updater:update/2 starts up two linked
>>>>> processes from the parent index updater process: one for mapping, one
>>>>> for writing (reductions stored in the btree are handled by the
>>>>> couch_btree module). I've created a third process to handle the nested
>>>>> mapping. The writer process traverses the view tree depth-first, and
>>>>> should send the reduced rows (group_level=exact) to the nested mapper
>>>>> process. The writer process waits for the nested results before
>>>>> continuing, because otherwise the mrstate would claim a higher
>>>>> update_seq than the nested views have completed indexing, and the
>>>>> index file would be temporarily inconsistent.
>>>>>
>>>>> My (incomplete) code is at http://github.com/nevans/couchdb (in
>>>>> master). I did a lot of refactoring just so I could understand the
>>>>> CouchDB code better and to test my knowledge of Erlang. Most of these
>>>>> changes were splitting up large functions, renaming variables, and
>>>>> changing recursive functions into higher order functions passed to
>>>>> lists:foldl (I have the same aversion to explicit looping in Ruby when
>>>>> Enumerable#inject can do the job). I noticed that master doesn't
>>>>> contain a single merge commit(!), so obviously this would need to be
>>>>> cleaned up a lot (cosmetic refactorings could be backed out or placed
>>>>> in separate commits) if there is ever any interest in merging it.
>>>>>
>>>>> == Implementation progress
>>>>>
>>>>> First-pass "done" so far:
>>>>>
>>>>> * Tests!
>>>>>   * First, do no harm: still passing original tests.
>>>>>   * Stupid basic test that creates a single simple nested view and
>>>>>     tries to update it
>>>>> * Creation
>>>>>   * Parse the design docs to create the nested views.
>>>>> * Updates
>>>>>   * Traversal of nested views by writer process.
>>>>>   * Visiting each nested view by nested mapper process.
>>>>>     * although it's buggy and crashes when I send in sample data.
>>>>> * Retrieval
>>>>>   * nothing yet...
>>>>>
>>>>> Still incomplete in my branch (with some questions):
>>>>>
>>>>> * Tests!
>>>>>   * more complicated nested views
>>>>>   * queried nested views
>>>>>   * queried with include_docs
>>>>>
>>>>> * Creation
>>>>>   * Add the nested views to the mrheader; read them in and write them
>>>>>     out.
>>>>>     couch_mrview_util:init_state/4
>>>>>     couch_mrview_util:open_view/5
>>>>>   * Initialize the nested_id_btree (with appropriate Reduce, Less,
>>>>>     etc).
>>>>>     couch_mrview_util:open_view/5
>>>>>   * Consider nested view defs in computing the index file's MD5.
>>>>>     couch_mrview_util:ddoc_to_mrst/2
>>>>>     Is this automatically accomplished by updating #mrst.views and
>>>>>     #mrview?
>>>>>
>>>>> * Updates
>>>>>   * Update sample data without crashing!
>>>>>   * Pull in the group/reduced rows and format them for the nested
>>>>>     mapper.
>>>>>     * This keeps stumping me.
>>>>>
>>>>> * Retrieval
>>>>>   * Query nested views at all.
>>>>>   * Query nested views with include_docs.
>>>>>
>>>>> * Etc (does this matter?)
>>>>>   * couch_mrview_util:calculate_data_size/2
>>>>>   * ???
>>>>>
>>>>> == Questions
>>>>>
>>>>> As an Erlang newbie hacking on unfamiliar source code, I hope I
>>>>> haven't made extra work for myself by going with this approach vs a
>>>>> view changes feed multi-db Cloudant-style approach... Is there
>>>>> anything incredibly stupid about this approach? :-)
>>>>>
>>>>> Any thoughts on the intended API?
>>>>>
>>>>> I'm having a hard time querying the view in its current state from the
>>>>> internals of couch. I've got the partially updated View state object
>>>>> with its updated Btrees, which is *not* the same View and Btree that
>>>>> would come back if I were to use
>>>>> couch_mrview_util:get_view(Db, DDoc, Vname, Args).
But I haven't been able to successfully hack together a version of
>>>>> couch_mrview:query_view that will do what I want.
>>>>>
>>>>> Can incremental updates for the most common use case (alternate sort
>>>>> indexes) be handled by something simpler than a completely generic
>>>>> iterative map reduce?
>>>>>
>>>>> I've branched from origin/master. If I want to put this into
>>>>> production for myself as soon as it's "done", should I use the 1.3
>>>>> branch instead?
>>>>>
>>>>> I've probably created some performance regressions in the nested view
>>>>> traversal. Any thoughts there? How should I test that (other than by
>>>>> timing my own datasets)? Right now, getting this done is more
>>>>> important for us than getting it done optimally.
>>>>>
>>>>> Is there a "view index version" number that should be incremented
>>>>> somewhere, so the new mrheader format doesn't break old views? Is
>>>>> there a good way to update a view file in place (e.g. see an old
>>>>> header and rewrite a new header)?
>>>>>
>>>>> Since the nested map and nested write queues never have more than one
>>>>> entry at a time, should I even bother with the overhead of
>>>>> couch_work_queue? Should I just use the plain old Erlang "send
>>>>> message, receive response" pattern instead? Since it is done
>>>>> synchronously, should it even use a separate process? To avoid CPU
>>>>> bottlenecks, it seems like "merge_results" should probably also be in
>>>>> its own process?
>>>>>
>>>>> On a side note, it seems wasteful to constantly be passing source code
>>>>> to the view servers. Have you considered using a pattern like redis's
>>>>> evalsha scripting? First check (using the checksums for the functions)
>>>>> to see if the functions have been loaded on that view server. If not,
>>>>> then load them. All other references to functions use the checksum
>>>>> instead.
Perhaps Spidermonkey is smart enough to keep an internal function
>>>>> cache, but it still seems like non-trivial IO bandwidth could be
>>>>> wasted on large function bodies.
>>>>>
>>>>> Since you've read this far, a reminder: my bosses are open to a short
>>>>> term contract with an Erlang hacking CouchDB committer to get this
>>>>> into a safely usable state ASAP.
>>>>>
>>>>> Thanks for any help! I'll also be available in #couchdb and
>>>>> #couchdb-dev.
>>>>>
>>>>> Nick Evans
>>>>
>>>>
>>>> --
>>>> NS