Hi Nicholas, This looks very interesting. My apologies for not looking into this in more detail (it's CouchDB Conf week). I'll definitely have a closer look soon.
Cheers Jan -- On 22.01.2013, at 16:18, "nicholas a. evans" <n...@ekenosen.net> wrote: > I've started hacking on an implementation of iterative map reduce that I'm > calling "nested views". It's not nearly finished, but well overdue for > feedback from people who know what they're doing with CouchDB and Erlang. :) > > **************************** > My bosses are open to a short term contract with an Erlang hacking CouchDB > committer to get this into a safely usable state ASAP. > **************************** > > == Intended Usage > > To create a nested view, add "views" inside of a view like so: > > { > "_id": "_design/ddoc", > "language": "javascript", > "views": { > "foo_count": { > "map": "function (doc) { emit(doc.foo, 1); }", > "reduce": "_sum", > "views": { > "sorted_by_count": { > "map":"function(doc) { emit(doc.value, nil); }" > } > } > } > } > } > > The parent row is reduced with group_level=exact, and each reduced row is > sent to the nested view map functions as a document that looks like so: > > { > "_id": json_encode(parent_view_key), > "key": parent_view_key, "value": reduced_value, > } > > Nested views would be queried using the standard view API like so: > > GET /db/_design/ddoc/_view/foo_count/_view/sorted_by_count?query=args > > Using include_docs=true would return the reduced parent row as the doc. > > == Comparison to Cloudant's "dbcopy" > > The trade-offs between this approach and chaining to a different design doc > or a different database are similar to the trade-offs of storing all of your > views in one design doc versus a view per design doc: > > * simplicity of one design doc vs multiple > * consistency within design doc index vs eventual consistency > * high vs low latency between doc insertion and view availability > * entire index is invalidated when one function changes > * some views may need to be compacted more frequently than others > > I have some application code that could take advantage of each approach, and > ultimately I'd like to see both styles supported natively by Apache CouchDB. > > == Implementation notes > > New record fields: #mrview{path=[], nested_id_btree=nil}. > > Each nested view creates a new View#mrview object which is stored directly > on the #mrstate.views list. The view state saved in the #mrheader will > include a second btree in parent views. View#mrview.nested_id_btree keeps > track of ToRemove entries for incremental updates (analogous to > #mrstate.id_btree). View#mrview.path represents the "path" of parent views > from > the root as a list. It will be used used (by couch_mrview_util:get_view/4) > for > view lookup, influences the sort order of the views in #mrst.views, simplifies > include_docs, detects "safe" top-level views during the initial mapper index > updater pass, and ensures uniqueness when comparing one view to another (e.g. > views could share their map definitions, but have different data sources). If > two different paths share identical function code, they will still result in > two > View indexes. > > In origin/master, couch_mrview_updater:update/2 starts up two linked > processes from the parent index updater process; one for mapping, one for > writing (reductions stored in the btree are handled by the couch_btree > module). I've created a third process to handle the nested mapping. The > writer > process traverses the view tree depth-first, and should send the reduced rows > (group_level=exact) to the nested mapper process. The writer process waits > for > the nested results before continuing, because otherwise the mrstate would > claim > a higher update_seq than the nested views have completed indexing and the > index > file would be temporarily inconsistent. > > My (incomplete) code is at http://github.com/nevans/couchdb (in master). I > did > a lot of refactoring just so I could understand the CouchDB code better and to > test my knowledge of Erlang. Most of these were splitting up large functions, > renaming variables, and changing recursive functions into higher order > functions > passed to lists:foldl (I have the same aversion to explicit looping in ruby > when > Enumerable#inject can do the job). I noticed that master doesn't contain a > single merge commit(!) so obviously this would need to be cleaned up a lot > (cosmetic refactorings could be backed out or placed in separate commits) if > there is ever any interest in merging it. > > == Implementation progress > > First-pass "done" so far: > > * Tests! > * First do no harm: still passing original tests. > * Stupid basic test that creates a single simple nested view > and tries to update it > * Creation > * Parse the design docs to create the nested views. > * Updates > * Traversal of nested views by writer process. > * Visiting each nested view by nested mapper process. > * although its buggy and crashes when I send in sample data. > * Retrieval > * nothing yet... > > Still incomplete in my branch (with some questions): > > * Tests! > * more complicated nested views > * queried nested views > * queried with include_docs > > * Creation > * Add the nested views to the mrheader; read them in and write them out. > couch_mrview_util:init_state/4 > couch_mrview_util:open_view/5 > * Initialize the nested_id_btree (with appropriate Reduce, Less, etc). > couch_mrview_util:open_view/5 > * Consider nested view defs in computing the index file's MD5. > couch_mrview_util:ddoc_to_mrst/2 > Is this automatically accomplished by updating #mrst.views and #mrview? > > * Updates > * Update sample data without crashing! > * Pull in the group/reduced rows and format them for the nested mapper. > * This keeps stumping me. > > * Retrieval > * Query nested views at all. > * Query nested views with include_docs. > > * Etc (does this matter?) > * couch_mrview_util:calculate_data_size/2 > * ??? > > == Questions > > As an Erlang newbie hacking on unfamiliar source code, I hope I haven't made > extra work for myself by going with this approach vs a view changes feed > multi-db Cloudant-style approach... Is there anything incredibly stupid about > this approach? :-) > > Any thoughts on the intended API? > > I'm having a hard time querying the view in its current state from the > internals > of couch. I've got the partially updated View state object with its updated > Btrees, which is *not* the same View and Btree that would come back if I were > to > use couch_mrview_util:get_view(Db, DDoc, Vname, Args). But I haven't been > able > to successfully hack together a version of couch_mrview:query_view that will > do > what I want. > > Can incremental updates for the most common use case (alternate sort > indexes) be handled by something simpler than a completely generic iterative > map reduce? > > I've branched from origin/master. If I want to put this into production for > myself as soon as its "done", should I use the 1.3 branch instead? > > I've probably created some performance regressions in the nested view > traversal. > Any thoughts there? How should I test that (other than by timing my own > datasets)? Right now, getting this done is more important for us than getting > it done optimally. > > Is there a "view index version" number that should be incremented somewhere, > so the new mrheader format doesn't break old views? Is there a good way to > update a view file in place (e.g. see an old header and rewrite a new > header)? > > Since the nested map and nested write queues never have more than one entry > at a time, should I even bother with the overhead of couch_work_queue? > Should I just use the plain old Erlang "send message receive response" > pattern instead? Since it is done synchronously, should it even use a > separate process? To avoid CPU bottlenecks, it like "merge_results" should > probably also be in its own process? > > On a side not, it seems wasteful to constantly be passing source code to the > view servers. Have you considered using a pattern like redis's evalsha > scripting? First check (using the checksums for the functions) to see if the > functions have been loaded on that view server. If not then load them. All > other references to functions use the checksum instead. Perhaps Spidermonkey > is > smart enough to keep an internal function cache, but it still seems like > non-trivial IO bandwidth could be wasted on large function bodies. > > > Since you've read this far, a reminder: My bosses are open to a short term > contract with an Erlang hacking CouchDB committer to get this into a safely > usable state ASAP. > > Thanks for any help! I'll also be available in #couchdb and #couchdb-dev. > > Nick Evans