Any thoughts as to how (or even if) this tree-wise result aggregation would work for externals?
I'm thinking specifically about couchdb-lucene, where multi-node results aggregation is possible, given a framework like you propose here. The results that couchdb-lucene produces can already be aggregated, assuming there's a hook for the merge function (actually, perhaps it's exactly reduce-shaped)... B. On Fri, Feb 20, 2009 at 3:12 AM, Chris Anderson <jch...@apache.org> wrote: > On Thu, Feb 19, 2009 at 6:39 PM, Ben Browning <ben...@gmail.com> wrote: >> Overall the model sounds very similar to what I was thinking. I just >> have a few comments. >> >>> In this model documents are saved to a leaf node depending on a hash >>> of the docid. This means that lookups are easy, and need only to touch >>> the leaf node which holds the doc. Redundancy can be provided by >>> maintaining R replicas of every leaf node. >> >> There are several use-cases where a true hash of the docid won't be the >> optimal partitioning key. The simple case is where you want to partition >> your data by user and in most non-trivial cases you won't be storing >> all of a user's data under one document with the user's id as the docid. >> A fairly simple solution would be allowing the developer to specify a >> javascript >> function somewhere (not sure where this should live...) that takes a docid >> and >> spits out a partition key. Then I could just prefix all my doc ids for >> a specific user >> with that user's id and write the appropriate partition function. >> >>> >>> View queries, on the other hand, must be handled by every node. The >>> requests are proxied down the tree to leaf nodes, which respond >>> normally. Each proxy node then runs a merge sort algorithm (which can >>> sort in constant space proportional to # of input streams) on the view >>> results. This can happen recursively if the tree is deep. >> >> If the developer has control over partition keys as suggested above, it's >> entirely possible to have applications where view queries only need data >> from one partition. It would be great if we could do something smart here or >> have a way for the developer to indicate to Couch that all the data should >> be on only one partition. >> >> These are just nice-to-have features and the described cluster setup could >> still be extremely useful without them. > > I think they are both sensible optimizations. Damien's described the > JS partition function before on IRC, so I think it fits into the > model. As far as restricting view queries to just those docs within a > particular id range, it might make sense to partition by giving each > user their own database, rather than logic on the docid. In the case > where you need data in a single db, but still have some queries that > can be partitioned, its still a good optimization. Luckily even in the > unoptimized case, if a node has no rows to contribute to the final > view result than it should have a low impact on total resources needed > to generate the result. > > Chris > > -- > Chris Anderson > http://jchris.mfdz.com >