Re: Partitioned Clusters

Robert Newson Fri, 20 Feb 2009 02:35:44 -0800

Any thoughts as to how (or even if) this tree-wise result aggregation
would work for externals?


I'm thinking specifically about couchdb-lucene, where multi-node
results aggregation is possible, given a framework like you propose
here. The results that couchdb-lucene produces can already be
aggregated, assuming there's a hook for the merge function (actually,
perhaps it's exactly reduce-shaped)...

B.

On Fri, Feb 20, 2009 at 3:12 AM, Chris Anderson <jch...@apache.org> wrote:
> On Thu, Feb 19, 2009 at 6:39 PM, Ben Browning <ben...@gmail.com> wrote:
>> Overall the model sounds very similar to what I was thinking. I just
>> have a few comments.
>>
>>> In this model documents are saved to a leaf node depending on a hash
>>> of the docid. This means that lookups are easy, and need only to touch
>>> the leaf node which holds the doc. Redundancy can be provided by
>>> maintaining R replicas of every leaf node.
>>
>> There are several use-cases where a true hash of the docid won't be the
>> optimal partitioning key. The simple case is where you want to partition
>> your data by user and in most non-trivial cases you won't be storing
>> all of a user's data under one document with the user's id as the docid.
>> A fairly simple solution would be allowing the developer to specify a 
>> javascript
>> function somewhere (not sure where this should live...) that takes a docid 
>> and
>> spits out a partition key. Then I could just prefix all my doc ids for
>> a specific user
>> with that user's id and write the appropriate partition function.
>>
>>>
>>> View queries, on the other hand, must be handled by every node. The
>>> requests are proxied down the tree to leaf nodes, which respond
>>> normally. Each proxy node then runs a merge sort algorithm (which can
>>> sort in constant space proportional to # of input streams) on the view
>>> results. This can happen recursively if the tree is deep.
>>
>> If the developer has control over partition keys as suggested above, it's
>> entirely possible to have applications where view queries only need data
>> from one partition. It would be great if we could do something smart here or
>> have a way for the developer to indicate to Couch that all the data should
>> be on only one partition.
>>
>> These are just nice-to-have features and the described cluster setup could
>> still be extremely useful without them.
>
> I think they are both sensible optimizations. Damien's described the
> JS partition function before on IRC, so I think it fits into the
> model. As far as restricting view queries to just those docs within a
> particular id range, it might make sense to partition by giving each
> user their own database, rather than logic on the docid. In the case
> where you need data in a single db, but still have some queries that
> can be partitioned, its still a good optimization. Luckily even in the
> unoptimized case, if a node has no rows to contribute to the final
> view result than it should have a low impact on total resources needed
> to generate the result.
>
> Chris
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

Re: Partitioned Clusters

Reply via email to