This is analogous to the multiple data sources that we have at
DeepDyve <http://www.deepdyve.com>.
In a fully sharded and balanced environment, I have found it *much* more
efficient to put all data sources into a single collection and use a filter
to select one or the other.  The rationale is that data sources are
distributed in size according to a rough long-tail distribution.  For the
largest ones, the filters are about as efficient as a separate index because
they are such a large fraction of the index.  For the small ones, the
filtered query is so fast that other issues form the bottleneck anyway.  The
operational economies of not managing hundreds of indexes and the much
better load balancing make the integrated solution massively better for
me.  We currently use Katta and this system works really, really well.
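
To make the single-collection idea concrete, here is a rough sketch of
selecting data sources with a filter instead of a separate index.  It
assumes a Solr-style HTTP interface where the standard fq parameter
carries the filter; the field name "source" and the source names are
hypothetical, and this is not how our Katta setup is wired.

```python
from urllib.parse import urlencode


def build_filtered_query(user_query, sources=None, source_field="source"):
    """Build URL-encoded query parameters for a single shared collection.

    `source_field` and the values in `sources` are hypothetical names
    used for illustration only.
    """
    params = [("q", user_query)]
    if sources:
        # Quote each source name and OR them together in one filter clause,
        # so one request can cover one source or many.
        quoted = ['"%s"' % s.replace('"', '\\"') for s in sources]
        params.append(("fq", "%s:(%s)" % (source_field, " OR ".join(quoted))))
    # With no sources given, the query runs against every data source.
    return urlencode(params)


if __name__ == "__main__":
    # Single-source query, the pattern expected for the per-customer case.
    print(build_filtered_query("protein folding", ["pubmed"]))
    # Multi-source query, the pattern that dominates for us.
    print(build_filtered_query("protein folding", ["pubmed", "arxiv"]))
```

The same function covers both query patterns: a per-customer search
passes one source, and a broad search passes several or none.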

One big difference in our environments is that for me, the dominant query
pattern involves most data sources while for you, the dominant pattern will
likely involve a single data source.

On Tue, Feb 9, 2010 at 9:02 PM, Jon Gifford <jon.giff...@gmail.com> wrote:

> 1) Support one index per customer, and many customers (thus, many
> independent indices)
>



-- 
Ted Dunning, CTO
DeepDyve
