This is analogous to the multiple data sources that we have at DeepDyve <http://www.deepdyve.com>. In a fully sharded and balanced environment, I have found it *much* more efficient to put all data sources into a single collection and use a filter to select one or the other. The rationale is that data source sizes follow a rough long-tail distribution. For the largest sources, a filtered query is about as efficient as a query against a separate index, because each of those sources makes up such a large fraction of the combined index. For the small ones, the filtered query is so fast that other issues form the bottleneck anyway. The operational economies of not managing hundreds of indexes and the much better load balancing make the integrated solution massively better for me. We currently use Katta and this system works really, really well.
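To make the single-collection idea concrete, here is a minimal sketch in Python. It is a toy in-memory inverted index, not Katta or Lucene code, and the class name `CombinedIndex`, its methods, and the source names "journals" and "patents" are all made up for illustration. The point is only that every source lives in one index and a per-source set of document ids acts as the filter.

```python
from collections import defaultdict


class CombinedIndex:
    """Toy inverted index holding documents from every data source.

    Hypothetical illustration only; not the Katta or Lucene API.
    """

    def __init__(self):
        self.postings = defaultdict(set)   # term -> ids of docs containing it
        self.by_source = defaultdict(set)  # source -> ids of docs from it

    def add(self, doc_id, source, text):
        """Index one document and record which source it came from."""
        self.by_source[source].add(doc_id)
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term, sources=None):
        """Return sorted ids matching `term`, optionally limited to `sources`."""
        hits = self.postings.get(term.lower(), set())
        if sources is None:
            return sorted(hits)  # no filter: query spans every source
        # The filter is the union of the doc-id sets of the selected sources.
        allowed = set().union(*(self.by_source.get(s, set()) for s in sources))
        return sorted(hits & allowed)


index = CombinedIndex()
index.add(1, "journals", "sharded search index")
index.add(2, "patents", "search filter")
index.add(3, "journals", "load balancing")

all_hits = index.search("search")                       # every source
patent_hits = index.search("search", sources=["patents"])  # one source
```

A query across most sources and a query against a single source go through the same index here; only the filter changes, which is what removes the need to manage one index per source.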
One big difference in our environments is that for me, the dominant query pattern involves most data sources while for you, the dominant pattern will likely involve a single data source.

On Tue, Feb 9, 2010 at 9:02 PM, Jon Gifford <jon.giff...@gmail.com> wrote:

> 1) Support one index per customer, and many customers (thus, many
> independent indices)
>

--
Ted Dunning, CTO
DeepDyve