On 10-May-07, at 3:02 PM, Daniel Creão wrote:
So, I tried Solr and read about FederatedSearch and CollectionDistribution. An 'all-machines-have-complete-index' strategy (using rsync) can improve
system throughput and concurrency, since each station processes different
queries, but each query will take the same amount of time as on a
single-node system (which sucks).

A single-node system _with 1/N the traffic_, sure.

When each machine of an N-station cluster indexes 1/N of the text collection, each machine will spend less time processing a query, but all machines must process the same query at the same time ('goodbye, concurrency', IMO) and then merge the
results.

I don't really understand this.

For huge corpora, you must distribute different parts of the index over multiple servers. For high throughput, you must distribute the same part of the index over multiple servers. These are not competing strategies, and to solve both problems, both solutions must be employed.
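
To make the combination concrete, here is a rough Python sketch of the idea. It is not Solr or Nutch code: the names (build_cluster, search_replica, search), the in-memory "index", and the term-count scoring are all made up for illustration. The index is split into num_shards parts (for corpus size), each part is copied num_replicas times (for throughput), and a query goes to one replica of every shard before the partial results are merged.

```python
import heapq
import random


def build_cluster(docs, num_shards, num_replicas):
    """Split docs into num_shards parts, then copy each part num_replicas times.

    docs maps an integer doc id to its text. The result is a list of shards,
    where each shard is a list of identical replicas (plain dicts here).
    """
    parts = [{} for _ in range(num_shards)]
    for doc_id, text in docs.items():
        # Hypothetical partitioning rule: doc id modulo the shard count.
        parts[doc_id % num_shards][doc_id] = text
    return [[dict(part) for _ in range(num_replicas)] for part in parts]


def search_replica(replica, term, k):
    """Return the top k (score, doc_id) pairs from one replica.

    Toy scoring (an assumption): score = occurrences of term in the text.
    """
    hits = [(text.split().count(term), doc_id) for doc_id, text in replica.items()]
    return heapq.nlargest(k, [hit for hit in hits if hit[0] > 0])


def search(cluster, term, k=3, rng=random):
    """Query one replica of every shard, then merge the partial top-k lists."""
    partial = []
    for replicas in cluster:
        # Any replica of a shard can answer, so concurrent queries spread out.
        replica = rng.choice(replicas)
        partial.extend(search_replica(replica, term, k))
    return [doc_id for _score, doc_id in heapq.nlargest(k, partial)]


if __name__ == "__main__":
    docs = {
        0: "solr lucene solr",
        1: "nutch lucene",
        2: "solr",
        3: "hadoop",
        4: "solr solr solr",
    }
    cluster = build_cluster(docs, num_shards=2, num_replicas=3)
    print(search(cluster, "solr", k=2))
```

Every query still touches all shards, but because each shard has several replicas, different queries can be served by different machines at the same time.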

Did I get anything wrong (about Hadoop and Solr)?

Is Multiple Masters/FederatedSearch under development? What is its status? Or
should I develop it myself?

Implementation of this in Solr is still in the highly theoretical stage, so it is unlikely to happen any time soon.

You might try Nutch, which is basically an implementation of this strategy using Lucene.

-Mike
