On 10-May-07, at 3:02 PM, Daniel Creão wrote:
So, I tried Solr and read about FederatedSearch and
CollectionDistribution.
An 'all-machines-have-complete-index' strategy (using rsync) can
improve system throughput and concurrency, since each machine processes
different queries, but each query will take the same amount of time as
on a single-node system (which is bad).
A single-node system _with 1/N the traffic_, sure.
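The trade-off Mike points at can be put in a toy model (the numbers below are illustrative assumptions, not Solr measurements): with N full replicas, per-query latency is unchanged, but the cluster absorbs N times the traffic.

```python
# Toy model of an 'all-machines-have-complete-index' (replicated) cluster.
# Each replica holds the full index, so per-query latency is unchanged,
# but incoming queries can be spread across replicas, multiplying
# sustainable throughput. Hypothetical numbers, not Solr benchmarks.

SINGLE_NODE_LATENCY_S = 0.5            # assumed time to search the full index
SINGLE_NODE_QPS = 1 / SINGLE_NODE_LATENCY_S

def replicated_cluster(n_replicas):
    latency = SINGLE_NODE_LATENCY_S            # same index, same work per query
    throughput = n_replicas * SINGLE_NODE_QPS  # queries fan out across replicas
    return latency, throughput

latency, qps = replicated_cluster(4)
print(latency, qps)  # latency unchanged (0.5), throughput x4 (8.0)
```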
When each machine of an N-machine cluster indexes 1/N of the text
collection, each machine spends less time processing queries, but all
machines must process the same query at the same time (a 'goodbye,
concurrency', IMO), then merge the results.
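The sharded scheme described above is the classic scatter-gather pattern. A minimal sketch, assuming made-up shard contents and a hypothetical search_shard() helper (not Solr's actual API):

```python
import heapq

# Scatter-gather over N shards: each machine indexes 1/N of the
# collection, every query is sent to all shards, and the per-shard
# partial results are merged by score. Shard data is a stand-in.

shards = [
    [("doc1", 0.9), ("doc4", 0.4)],   # shard 0: (doc id, score), sorted
    [("doc2", 0.8), ("doc5", 0.3)],   # shard 1
    [("doc3", 0.7), ("doc6", 0.2)],   # shard 2
]

def search_shard(shard, top_k):
    # Each shard returns its own top-k hits, already sorted by score.
    return shard[:top_k]

def federated_search(shards, top_k=3):
    # Scatter: query every shard. Gather: merge partials by descending score.
    partials = [search_shard(s, top_k) for s in shards]
    merged = heapq.merge(*partials, key=lambda hit: -hit[1])
    return list(merged)[:top_k]

print(federated_search(shards))  # [('doc1', 0.9), ('doc2', 0.8), ('doc3', 0.7)]
```

In a real cluster the scatter step runs in parallel, so per-query latency shrinks toward the time to search one shard plus the merge cost.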
I don't really understand this.
For huge corpora, you must distribute different parts of the index
over multiple servers. For high throughput, you must distribute the
same part of the index over multiple servers. These are not
competing strategies, and to solve both problems, both solutions must
be employed.
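Combining both solutions gives a grid: N shards for corpus size, M replicas per shard for throughput. A query must visit every shard (for coverage) but only one replica of each. A sketch under those assumptions; the names are hypothetical, not Solr's topology code:

```python
import random

# Hypothetical N x M topology: the index is split into N shards and each
# shard is replicated M times. Routing a query picks exactly one replica
# per shard, so every document is covered while load spreads over replicas.

N_SHARDS, N_REPLICAS = 3, 2

# cluster[shard][replica] -> server name
cluster = [[f"shard{s}-replica{r}" for r in range(N_REPLICAS)]
           for s in range(N_SHARDS)]

def route_query(cluster, rng=random):
    # One replica per shard: full index coverage, 1/M of the replica load.
    return [rng.choice(replicas) for replicas in cluster]

print(route_query(cluster))  # one server per shard
```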
Did I get anything wrong (about Hadoop and Solr)?
Is Multiple Masters/FederatedSearch under development? What is its
status? Or should I develop it myself?
Implementation of this in Solr is still in the highly theoretical
stage, so it is unlikely to happen any time soon.
You might try Nutch, which is basically an implementation of this
strategy using Lucene.
-Mike