FederatedSearch and large Lucene distributed indexes

2007-05-10 Thread Daniel Creão

I'm looking forward some strategy to build a large Lucene distributed index,
but after read a lot of mail threads, it seems that Lucene hasn't a 'default
solution' for that yet.

The search engine that I'm building has a 400 GB text-database (right now,
growing every day and without document deleting operation) and thousands
queries per day (a very small time of response is needed). I expected to use
a commodity computers cluster as hardware infrastructure and the distributed
strategy must allow concurrency/parallel processing/searching with all
processors.

Searching on Lucene site, I thought Hadoop was the right choice. After read
some mail threads about this (at java lucene and hadoop, dev and user), it
seems that Hadoop isn't that good (let's put this way) to deal with
distributed indexes for a search engine.

So, I tried Solr and read about FederatedSearch and CollectionDistribution.
An 'all-machines-have-complete-index' strategy (using rsync) can improve
system throughput and concurrency by each station processing different
queries, but each query will spend the same amount of time that a
single-node system (what sucks).

When each of a N-station cluster indexing 1/N of text collection, each will
machine spend less time processing queries, but all machines must process
the same query at the same time (a 'goodbye, concurrency', IMO), then merge
results.

Neither of them looks good for me.

The 'Multiple Masters' (described on FederatedSearch) solution looks good.
"There could be a master for each slice of the index. An external module
could provide the update interface and forward the request to the correct
master based on the unique key field". That's great, has high scalability,
high concurrency, smaller index for each node... I read some papers with
very similar solution [1][2][3] and discussing how to solve some of this
strategy issues.

Did I get anything wrong (about Hadoop and Solr)?

Is Multiple Masters/FederatedSearch under development? What status? Or did I
should develop it for myself?

Thanks for any help.
Daniel

[1] B. Ribeiro-Neto, J. Kitajima, G. Navarro e N. Ziviani. Parallel
Generation of Inverted Files for Distributed Text Collections. In
Proceedings of the XVIII international Conference of the Chilean Computer
Science Society. IEEE Computer Society Washington, 1998.

[2] C. Badue, R. Baeza-Yates, B. Ribeiro-Neto e N. Ziviani. Distributed
Query Processing Using Partitioned Inverted Files. Eighth Symposium on
String Processing and Information Retrieval (SPIRE'01), 2001.

[3] C. Badue, R. Barbosa, B. Ribeiro-Neto e N. Ziviani. Basic issues on the
processing of web queries. In Proceedings of the 28th Annual international
ACM SIGIR Conference on Research and Development in information Retrieval,
2005.


Re: FederatedSearch and large Lucene distributed indexes

2007-05-10 Thread Mike Klaas

On 10-May-07, at 3:02 PM, Daniel Creão wrote:
So, I tried Solr and read about FederatedSearch and  
CollectionDistribution.
An 'all-machines-have-complete-index' strategy (using rsync) can  
improve

system throughput and concurrency by each station processing different
queries, but each query will spend the same amount of time that a
single-node system (what sucks).


A single-node system _with 1/N the traffic_, sure.

When each of a N-station cluster indexing 1/N of text collection,  
each will
machine spend less time processing queries, but all machines must  
process
the same query at the same time (a 'goodbye, concurrency', IMO),  
then merge

results.


I don't really understand this.

For huge corpora, you must distribute different parts of the index  
over multiple servers.  For high throughput, you must distribute the  
same part of the index over multiple servers.  These are not  
competing strategies, and to solve both problems, both solutions must  
be employed.



Did I get anything wrong (about Hadoop and Solr)?

Is Multiple Masters/FederatedSearch under development? What status?  
Or did I

should develop it for myself?


Implementation of this in Solr is still in the highly theoretical  
stage, so is unlikely to happen any time soon.


You might try Nutch, which is basically an implementation of this  
strategy using Lucene.


-Mike