Hi All,
I am running Lucene 3.4 in an application that indexes about 1 billion factual assertions (Documents) from the web across four separate disks, so that each disk holds a separate index of about 250 million Documents. The Documents are relatively small, less than 1 KB each. These indexes provide data to our web demo (http://openie.cs.washington.edu), where a typical search needs to retrieve and materialize as many as 3,000 Documents from each index in order to display a page of results to the user.

In the worst case, a new, uncached query takes around 30 seconds to complete, with all four disks IO-bottlenecked during most of that time. My implementation uses a separate Thread per disk to (1) call IndexSearcher.search(Query query, Filter filter, int n) and (2) process the Documents returned from IndexSearcher.doc(int). Since 30 seconds seems like a long time to retrieve 3,000 small Documents per index, I am wondering if I am overlooking something simple somewhere.

- Is there a better method for retrieving Documents in bulk?
- Is there a better way of parallelizing searches over indexes on separate disks than a MultiReader (which doesn't seem to parallelize the task of materializing Documents)?
- Any other suggestions?

I have tried some of the basic ideas on the Lucene wiki, such as leaving the IndexSearcher open for the life of the process (a servlet).

Any help would be greatly appreciated!

Rob
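P.S. In case it helps to see the shape of my fan-out, here is a minimal, self-contained sketch of the pattern: one worker per index, each returning its hits, merged in submission order. The names (ParallelSearchSketch, searchOneIndex, searchAll) are made up for illustration, and the actual Lucene calls (IndexSearcher.search / IndexSearcher.doc) are replaced with a plain list scan so the sketch compiles and runs on its own:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSearchSketch {

    // Stand-in for the per-disk work: in the real code this wraps
    // IndexSearcher.search(query, filter, n) plus a loop over
    // IndexSearcher.doc(int) to materialize the hits. Here an "index"
    // is just a list of strings, matched by substring.
    static List<String> searchOneIndex(List<String> index, String query, int n) {
        List<String> hits = new ArrayList<String>();
        for (String doc : index) {
            if (doc.contains(query)) {
                hits.add(doc);
                if (hits.size() == n) break; // cap at n hits per index
            }
        }
        return hits;
    }

    // Fan out one task per index (one per disk), then merge the
    // per-index results in submission order once all tasks finish.
    public static List<String> searchAll(List<List<String>> indexes,
                                         String query, int n)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(indexes.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
            for (final List<String> index : indexes) {
                futures.add(pool.submit(() -> searchOneIndex(index, query, n)));
            }
            List<String> merged = new ArrayList<String>();
            for (Future<List<String>> f : futures) {
                merged.addAll(f.get()); // blocks until that index's search is done
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> indexes = Arrays.asList(
                Arrays.asList("apple pie", "banana"),
                Arrays.asList("apple tart", "cherry"));
        // Prints [apple pie, apple tart]
        System.out.println(searchAll(indexes, "apple", 10));
    }
}
```

The total latency of searchAll is the slowest single index rather than the sum, which is why I went with one thread per disk instead of a single MultiReader.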