I question why you "retrieve and materialize as many as 3,000 Documents from each index in order to display a page of results to the user". You must be doing some post-processing, because displaying 12,000 documents to the user is completely useless.
I wonder if this is an "XY" problem, see: http://people.apache.org/~hossman/#xyproblem.

You're seeking all over each disk for 3,000 documents, which will take time no matter what, especially if you're loading a bunch of fields. So let's back up a bit and ask why you think you need all those documents. Is it something you could push down into the search process?

Also, 250M docs/index is a lot of docs. Before continuing, it would be useful to know your raw search performance if you, say, fetched 1 document from each partition, keeping in mind Lance's comments that the first searches load up a bunch of caches and will be slow. And, as he says, you can get around that with autowarming. But before going there, let's understand the root of the problem. Is it search speed, or just loading all those documents and then doing your post-processing?

Best
Erick

On Thu, Dec 22, 2011 at 3:16 AM, Lance Norskog <goks...@gmail.com> wrote:
> Is each index optimized?
>
> From my vague grasp of Lucene file formats, I think you want to sort
> the documents by segment document id, which is the order of documents
> on the disk. This lets you materialize documents in their order on the
> disk.
>
> Solr (and other apps) generally use a separate thread per task and
> separate index reading classes (not sure which any more).
>
> As to the cold-start, how many terms are there? You are loading them
> into the field cache, right? Solr has a feature called "auto-warming"
> which automatically runs common queries each time it reopens an index.
>
> On Wed, Dec 21, 2011 at 11:11 PM, Paul Libbrecht <p...@hoplahup.net> wrote:
>> Michael,
>>
>> from a physical point of view, it would seem like the order in which the
>> documents are read is very significant for the reading speed (feel the
>> random access jump as being the issue).
>>
>> You could:
>> - move to ram-disk or ssd to make a difference?
>> - use something different than a searcher which might be doing it better
>> (pure speculation: does a hit-collector make a difference?)
>>
>> hope it helps.
>>
>> paul
>>
>>
>> On 22 Dec 2011, at 03:45, Robert Bart wrote:
>>
>>> Hi All,
>>>
>>> I am running Lucene 3.4 in an application that indexes about 1 billion
>>> factual assertions (Documents) from the web over four separate disks, so
>>> that each disk has a separate index of about 250 million documents. The
>>> Documents are relatively small, less than 1KB each. These indexes provide
>>> data to our web demo (http://openie.cs.washington.edu), where a typical
>>> search needs to retrieve and materialize as many as 3,000 Documents from
>>> each index in order to display a page of results to the user.
>>>
>>> In the worst case, a new, uncached query takes around 30 seconds to
>>> complete, with all four disks IO bottlenecked during most of this time. My
>>> implementation uses a separate Thread per disk to (1) call
>>> IndexSearcher.search(Query query, Filter filter, int n) and (2) process the
>>> Documents returned from IndexSearcher.doc(int). Since 30 seconds seems like
>>> a long time to retrieve 3,000 small Documents, I am wondering if I am
>>> overlooking something simple somewhere.
>>>
>>> Is there a better method for retrieving documents in bulk?
>>>
>>> Is there a better way of parallelizing indexes from separate disks than to
>>> use a MultiReader (which doesn’t seem to parallelize the task of
>>> materializing Documents)
>>>
>>> Any other suggestions? I have tried some of the basic ideas on the Lucene
>>> wiki, such as leaving the IndexSearcher open for the life of the process (a
>>> servlet). Any help would be greatly appreciated!
>>>
>>> Rob
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>
> --
> Lance Norskog
> goks...@gmail.com
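
For reference, Lance's suggestion in the thread (materialize documents in doc-id order, which he believes matches their order on disk) could be sketched roughly as below. This is only an illustration, not code from anyone in the thread: the class DocOrderFetch, the method fetchInDocOrder and the loader parameter are hypothetical names, and the loader is a stand-in for a call such as IndexSearcher.doc(int) so that the example runs without Lucene on the classpath. Whether this actually reduces seeking on a given index is an assumption that would need measuring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntFunction;

/**
 * Sketch: load hits in ascending doc-id order, then hand them back
 * in the original ranking order.
 */
final class DocOrderFetch {

    /**
     * @param rankedDocIds doc ids in ranking order (e.g. taken from ScoreDoc.doc)
     * @param loader       hypothetical stand-in for a call such as IndexSearcher.doc(int)
     * @return the loaded documents, in the same order as rankedDocIds
     */
    static <T> List<T> fetchInDocOrder(int[] rankedDocIds, IntFunction<T> loader) {
        int n = rankedDocIds.length;

        // Each entry is a rank; sort the ranks by the doc id found at that rank.
        Integer[] positions = new Integer[n];
        for (int i = 0; i < n; i++) {
            positions[i] = i;
        }
        Arrays.sort(positions, Comparator.comparingInt(p -> rankedDocIds[p]));

        // Load in ascending doc-id order, storing each result at its original rank.
        List<T> byRank = new ArrayList<>(Collections.<T>nCopies(n, null));
        for (int p : positions) {
            byRank.set(p, loader.apply(rankedDocIds[p]));
        }
        return byRank;
    }

    public static void main(String[] args) {
        int[] ranked = {42, 7, 19};                  // doc ids in ranking order
        List<Integer> loadOrder = new ArrayList<>(); // records the order of the reads
        List<String> docs = fetchInDocOrder(ranked, id -> {
            loadOrder.add(id);
            return "doc-" + id;
        });
        System.out.println(loadOrder); // [7, 19, 42]
        System.out.println(docs);      // [doc-42, doc-7, doc-19]
    }
}
```

With Lucene, the loader would be something like id -> searcher.doc(id), run once per index on that disk's own thread. Timing the same query with and without the sort would show whether read order matters here at all.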