I'll take your word for it, though it seems odd. I'm wondering if there's anything you can do to pre-process the documents at index time to make the post-processing less painful, but that's a wild shot in the dark...

Another possibility would be to fetch only the fields you need to do the post-processing, then go back to the servers for the full documents, something like q=id:(1 2 3 4 5 6 .....), assuming that your post-processing reduces to a reasonable number of documents. I have no real clue whether you could reduce the data returned to few enough fields to really make a difference, but it might be a possibility.
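Very roughly, and untested, that two-pass idea could look like the sketch below. It assumes a stored and indexed "id" field and uses a postProcess() placeholder for whatever your aggregation actually does; it also runs both passes against one searcher rather than going back over the wire.

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class TwoPassFetch {

    // Load only the "id" field on the first pass.
    private static final FieldSelector ID_ONLY = new FieldSelector() {
        public FieldSelectorResult accept(String fieldName) {
            return "id".equals(fieldName) ? FieldSelectorResult.LOAD
                                          : FieldSelectorResult.NO_LOAD;
        }
    };

    public static List<Document> fetch(IndexSearcher searcher, Query query, int n)
            throws Exception {
        // Pass 1: run the real query, but materialize ids only.
        TopDocs top = searcher.search(query, null, n);
        List<String> candidates = new ArrayList<String>();
        for (ScoreDoc sd : top.scoreDocs) {
            candidates.add(searcher.doc(sd.doc, ID_ONLY).get("id"));
        }

        // postProcess() is a stand-in for the aggregation that decides
        // which documents are actually needed.
        List<String> kept = postProcess(candidates);
        if (kept.isEmpty()) {
            return new ArrayList<Document>();
        }

        // Pass 2: the q=id:(1 2 3 ...) idea as a single OR query.
        // Watch BooleanQuery's max clause count (1024 by default) if the
        // kept set is large.
        BooleanQuery byId = new BooleanQuery();
        for (String id : kept) {
            byId.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.SHOULD);
        }
        TopDocs hits = searcher.search(byId, kept.size());

        List<Document> full = new ArrayList<Document>();
        for (ScoreDoc sd : hits.scoreDocs) {
            full.add(searcher.doc(sd.doc));  // now load all stored fields
        }
        return full;
    }

    private static List<String> postProcess(List<String> ids) {
        return ids;  // placeholder for the real aggregation step
    }
}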
But I think you're between a rock and a hard place. If you can't push your processing down to the search, and you really *do* have to load all those docs, you will have painful queries. Have you thought of an SSD?

Best
Erick

On Tue, Dec 27, 2011 at 9:14 PM, Robert Bart <rb...@cs.washington.edu> wrote:
> Erick,
>
> Thanks for your reply! You are probably right to question how many Documents we are retrieving. We know it isn't best, but significantly reducing that number will require us to completely rebuild our system. Before we do that, we were just wondering if there was anything in the Lucene API or elsewhere that we were missing that might be able to help out.
>
> Search times for a single document from each index seem to run about 100ms. For normal queries (e.g. 3K Documents) the entire search time is spent loading documents (i.e. going from Fields to Strings). Our post-processing takes < 1s, and is done only after all docs are loaded.
>
> The problem with our current setup is that we need to retrieve a large sample of Documents and aggregate them in certain ways before we can tell which ones we really needed and which ones we didn't. This doesn't sound like something a Filter would be able to do, so I'm not sure how well we could push this down into the search process. Avoiding this aggregation step is what would require us to completely redesign the system - but maybe that's just what we'll have to do.
>
> Anyways, thanks for your help! Any other suggestions would be appreciated, but if there is no (relatively) easy solution, that's ok.
>
> Rob
>
> On Thu, Dec 22, 2011 at 4:51 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> I call into question why you "retrieve and materialize as many as 3,000 Documents from each index in order to display a page of results to the user". You have to be doing some post-processing, because displaying 12,000 documents to the user is completely useless.
>>
>> I wonder if this is an "XY" problem, see:
>> http://people.apache.org/~hossman/#xyproblem.
>>
>> You're seeking all over each disk for 3,000 documents, which will take time no matter what, especially if you're loading a bunch of fields.
>>
>> So let's back up a bit and ask why you think you need all those documents. Is it something you could push down into the search process?
>>
>> Also, 250M docs/index is a lot of docs. Before continuing, it would be useful to know your raw search performance if you, say, fetched 1 document from each partition, keeping in mind Lance's comments that the first searches load up a bunch of caches and will be slow. And, as he says, you can get around that with autowarming.
>>
>> But before going there, let's understand the root of the problem. Is it search speed, or just loading all those documents and then doing your post-processing?
>>
>> Best
>> Erick
>>
>> On Thu, Dec 22, 2011 at 3:16 AM, Lance Norskog <goks...@gmail.com> wrote:
>> > Is each index optimized?
>> >
>> > From my vague grasp of Lucene file formats, I think you want to sort the documents by segment document id, which is the order of documents on the disk. This lets you materialize documents in their order on the disk.
>> >
>> > Solr (and other apps) generally use a separate thread per task and separate index reading classes (not sure which any more).
>> >
>> > As to the cold start: how many terms are there? You are loading them into the field cache, right? Solr has a feature called "auto-warming" which automatically runs common queries each time it reopens an index.
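(To make the doc-id ordering above concrete: a small, untested sketch that re-sorts the hits by internal doc id before materializing them, so the stored-field reads move mostly forward through the files. How much it helps depends on how scattered the hits are.)

import java.util.Arrays;
import java.util.Comparator;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class DocIdOrderFetch {

    // Materialize hits in ascending internal doc-id order, which roughly
    // matches their order on disk.
    public static Document[] loadInDocIdOrder(IndexSearcher searcher, TopDocs top)
            throws Exception {
        ScoreDoc[] hits = top.scoreDocs.clone();  // leave the score-ordered array alone
        Arrays.sort(hits, new Comparator<ScoreDoc>() {
            public int compare(ScoreDoc a, ScoreDoc b) {
                return a.doc - b.doc;  // doc ids are small non-negative ints
            }
        });
        Document[] docs = new Document[hits.length];
        for (int i = 0; i < hits.length; i++) {
            // docs[i] pairs with hits[i] (doc-id order), not with the original
            // ranking; re-map by hits[i].doc if the ranking matters.
            docs[i] = searcher.doc(hits[i].doc);
        }
        return docs;
    }
}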
>> > On Wed, Dec 21, 2011 at 11:11 PM, Paul Libbrecht <p...@hoplahup.net> wrote:
>> >> Michael,
>> >>
>> >> from a physical point of view, it would seem that the order in which the documents are read matters a lot for reading speed (think of the random-access seeks as the issue).
>> >>
>> >> You could:
>> >> - move to a ram-disk or SSD to see whether it makes a difference?
>> >> - use something different than a searcher which might do it better (pure speculation: does a hit collector make a difference?)
>> >>
>> >> hope it helps.
>> >>
>> >> paul
>> >>
>> >> On 22 Dec 2011, at 03:45, Robert Bart wrote:
>> >>
>> >>> Hi All,
>> >>>
>> >>> I am running Lucene 3.4 in an application that indexes about 1 billion factual assertions (Documents) from the web over four separate disks, so that each disk has a separate index of about 250 million documents. The Documents are relatively small, less than 1KB each. These indexes provide data to our web demo (http://openie.cs.washington.edu), where a typical search needs to retrieve and materialize as many as 3,000 Documents from each index in order to display a page of results to the user.
>> >>>
>> >>> In the worst case, a new, uncached query takes around 30 seconds to complete, with all four disks I/O-bottlenecked during most of this time. My implementation uses a separate Thread per disk to (1) call IndexSearcher.search(Query query, Filter filter, int n) and (2) process the Documents returned from IndexSearcher.doc(int). Since 30 seconds seems like a long time to retrieve 3,000 small Documents, I am wondering if I am overlooking something simple somewhere.
>> >>>
>> >>> Is there a better method for retrieving documents in bulk?
>> >>>
>> >>> Is there a better way of parallelizing indexes from separate disks than to use a MultiReader (which doesn’t seem to parallelize the task of materializing Documents)?
>> >>>
>> >>> Any other suggestions? I have tried some of the basic ideas on the Lucene wiki, such as leaving the IndexSearcher open for the life of the process (a servlet). Any help would be greatly appreciated!
>> >>>
>> >>> Rob
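(For reference, a bare-bones sketch of the thread-per-disk retrieval described in the original post: one long-lived IndexSearcher per index directory, each searched and materialized on its own worker, with the results merged by the caller. The class and parameter names are illustrative only, not the actual demo code, and error handling is omitted.)

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class PerDiskSearch {

    private final List<IndexSearcher> searchers = new ArrayList<IndexSearcher>();
    private final ExecutorService pool;

    // One long-lived searcher per index directory (one directory per disk).
    public PerDiskSearch(String... indexDirs) throws Exception {
        for (String dir : indexDirs) {
            searchers.add(new IndexSearcher(IndexReader.open(FSDirectory.open(new File(dir)))));
        }
        pool = Executors.newFixedThreadPool(searchers.size());  // one worker per disk
    }

    // Search and materialize each index on its own thread, then merge.
    public List<Document> search(final Query query, final int perIndexHits) throws Exception {
        List<Future<List<Document>>> futures = new ArrayList<Future<List<Document>>>();
        for (final IndexSearcher searcher : searchers) {
            futures.add(pool.submit(new Callable<List<Document>>() {
                public List<Document> call() throws Exception {
                    TopDocs top = searcher.search(query, perIndexHits);
                    List<Document> docs = new ArrayList<Document>(top.scoreDocs.length);
                    for (ScoreDoc sd : top.scoreDocs) {
                        docs.add(searcher.doc(sd.doc));  // the disk-bound step
                    }
                    return docs;
                }
            }));
        }
        List<Document> merged = new ArrayList<Document>();
        for (Future<List<Document>> f : futures) {
            merged.addAll(f.get());  // aggregation/post-processing happens after this
        }
        return merged;
    }
}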
>> >
>> > --
>> > Lance Norskog
>> > goks...@gmail.com
>
> --
> Rob

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org