I'll take your word for it, though it seems odd. I'm wondering if there's anything you can do to pre-process the documents at index time to make the post-processing less painful, but that's a wild shot in the dark...

Another possibility would be to fetch only the fields you need to do the post-processing, then go back to the servers for the full documents, something like q=id:(1 2 3 4 5 6 .....), assuming that your post-processing reduces to a reasonable number of documents. I have no real clue whether you could reduce the data returned to few enough fields to really make a difference, but it might be a possibility.
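Very roughly, and untested, that two-pass idea could look like the sketch below. It assumes a stored and indexed "id" field and uses a postProcess() placeholder for whatever your aggregation actually does; it also runs both passes against one searcher rather than going back over the wire.

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class TwoPassFetch {

    // Load only the "id" field on the first pass.
    private static final FieldSelector ID_ONLY = new FieldSelector() {
        public FieldSelectorResult accept(String fieldName) {
            return "id".equals(fieldName) ? FieldSelectorResult.LOAD
                                          : FieldSelectorResult.NO_LOAD;
        }
    };

    public static List<Document> fetch(IndexSearcher searcher, Query query, int n)
            throws Exception {
        // Pass 1: run the real query, but materialize ids only.
        TopDocs top = searcher.search(query, null, n);
        List<String> candidates = new ArrayList<String>();
        for (ScoreDoc sd : top.scoreDocs) {
            candidates.add(searcher.doc(sd.doc, ID_ONLY).get("id"));
        }

        // postProcess() is a stand-in for the aggregation that decides
        // which documents are actually needed.
        List<String> kept = postProcess(candidates);
        if (kept.isEmpty()) {
            return new ArrayList<Document>();
        }

        // Pass 2: the q=id:(1 2 3 ...) idea as a single OR query.
        // Watch BooleanQuery's max clause count (1024 by default) if the
        // kept set is large.
        BooleanQuery byId = new BooleanQuery();
        for (String id : kept) {
            byId.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.SHOULD);
        }
        TopDocs hits = searcher.search(byId, kept.size());

        List<Document> full = new ArrayList<Document>();
        for (ScoreDoc sd : hits.scoreDocs) {
            full.add(searcher.doc(sd.doc));  // now load all stored fields
        }
        return full;
    }

    private static List<String> postProcess(List<String> ids) {
        return ids;  // placeholder for the real aggregation step
    }
}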
But I think you're between a rock and a hard place. If you can't push your processing down to the search, and you really *do* have to load all those docs, you will have painful queries. Have you thought of an SSD?

Best
Erick

On Tue, Dec 27, 2011 at 9:14 PM, Robert Bart <rb...@cs.washington.edu> wrote:
> Erick,
>
> Thanks for your reply! You are probably right to question how many Documents we are retrieving. We know it isn't best, but significantly reducing that number will require us to completely rebuild our system. Before we do that, we were just wondering if there was anything in the Lucene API or elsewhere that we were missing that might be able to help out.
>
> Search times for a single document from each index seem to run about 100ms. For normal queries (e.g. 3K Documents) the entire search time is spent loading documents (i.e. going from Fields to Strings). Our post-processing takes < 1s, and is done only after all docs are loaded.
>
> The problem with our current setup is that we need to retrieve a large sample of Documents and aggregate them in certain ways before we can tell which ones we really needed and which ones we didn't. This doesn't sound like something a Filter would be able to do, so I'm not sure how well we could push this down into the search process. Avoiding this aggregation step is what would require us to completely redesign the system - but maybe that's just what we'll have to do.
>
> Anyways, thanks for your help! Any other suggestions would be appreciated, but if there is no (relatively) easy solution, that's ok.
>
> Rob
>
> On Thu, Dec 22, 2011 at 4:51 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> I call into question why you "retrieve and materialize as many as 3,000 Documents from each index in order to display a page of results to the user". You have to be doing some post-processing, because displaying 12,000 documents to the user is completely useless.
>>
>> I wonder if this is an "XY" problem, see:
>> http://people.apache.org/~hossman/#xyproblem.
>>
>> You're seeking all over each disk for 3,000 documents, which will take time no matter what, especially if you're loading a bunch of fields.
>>
>> So let's back up a bit and ask why you think you need all those documents. Is it something you could push down into the search process?
>>
>> Also, 250M docs/index is a lot of docs. Before continuing, it would be useful to know your raw search performance if you, say, fetched 1 document from each partition, keeping in mind Lance's comments that the first searches load up a bunch of caches and will be slow. And, as he says, you can get around that with autowarming.
>>
>> But before going there, let's understand the root of the problem. Is it search speed, or just loading all those documents and then doing your post-processing?
>>
>> Best
>> Erick
>>
>> On Thu, Dec 22, 2011 at 3:16 AM, Lance Norskog <goks...@gmail.com> wrote:
>> > Is each index optimized?
>> >
>> > From my vague grasp of Lucene file formats, I think you want to sort the documents by segment document id, which is the order of documents on the disk. This lets you materialize documents in their order on the disk.
>> >
>> > Solr (and other apps) generally use a separate thread per task and separate index reading classes (not sure which any more).
>> >
>> > As to the cold start: how many terms are there? You are loading them into the field cache, right? Solr has a feature called "auto-warming" which automatically runs common queries each time it reopens an index.
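(To make the doc-id ordering above concrete: a small, untested sketch that re-sorts the hits by internal doc id before materializing them, so the stored-field reads move mostly forward through the files. How much it helps depends on how scattered the hits are.)

import java.util.Arrays;
import java.util.Comparator;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class DocIdOrderFetch {

    // Materialize hits in ascending internal doc-id order, which roughly
    // matches their order on disk.
    public static Document[] loadInDocIdOrder(IndexSearcher searcher, TopDocs top)
            throws Exception {
        ScoreDoc[] hits = top.scoreDocs.clone();  // leave the score-ordered array alone
        Arrays.sort(hits, new Comparator<ScoreDoc>() {
            public int compare(ScoreDoc a, ScoreDoc b) {
                return a.doc - b.doc;  // doc ids are small non-negative ints
            }
        });
        Document[] docs = new Document[hits.length];
        for (int i = 0; i < hits.length; i++) {
            // docs[i] pairs with hits[i] (doc-id order), not with the original
            // ranking; re-map by hits[i].doc if the ranking matters.
            docs[i] = searcher.doc(hits[i].doc);
        }
        return docs;
    }
}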
>> > On Wed, Dec 21, 2011 at 11:11 PM, Paul Libbrecht <p...@hoplahup.net> wrote:
>> >> Michael,
>> >>
>> >> from a physical point of view, it would seem that the order in which the documents are read matters a lot for reading speed (think of the random-access seeks as the issue).
>> >>
>> >> You could:
>> >> - move to a ram-disk or SSD to see whether it makes a difference?
>> >> - use something different than a searcher which might do it better (pure speculation: does a hit collector make a difference?)
>> >>
>> >> hope it helps.
>> >>
>> >> paul
>> >>
>> >> On 22 Dec 2011, at 03:45, Robert Bart wrote:
>> >>
>> >>> Hi All,
>> >>>
>> >>> I am running Lucene 3.4 in an application that indexes about 1 billion factual assertions (Documents) from the web over four separate disks, so that each disk has a separate index of about 250 million documents. The Documents are relatively small, less than 1KB each. These indexes provide data to our web demo (http://openie.cs.washington.edu), where a typical search needs to retrieve and materialize as many as 3,000 Documents from each index in order to display a page of results to the user.
>> >>>
>> >>> In the worst case, a new, uncached query takes around 30 seconds to complete, with all four disks I/O-bottlenecked during most of this time. My implementation uses a separate Thread per disk to (1) call IndexSearcher.search(Query query, Filter filter, int n) and (2) process the Documents returned from IndexSearcher.doc(int). Since 30 seconds seems like a long time to retrieve 3,000 small Documents, I am wondering if I am overlooking something simple somewhere.
>> >>>
>> >>> Is there a better method for retrieving documents in bulk?
>> >>>
>> >>> Is there a better way of parallelizing indexes from separate disks than to use a MultiReader (which doesn’t seem to parallelize the task of materializing Documents)?
>> >>>
>> >>> Any other suggestions? I have tried some of the basic ideas on the Lucene wiki, such as leaving the IndexSearcher open for the life of the process (a servlet). Any help would be greatly appreciated!
>> >>>
>> >>> Rob
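(For reference, a bare-bones sketch of the thread-per-disk retrieval described in the original post: one long-lived IndexSearcher per index directory, each searched and materialized on its own worker, with the results merged by the caller. The class and parameter names are illustrative only, not the actual demo code, and error handling is omitted.)

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class PerDiskSearch {

    private final List<IndexSearcher> searchers = new ArrayList<IndexSearcher>();
    private final ExecutorService pool;

    // One long-lived searcher per index directory (one directory per disk).
    public PerDiskSearch(String... indexDirs) throws Exception {
        for (String dir : indexDirs) {
            searchers.add(new IndexSearcher(IndexReader.open(FSDirectory.open(new File(dir)))));
        }
        pool = Executors.newFixedThreadPool(searchers.size());  // one worker per disk
    }

    // Search and materialize each index on its own thread, then merge.
    public List<Document> search(final Query query, final int perIndexHits) throws Exception {
        List<Future<List<Document>>> futures = new ArrayList<Future<List<Document>>>();
        for (final IndexSearcher searcher : searchers) {
            futures.add(pool.submit(new Callable<List<Document>>() {
                public List<Document> call() throws Exception {
                    TopDocs top = searcher.search(query, perIndexHits);
                    List<Document> docs = new ArrayList<Document>(top.scoreDocs.length);
                    for (ScoreDoc sd : top.scoreDocs) {
                        docs.add(searcher.doc(sd.doc));  // the disk-bound step
                    }
                    return docs;
                }
            }));
        }
        List<Document> merged = new ArrayList<Document>();
        for (Future<List<Document>> f : futures) {
            merged.addAll(f.get());  // aggregation/post-processing happens after this
        }
        return merged;
    }
}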
>> >
>> > --
>> > Lance Norskog
>> > goks...@gmail.com
>
> --
> Rob

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org