On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> Have you tried the very simple technique of just making an OR clause
> containing all the sources for a particular query and just letting
> it run? I was surprised at the speed...

I think the TermsFilter that I use does exactly that.
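To make the comparison concrete, here is a rough sketch of the two
variants (a sketch only: the class name, the "publication" field name
and the clause limit are stand-ins for whatever the real code uses, and
TermsFilter here is the class from contrib/queries):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TermsFilter; // contrib/queries

    public class PublicationRestriction {

        // Variant 1: Erick's suggestion, one SHOULD clause per selected
        // publication, to be ANDed with the user's query. With thousands
        // of publications this exceeds the default limit of 1024 clauses,
        // so the limit must be raised to avoid BooleanQuery.TooManyClauses.
        public static BooleanQuery asQuery(String[] publications) {
            BooleanQuery.setMaxClauseCount(10000);
            BooleanQuery q = new BooleanQuery();
            for (String pub : publications) {
                q.add(new TermQuery(new Term("publication", pub)),
                      BooleanClause.Occur.SHOULD);
            }
            return q;
        }

        // Variant 2: what I do now, a TermsFilter over the same terms,
        // which restricts the result set without scoring the terms.
        public static TermsFilter asFilter(String[] publications) {
            TermsFilter f = new TermsFilter();
            for (String pub : publications) {
                f.addTerm(new Term("publication", pub));
            }
            return f;
        }
    }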
> But before doing *any* of that, you need to find out, and tell us, what
> exactly is taking the time. Are you opening a new IndexReader for
> each query?

No.

> Are you iterating through a Hits object that has more than
> 100 (maybe it's 200 now) entries? Are you loading each document that
> satisfies the query? Etc. Etc.

Unfortunately, yes. I know this is another big source of slowness, but
due to another requirement that cannot be worked around at this stage,
I have to return all hits for a search for now. For each document I get
the docid (not Lucene's internal one), the date and the publication.
I've already used FieldCache to cache all three fields.
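To be concrete, the caching looks roughly like this (again a sketch
only: the class name and the field names are stand-ins for our actual
schema):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    public class HitFields {
        private final String[] docIds;
        private final String[] dates;
        private final String[] publications;

        // FieldCache.DEFAULT keys its cache on the IndexReader, so each
        // array is built on first use and reused for the lifetime of the
        // reader; the arrays are indexed by Lucene's internal doc number.
        public HitFields(IndexReader reader) throws IOException {
            docIds       = FieldCache.DEFAULT.getStrings(reader, "docid");
            dates        = FieldCache.DEFAULT.getStrings(reader, "date");
            publications = FieldCache.DEFAULT.getStrings(reader, "publication");
        }

        // 'doc' is the internal document number of a hit, so fetching the
        // three fields is a plain array access instead of a call to
        // IndexReader.document(doc).
        public String docId(int doc)       { return docIds[doc]; }
        public String date(int doc)        { return dates[doc]; }
        public String publication(int doc) { return publications[doc]; }
    }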
> Put some simple timers in your code and measure exactly what's taking the
> time before tuning your code. Time the call to search. Time the call for
> parsing. Time the assembly of the responses, in, say, blocks of 100.

Time for parsing: < 0.01 sec. Time for assembling the response, sending
it over the network, etc.: ignored. The 2-3 seconds is purely the time
spent in the call to Searcher.search(query, filter, n, sort).

> You simply cannot improve your code without knowing, through
> measurement, what is taking the time. Virtually every time I've tried to
> improve speed without measuring first, I've been wrong <G>..

I'll have to confess that if I take only the first 100 hits, the search
time can be brought down to around 1 second. Since I can't do that, I've
also tried to measure performance by taking out each individual factor
(sort, filter by date, filter by publications), and I found that the
filter by publication generally takes the most time. I forget the exact
figures, but removing it improved search time by around 0.5-1 seconds.

> BTW, have you looked over the suggestions here?
>
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

Yes, I've looked over it a couple of times already =) Getting faster
hardware and adding RAM are both good suggestions in my case; we will
eventually spread our indexes across a number of machines. But I would
still like to eliminate any inefficiencies in our search implementation
first.

> Best
> Erick
>
> On 8/13/07, Cedric Ho <[EMAIL PROTECTED]> wrote:
> >
> > Hi all,
> >
> > My problem is as follows:
> >
> > Our documents each come from a different publication, and we
> > currently have > 5000 different publication sources.
> >
> > Our clients can choose an arbitrary subset of the publications when
> > performing a search. It is not uncommon for a search to have to
> > match hundreds or thousands of publications.
> >
> > I currently index the publication information as a field in each
> > document and use a TermsFilter when performing the search. However,
> > the performance is less than satisfactory: many simple searches take
> > more than 2-3 seconds (our goal: < 0.5 seconds).
> >
> > Using a CachingWrapperFilter is great for search speed, but I've
> > done some calculations and figured that it is basically impossible
> > to cache every combination of publications, or even some common
> > combinations.
> >
> > Is there any other, more effective way to do the filtering?
> >
> > (I know that the slowness is not purely due to the publication
> > filter; we also have some other things that slow down the search.
> > But this one definitely contributes quite a lot to the overall
> > search time.)
> >
> > Regards,
> > Cedric
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]

--
[EMAIL PROTECTED]