I would presume that (like a lot of things) there is a power law at play in the 
popularity of publication sources (i.e. a small number of popular sources and a 
lot of unpopular ones).
The "Zipf" plugin in Luke can be used to illustrate this distribution for the 
values in your "publication source" field.

Given this disparity, it makes sense to cache Filters only for the most popular 
publication sources. Reading a large list of doc ids (the TermDocs) for these 
popular terms takes a lot of time, so caching them pays off. It is clearly not 
worth spending exactly the same amount of memory (i.e. a 
new BitSet(reader.maxDoc())) to cache an unpopular term whose TermDocs can be 
read from disk quickly.
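
To put rough numbers on that memory argument, here is a small back-of-envelope sketch. The index size of 10 million documents is purely an assumption for illustration (the original question does not state it), the class and method names are made up, and the cost model only counts the long[] words backing a java.util.BitSet, ignoring object overhead.

```java
/** Back-of-envelope cost of caching one BitSet per publication source (illustrative only). */
class FilterCacheCost {

    /** Bytes used by the long[] words behind a BitSet sized for maxDoc bits. */
    static long bytesPerCachedFilter(long maxDoc) {
        long words = (maxDoc + 63) / 64; // one 64-bit word per 64 documents
        return words * 8;
    }

    /** Total bytes if cachedSources filters are all kept in memory. */
    static long totalBytes(long maxDoc, int cachedSources) {
        return bytesPerCachedFilter(maxDoc) * cachedSources;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L; // ASSUMED index size, not from the original post
        System.out.println("per filter:   " + bytesPerCachedFilter(maxDoc) + " bytes");
        System.out.println("5000 sources: " + totalBytes(maxDoc, 5000) + " bytes");
        System.out.println("top 50 only:  " + totalBytes(maxDoc, 50) + " bytes");
    }
}
```

Under that assumption each cached filter costs about 1.25 MB regardless of how few documents the source has, so caching all 5000 sources would need roughly 6.25 GB while caching only the 50 most popular would need about 62.5 MB.
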
I would use BooleanFilter to combine the user's choices of publication source 
terms and use CachingWrapperFilter around the (popular) individual term Filters 
added to the BooleanFilter, rather than using CachingWrapperFilter around the 
BooleanFilter as a whole. This is because you are much more likely to get 
cache hits on the popular individual terms than on a user's particular 
selection of publication sources, and these cached items can be combined 
very quickly in the BooleanFilter.
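
The idea can be sketched without any Lucene classes at all. The code below is NOT the Lucene API: it is a plain-Java stand-in in which a Map of doc-id arrays plays the role of the TermDocs on disk, java.util.BitSet plays the role of a filter's bits, and the class name, field names and the "popular" set are all hypothetical. It only illustrates caching per popular term and OR-ing the pieces together per request.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java sketch: cache bits per popular source, read unpopular ones on demand. */
class PopularSourceFilter {
    private final int maxDoc;
    private final Map<String, int[]> postings;   // stand-in for TermDocs read from disk
    private final Set<String> popular;           // sources judged worth caching (assumption)
    private final Map<String, BitSet> cache = new HashMap<>();
    int cacheHits = 0;

    PopularSourceFilter(int maxDoc, Map<String, int[]> postings, Set<String> popular) {
        this.maxDoc = maxDoc;
        this.postings = postings;
        this.popular = popular;
    }

    /** Builds a full-size BitSet for one source by walking its doc ids. */
    private BitSet readFromPostings(String source) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : postings.getOrDefault(source, new int[0])) {
            bits.set(doc);
        }
        return bits;
    }

    /** ORs together the docs of every selected source, caching only the popular ones. */
    BitSet bits(List<String> selectedSources) {
        BitSet result = new BitSet(maxDoc);
        for (String source : selectedSources) {
            if (popular.contains(source)) {
                BitSet cached = cache.get(source);
                if (cached == null) {
                    cached = readFromPostings(source); // slow path, paid once
                    cache.put(source, cached);
                } else {
                    cacheHits++;
                }
                result.or(cached); // leaves the cached BitSet unmodified
            } else {
                // Short list: set bits directly, nothing is kept in memory.
                for (int doc : postings.getOrDefault(source, new int[0])) {
                    result.set(doc);
                }
            }
        }
        return result;
    }

    int cachedSourceCount() {
        return cache.size();
    }
}
```

Two users who pick different subsets that both include a popular source share that source's cached bits, whereas a cache keyed on the whole selection would only hit when the two selections were identical.
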

Hope this makes sense
Mark

----- Original Message ----
From: Cedric Ho <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, 13 August, 2007 5:17:52 AM
Subject: performance on filtering against thousands of different publications

Hi all,

My problem is as follows:

Our documents each come from a different publication, and we
currently have > 5000 different publication sources.

Our clients can choose an arbitrary subset of the publications when
performing a search. It is not uncommon for a search to have to
match hundreds or thousands of publications.

I currently index the publication information as a field in
each document and use a TermsFilter when performing a search. However,
the performance is less than satisfactory. Many simple searches take
more than 2-3 seconds (our goal: < 0.5 seconds).

Using the CachingWrapperFilter is great for search speed. But I've
done some calculations and figured that it is basically impossible to
cache all combinations of publications, or even just the common
combinations.


Is there any other more effective way to do the filtering?

(I know that the slowness is not purely due to the publication filter;
we also have some other things that slow down the search. But
this one definitely contributes quite a lot to the overall search
time.)

Regards,
Cedric

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





