On 8/13/07, mark harwood <[EMAIL PROTECTED]> wrote:
> I would presume that (like a lot of things) there is a power-law at play
> in the popularity of publication sources (i.e. a small number of popular
> sources and a lot of unpopular ones).
> The "Zipf" plugin in Luke can be used to illustrate this distribution for
> the values in your "publication source" field.
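As a rough toy illustration of the skew Mark describes (pure Java, not Lucene code; the 5000-source count matches the thread, but the Zipf exponent of 1 and the 10% cutoff are assumptions for the sake of the example): under a Zipf-like distribution, a small fraction of sources accounts for most of the postings, which is what makes caching only the popular ones worthwhile.

```java
// Toy model: the source ranked k gets relative weight 1/k (Zipf, exponent 1).
// Shows what fraction of all postings the most popular sources cover.
public class ZipfSkew {
    public static double topShare(int numSources, int topK) {
        double total = 0, top = 0;
        for (int k = 1; k <= numSources; k++) {
            double w = 1.0 / k;          // relative docFreq of the rank-k source
            total += w;
            if (k <= topK) top += w;
        }
        return top / total;
    }

    public static void main(String[] args) {
        int sources = 5000;
        // Share of postings covered by the most popular 10% of sources.
        double share = topShare(sources, sources / 10);
        System.out.printf("top 10%% of sources cover %.0f%% of postings%n", share * 100);
    }
}
```

Under these assumptions the top 10% of sources cover roughly three quarters of the postings, so caching filters for just those terms buys most of the benefit.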
Do you mean it will count the number of documents for each publication
source?

> Given this disparity, it makes sense to only cache Filters for the most
> popular publication sources. Reading a large list of doc ids (the
> TermDocs) for these popular terms takes a lot of time, so it makes sense
> to cache them, whereas it clearly is not valuable to use exactly the same
> amount of memory (i.e. a new BitSet(reader.maxDoc)) to cache an unpopular
> term whose TermDocs can be read from disk quickly.
> I would use BooleanFilter to combine the user's choices of publication
> source terms, and use CachingWrapperFilter around (popular) individual
> Term filters added to the BooleanFilter, rather than using
> CachingWrapperFilter around the BooleanFilter as a whole. This is because
> you are much more likely to get cache hits on the popular individual
> terms than on a user's particular selection of publication sources, and
> these cached items can be combined together in the BooleanFilter super
> fast.

We are also thinking about similar methods, i.e. caching some common
filters. Let me give a little more detail here. Our clients usually search
with only the default publication set. However, the default set of
publications varies a lot between clients, and each set can be quite large
(hundreds to thousands of publications). So we are thinking we may want to
use a cache of TermsFilters, where each TermsFilter filters for one set of
publications, and maybe use an LRU policy to manage the cache of filters.
This may eventually work, but we are also looking for other, better
alternatives.

Thanks,
Cedric

> Hope this makes sense
> Mark
>
> ----- Original Message ----
> From: Cedric Ho <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Monday, 13 August, 2007 5:17:52 AM
> Subject: performance on filtering against thousands of different publications
>
> Hi all,
>
> My problem is as follows:
>
> Our documents each come from a different publication.
> And we currently have > 5000 different publication sources.
>
> Our clients can choose an arbitrary subset of the publications when
> performing a search. It is not uncommon for a search to have to match
> hundreds or thousands of publications.
>
> I currently index the publication information as a field in each
> document, and use a TermsFilter when performing the search. However, the
> performance is less than satisfactory: many simple searches take more
> than 2-3 seconds (our goal: < 0.5 seconds).
>
> Using the CachingWrapperFilter is great for search speed. But I've done
> some calculations and figured that it is basically impossible to cache
> all combinations of publications, or even some common combinations.
>
> Is there any other more effective way to do the filtering?
>
> (I know that the slowness is not purely due to the publication filter;
> we also have some other things that slow down the search. But this one
> definitely contributes quite a lot to the overall search time.)
>
> Regards,
> Cedric
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
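The LRU-managed filter cache Cedric describes can be built on `java.util.LinkedHashMap`'s access-order mode. A minimal sketch under stated assumptions: the cache is generic, so in practice the key would be some canonical form of a publication set (e.g. a sorted, joined list of publication ids) and the value the corresponding TermsFilter; the class name and the capacity are placeholders, not anything from the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic LRU cache: LinkedHashMap in access-order mode moves each
// accessed entry to the tail, so the head is always the least recently
// used entry, which removeEldestEntry evicts once capacity is exceeded.
public class LruFilterCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruFilterCache(int capacity) {
        super(16, 0.75f, true);   // true = access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

In Cedric's setting, a hit on this cache would skip re-reading the TermDocs for a client's whole default publication set, at the cost of one cached filter (maxDoc bits each) per retained set.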