On 8/13/07, mark harwood <[EMAIL PROTECTED]> wrote:
> I would presume that (like a lot of things) there is a power-law at play
> in the popularity of publication sources (i.e. a small number of popular
> sources and a lot of unpopular ones).
> The "Zipf" plugin in Luke can be used to illustrate this distribution for
> the values in your "publication source" field.
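As a rough toy illustration of the skew Mark describes (pure Java, not Lucene code; the 5000-source count matches the thread, but the Zipf exponent of 1 and the 10% cutoff are assumptions for the sake of the example): under a Zipf-like distribution, a small fraction of sources accounts for most of the postings, which is what makes caching only the popular ones worthwhile.

```java
// Toy model: the source ranked k gets relative weight 1/k (Zipf, exponent 1).
// Shows what fraction of all postings the most popular sources cover.
public class ZipfSkew {
    public static double topShare(int numSources, int topK) {
        double total = 0, top = 0;
        for (int k = 1; k <= numSources; k++) {
            double w = 1.0 / k;          // relative docFreq of the rank-k source
            total += w;
            if (k <= topK) top += w;
        }
        return top / total;
    }

    public static void main(String[] args) {
        int sources = 5000;
        // Share of postings covered by the most popular 10% of sources.
        double share = topShare(sources, sources / 10);
        System.out.printf("top 10%% of sources cover %.0f%% of postings%n", share * 100);
    }
}
```

Under these assumptions the top 10% of sources cover roughly three quarters of the postings, so caching filters for just those terms buys most of the benefit.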
Do you mean it will count the number of documents for each publication
source?

> Given this disparity, it makes sense to only cache Filters for the most
> popular publication sources. Reading a large list of doc ids (the
> TermDocs) for these popular terms takes a lot of time, so it makes sense
> to cache them, whereas it clearly is not valuable to use exactly the same
> amount of memory (i.e. a new BitSet(reader.maxDoc)) to cache an unpopular
> term whose TermDocs can be read from disk quickly.
> I would use BooleanFilter to combine the user's choices of publication
> source terms, and use CachingWrapperFilter around (popular) individual
> Term filters added to the BooleanFilter, rather than using
> CachingWrapperFilter around the BooleanFilter as a whole. This is because
> you are much more likely to get cache hits on the popular individual
> terms than on a user's particular selection of publication sources, and
> these cached items can be combined together in the BooleanFilter super
> fast.

We are also thinking about similar methods, i.e. caching some common
filters. Let me give a little more detail here. Our clients usually search
with only the default publication set. However, the default set of
publications varies a lot between clients, and each set can be quite large
(hundreds to thousands of publications). So we are thinking we may want to
use a cache of TermsFilters, where each TermsFilter filters for one set of
publications, and maybe use an LRU policy to manage the cache of filters.
This may eventually work, but we are also looking for other, better
alternatives.

Thanks,
Cedric

> Hope this makes sense
> Mark
>
> ----- Original Message ----
> From: Cedric Ho <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Monday, 13 August, 2007 5:17:52 AM
> Subject: performance on filtering against thousands of different publications
>
> Hi all,
>
> My problem is as follows:
>
> Our documents each come from a different publication.
> And we currently have > 5000 different publication sources.
>
> Our clients can choose an arbitrary subset of the publications when
> performing a search. It is not uncommon for a search to have to match
> hundreds or thousands of publications.
>
> I currently index the publication information as a field in each
> document, and use a TermsFilter when performing the search. However, the
> performance is less than satisfactory: many simple searches take more
> than 2-3 seconds (our goal: < 0.5 seconds).
>
> Using the CachingWrapperFilter is great for search speed. But I've done
> some calculations and figured that it is basically impossible to cache
> all combinations of publications, or even some common combinations.
>
> Is there any other more effective way to do the filtering?
>
> (I know that the slowness is not purely due to the publication filter;
> we also have some other things that slow down the search. But this one
> definitely contributes quite a lot to the overall search time.)
>
> Regards,
> Cedric
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
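The LRU-managed filter cache Cedric describes can be built on `java.util.LinkedHashMap`'s access-order mode. A minimal sketch under stated assumptions: the cache is generic, so in practice the key would be some canonical form of a publication set (e.g. a sorted, joined list of publication ids) and the value the corresponding TermsFilter; the class name and the capacity are placeholders, not anything from the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic LRU cache: LinkedHashMap in access-order mode moves each
// accessed entry to the tail, so the head is always the least recently
// used entry, which removeEldestEntry evicts once capacity is exceeded.
public class LruFilterCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruFilterCache(int capacity) {
        super(16, 0.75f, true);   // true = access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

In Cedric's setting, a hit on this cache would skip re-reading the TermDocs for a client's whole default publication set, at the cost of one cached filter (maxDoc bits each) per retained set.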