performance on filtering against thousands of different publications

2007-08-12 Thread Cedric Ho
Hi all, My problem is as follows: Our documents each come from a different publication, and we currently have more than 5000 different publication sources. Our clients can choose an arbitrary subset of the publications when performing a search. It is not uncommon that a search will have to match hundr

Re: performance on filtering against thousands of different publications

2007-08-13 Thread Erick Erickson
Have you tried the very simple technique of just making an OR clause containing all the sources for a particular query and just letting it run? I was surprised at the speed... But before doing *any* of that, you need to find out, and tell us, what exactly is taking the time. Are you opening a new
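
The OR-clause idea mentioned above can be sketched in plain Java as a query-string builder. This is only an illustration: the field name `source`, the class name, and the helper `buildSourceClause` are hypothetical, and the sketch assumes publication identifiers are plain alphanumeric tokens that need no query-parser escaping. Keep in mind that Lucene's BooleanQuery enforces a maximum clause count (1024 by default, to my knowledge), so a selection larger than that would need the limit raised or a filter-based approach instead.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, not part of Lucene: builds "field:(a OR b OR c)".
class SourceOrClauseSketch {

    // Joins the selected publication ids into a single OR clause on one field.
    // Assumes ids contain no characters that are special to the query parser.
    static String buildSourceClause(String field, List<String> sources) {
        if (sources.isEmpty()) {
            throw new IllegalArgumentException("at least one source is required");
        }
        StringJoiner joiner = new StringJoiner(" OR ", field + ":(", ")");
        for (String source : sources) {
            joiner.add(source);
        }
        return joiner.toString();
    }
}
```

The resulting string could then be ANDed with the user's query text before parsing, which is one way to "just let it run" and measure the cost before trying anything more elaborate.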

Re: performance on filtering against thousands of different publications

2007-08-13 Thread mark harwood
n be combined together in the BooleanFilter super fast. Hope this makes sense Mark - Original Message From: Cedric Ho <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Monday, 13 August, 2007 5:17:52 AM Subject: performance on filtering against thousands of different publication

Re: performance on filtering against thousands of different publications

2007-08-13 Thread Cedric Ho
On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > Have you tried the very simple technique of just making an OR clause > containing all the sources for a particular query and just letting > it run? I was surprised at the speed... I think the TermsFilter that I use does exactly that. > > But

Re: performance on filtering against thousands of different publications

2007-08-13 Thread Cedric Ho
> Hope this makes sense > Mark > > - Original Message > From: Cedric Ho <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Monday, 13 August, 2007 5:17:52 AM > Subject: performance on filtering against thousands of different publications > > Hi al

Re: performance on filtering against thousands of different publications

2007-08-14 Thread mark harwood
eries. Cheers Mark - Original Message From: Cedric Ho <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, 14 August, 2007 3:39:10 AM Subject: Re: performance on filtering against thousands of different publications On 8/13/07, mark harwood <[EMAIL PROTECTED]> wrot

Re: performance on filtering against thousands of different publications

2007-08-14 Thread Steven Rowe
Hi Cedric, Cedric Ho wrote: > On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote: >> Are you iterating through a Hits object that has more than >> 100 (maybe it's 200 now) entries? Are you loading each document that >> satisfies the query? Etc. Etc. > > Unfortunately, yes. And I know this is an

Re: performance on filtering against thousands of different publications

2007-08-14 Thread Cedric Ho
> > Some options: > 1) Try to minimise leaping around the disk - maybe sorting your selected terms > will help. Look at methods in TermEnum and TermDocs which you can use to > build your own bitset from your (sorted) list of terms. Thanks, I'll try this method. > 2) Can you add higher-level terms
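
Option 1 above (sort the selected terms, then build a bitset from them) can be illustrated without any Lucene dependency. The sketch below is an assumption-laden stand-in: a `Map<String, int[]>` plays the role of the index's term dictionary and posting lists, where real code would walk TermEnum and TermDocs instead. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the postings map stands in for what TermDocs would
// return per term in a real index (term -> ids of documents containing it).
class PublicationBitSetSketch {

    static BitSet buildBitSet(Map<String, int[]> postings,
                              List<String> selectedTerms,
                              int maxDoc) {
        // Sort a copy of the selection so terms are visited in dictionary
        // order, which is the order an on-disk term dictionary is laid out in.
        List<String> sorted = new ArrayList<>(selectedTerms);
        Collections.sort(sorted);

        BitSet bits = new BitSet(maxDoc);
        for (String term : sorted) {
            int[] docs = postings.get(term);
            if (docs == null) {
                continue; // term not present in the index
            }
            for (int doc : docs) {
                bits.set(doc); // mark every document from this publication
            }
        }
        return bits;
    }
}
```

The resulting BitSet has one bit set per document that belongs to any selected publication, so it can be cached and reused as a filter for repeated searches over the same selection.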

Re: performance on filtering against thousands of different publications

2007-08-14 Thread Cedric Ho
Hi Steven, Thanks for your clarification. I am using the Searcher.search(query, filter, n, sort) method. I presume this method doesn't have the same problem, since I already pass it the max number of results returned. Regards, Cedric On 8/15/07, Steven Rowe <[EMAIL PROTECTED]> wrote: > Hi Cedr