Re: Performance problems with Lucene 2.9

Michel Nadeau Mon, 30 Nov 2009 08:27:33 -0800

Great, thanks!

So what do you guys think would be the best road for my application? I NEVER
want to retrieve -all- documents, only like maximum 200. I always need to
apply some filters and some sorts. From what I understand, in all cases I
should switch from Hits to a Collector for performance reasons - but then
how will I filter and sort?


- Mike
[email protected]


On Mon, Nov 30, 2009 at 11:20 AM, Uwe Schindler <[email protected]> wrote:

> Hits is deprecated and should no longer be used. The replacements are
> TopDocs *or* Collectors.
>
> If you add a number of max-scoring results you want to have (e.g. to
> display
> the first 10 results of a google-like query on a web page), use TopDocs.
> The
> method for that is Searcher.serach(Query q, int n) (or optionally with
> filter). This will only return the most-relevant results (n hits). This is
> for normal web pages that work like google. You cannot set n to a very high
> number and try to get all results, it would use to much memory (its only to
> collect most relevant docs). Hits was for the same purpose, but was build
> like an random-access result, which it not was, confusing users like you.
> Because of that incorrect usage, Hits was very slow if you try to get all
> hits and consumes all memory.
>
> If you need all results, you have to write a Collector, which is an
> abstract
> class. The search engine then calls collect() for each hit together with
> the
> document number. In your implementation of collect() you can consume the
> document ID and optionally retrieve the score. There are more methods to
> override in Collector, sample implementations are given in JavaDocs. You
> also need to take care of IndexReaders and document base ids.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
>
> > -----Original Message-----
> > From: Michel Nadeau [mailto:[email protected]]
> > Sent: Monday, November 30, 2009 5:10 PM
> > To: [email protected]
> > Subject: Re: Performance problems with Lucene 2.9
> >
> > What is the main difference between Hits and Collectors?
> >
> > - Mike
> > [email protected]
> >
> >
> > On Mon, Nov 30, 2009 at 11:03 AM, Uwe Schindler <[email protected]> wrote:
> >
> > > And if you only have a filter and apply it to all documents, make a
> > > ConstantScoreQuery on top of the filter:
> > >
> > > Query q=new ConstantScoreQuery(cluCF);
> > >
> > > Then remove the filter from your search method call and only execute
> > this
> > > query.
> > >
> > > And if you iterate over all results never-ever use Hits! (its already
> > > deprecated). Write a Collector instead (as you are not interested in
> > > scoring).
> > >
> > > And: If you replace a relational database with Lucene, be sure not to
> > think
> > > in a relational sense with foreign keys / primary keys and so on. In
> > > general
> > > you should flatten everything.
> > >
> > > Uwe
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: [email protected]
> > >
> > >
> > > > -----Original Message-----
> > > > From: Shai Erera [mailto:[email protected]]
> > > > Sent: Monday, November 30, 2009 4:56 PM
> > > > To: [email protected]
> > > > Subject: Re: Performance problems with Lucene 2.9
> > > >
> > > > Hi
> > > >
> > > > First you can use MatchAllDocsQuery, which matches all documents. It
> > will
> > > > save a HUGE posting list (TAG:TAG), and performs much faster. For
> > example
> > > > TAG:TAG computes a score for each doc, even though you don't need it.
> > > > MatchAllDocsQuery doesn't.
> > > >
> > > > Second, move away from Hits ! :) Use Collectors instead.
> > > >
> > > > If I understand the chain of filters, do you think you can code them
> > with
> > > > a
> > > > BooleanQuery that is added BooleanClauses, each with is Term
> > > > (field:value)?
> > > > You can add clauses w/ OR, AND, NOT etc.
> > > >
> > > > Note that in Lucene 2.9, you can avoid scoring documents very easily,
> > > > which
> > > > is a performance win if you don't need scores (i.e. if you just want
> > to
> > > > match everything, not caring for scores).
> > > >
> > > > Shai
> > > >
> > > > On Mon, Nov 30, 2009 at 5:47 PM, Michel Nadeau <[email protected]>
> > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > we use Lucene to store around 300 millions of records. We use the
> > index
> > > > > both
> > > > > for conventional searching, but also for all the system's data - we
> > > > > replaced
> > > > > MySQL with Lucene because it was simply not working at all with
> > MySQL
> > > > due
> > > > > to
> > > > > the amount or records. Our problem is that we have HUGE performance
> > > > > problems... whenever we search, it takes forever to return results,
> > and
> > > > > Java
> > > > > uses 100% CPU/RAM.
> > > > >
> > > > > Our index fields are like this:
> > > > >
> > > > > TYPE
> > > > > PK
> > > > > FOREIGN_PK
> > > > > TAG
> > > > > ...other information depending on type...
> > > > >
> > > > > * All fields are Field.Index.UN_TOKENIZED
> > > > > * The field "TAG" always contains the value "TAG".
> > > > >
> > > > > Whenever we search in the index, our query is "TAG:TAG" to match
> all
> > > > > documents, and we do the search like this:
> > > > >
> > > > >        // Search
> > > > >        Hits h = searcher.search(q, cluCF, cluSort);
> > > > >
> > > > > cluCF is a ChainedFilter containing all the other filters (like
> > > > > FOREIGN_PK=12345, TYPE=a, etc.).
> > > > >
> > > > > I know that the method is probably crazy because "TAG:TAG" is
> > matching
> > > > all
> > > > > 300M documents and then it applies filters; so that's probably why
> > > every
> > > > > little query is taking 100% CPU/RAM.... but I don't know how to do
> > it
> > > > > properly.
> > > > >
> > > > > Help ! Any advice is welcome.
> > > > >
> > > > > - Mike
> > > > > [email protected]
> > > > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: Performance problems with Lucene 2.9

Reply via email to