RE: Performance problems with Lucene 2.9

Uwe Schindler Mon, 30 Nov 2009 09:42:41 -0800

The total number is also returned in Top(Field)Docs, there is a getter
method.


-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -----Original Message-----
> From: Michel Nadeau [mailto:[email protected]]
> Sent: Monday, November 30, 2009 6:37 PM
> To: [email protected]
> Subject: Re: Performance problems with Lucene 2.9
> 
> The problem with this method is that I won't be able to know how many
> total
> results / pages a search have?
> 
> For example if I do a search X that returns 1,000,000 records, so 5,000
> pages of 200 items, I will only know if I have more when I'll hit "next
> page" - I won't be able to display "1,000,000 results / Page 1 of 5,000" ?
> 
> - Mike
> [email protected]
> 
> 
> On Mon, Nov 30, 2009 at 12:28 PM, Uwe Schindler <[email protected]> wrote:
> 
> > > Now I have another question... is there a way to specify a "start
> from"
> > so
> > > I
> > > could get page 2, 3, 4, etc.. ?
> >
> > Search the mailing list, this was explained quite often (by others and
> me).
> > The trick is:
> > If you have 200 results per page, with n = 200 you get the top ranking
> > results for the first page. If you want to have the second page,
> reexecute
> > the query with n=400 and display scoreDocs[200..399] and so on. All full
> > text engines work like this, they can only collect top-mathcing
> documents,
> > but no documents in a slot. If n gets too big it gets very slow and uses
> > too
> > much memory. Because of that, Google limits the maximum page number.
> >
> >
> > > On Mon, Nov 30, 2009 at 12:00 PM, Uwe Schindler <[email protected]>
> wrote:
> > >
> > > > > And sorting is done by the
> > > > > collector, Lucene has no idea how to sort.
> > > >
> > > > Sorting is done by the internal collector behind the
> > > > Top(Field)Docs-returning method (your own collectors would have to
> do
> > it
> > > > themselves). If you call search(Query, n,... Sort), internally an
> > > collector
> > > > is created that does the sorting for you and throws away all results
> > > that
> > > > do
> > > > not fall into the first 200 hits (if n=200).
> > > >
> > > > > If you use Sort, the returned
> > > > > TopDocs will be sorted.
> > > > >
> > > > > If you do not sort at all and do not score your results, TopDocs
> is
> > > not
> > > > > very
> > > > > useful, because the first 200 hits cannot be ranked.
> > > > >
> > > > > -----
> > > > > Uwe Schindler
> > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > http://www.thetaphi.de
> > > > > eMail: [email protected]
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Michel Nadeau [mailto:[email protected]]
> > > > > > Sent: Monday, November 30, 2009 5:35 PM
> > > > > > To: [email protected]
> > > > > > Subject: Re: Performance problems with Lucene 2.9
> > > > > >
> > > > > > I'll definitely switch to a Collector.
> > > > > >
> > > > > > It's just not clear for me if I should use BooleanQueries or
> > > > > > MatchAllDocuments+Filters ?
> > > > > >
> > > > > > And should I write my own collector or the TopDocs one is
> perfect
> > > for
> > > > me
> > > > > ?
> > > > > >
> > > > > > - Mike
> > > > > > [email protected]
> > > > > >
> > > > > >
> > > > > > On Mon, Nov 30, 2009 at 11:30 AM, Erick Erickson
> > > > > > <[email protected]>wrote:
> > > > > >
> > > > > > > The problem with hits is that a it re-executes the query
> > > > > > > every N documents where N is 100 (?).
> > > > > > >
> > > > > > > So, a loop like
> > > > > > > for (int idx : hits.length) {
> > > > > > >   do something....
> > > > > > > }
> > > > > > >
> > > > > > > Assuming my memory is right and it's every 100, your query
> will
> > > > > > > re-execute (length/100) times. Which is unfortunate.
> > > > > > >
> > > > > > > The very quick test to see where to concentrate first would be
> to
> > > > take
> > > > > > > a time stamp just before you hit your loop.....
> > > > > > >
> > > > > > > This will tell you whether this loop is the culprit, but it
> > really
> > > > > > doesn't
> > > > > > > matter because you'll follow the advice from Uwe and Shai
> anyway
> > > <G>.
> > > > > > >
> > > > > > > Filtering and Sorting are applied to Collectors before you see
> > > > > them.....
> > > > > > >
> > > > > > > The other bit would be to investigate your sorting. Remember
> that
> > > the
> > > > > > > first sort or two take quite a while since the relevant caches
> > are
> > > > > > > populated with first used, so second+ queries should be
> faster.
> > > The
> > > > > > > Wiki has some timing/speedup advice.....
> > > > > > >
> > > > > > > Best
> > > > > > > Erick
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Nov 30, 2009 at 11:10 AM, Michel Nadeau <
> > [email protected]>
> > > > > > wrote:
> > > > > > >
> > > > > > > > What is the main difference between Hits and Collectors?
> > > > > > > >
> > > > > > > > - Mike
> > > > > > > > [email protected]
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Nov 30, 2009 at 11:03 AM, Uwe Schindler
> > > <[email protected]>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > And if you only have a filter and apply it to all
> documents,
> > > make
> > > > > a
> > > > > > > > > ConstantScoreQuery on top of the filter:
> > > > > > > > >
> > > > > > > > > Query q=new ConstantScoreQuery(cluCF);
> > > > > > > > >
> > > > > > > > > Then remove the filter from your search method call and
> only
> > > > > execute
> > > > > > > this
> > > > > > > > > query.
> > > > > > > > >
> > > > > > > > > And if you iterate over all results never-ever use Hits!
> (its
> > > > > > already
> > > > > > > > > deprecated). Write a Collector instead (as you are not
> > > interested
> > > > > in
> > > > > > > > > scoring).
> > > > > > > > >
> > > > > > > > > And: If you replace a relational database with Lucene, be
> > sure
> > > > not
> > > > > > to
> > > > > > > > think
> > > > > > > > > in a relational sense with foreign keys / primary keys and
> so
> > > on.
> > > > > In
> > > > > > > > > general
> > > > > > > > > you should flatten everything.
> > > > > > > > >
> > > > > > > > > Uwe
> > > > > > > > >
> > > > > > > > > -----
> > > > > > > > > Uwe Schindler
> > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > > http://www.thetaphi.de
> > > > > > > > > eMail: [email protected]
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Shai Erera [mailto:[email protected]]
> > > > > > > > > > Sent: Monday, November 30, 2009 4:56 PM
> > > > > > > > > > To: [email protected]
> > > > > > > > > > Subject: Re: Performance problems with Lucene 2.9
> > > > > > > > > >
> > > > > > > > > > Hi
> > > > > > > > > >
> > > > > > > > > > First you can use MatchAllDocsQuery, which matches all
> > > > > documents.
> > > > > > It
> > > > > > > > will
> > > > > > > > > > save a HUGE posting list (TAG:TAG), and performs much
> > > faster.
> > > > > For
> > > > > > > > example
> > > > > > > > > > TAG:TAG computes a score for each doc, even though you
> > don't
> > > > > need
> > > > > > it.
> > > > > > > > > > MatchAllDocsQuery doesn't.
> > > > > > > > > >
> > > > > > > > > > Second, move away from Hits ! :) Use Collectors instead.
> > > > > > > > > >
> > > > > > > > > > If I understand the chain of filters, do you think you
> can
> > > code
> > > > > > them
> > > > > > > > with
> > > > > > > > > > a
> > > > > > > > > > BooleanQuery that is added BooleanClauses, each with is
> > Term
> > > > > > > > > > (field:value)?
> > > > > > > > > > You can add clauses w/ OR, AND, NOT etc.
> > > > > > > > > >
> > > > > > > > > > Note that in Lucene 2.9, you can avoid scoring documents
> > > very
> > > > > > easily,
> > > > > > > > > > which
> > > > > > > > > > is a performance win if you don't need scores (i.e. if
> you
> > > just
> > > > > > want
> > > > > > > to
> > > > > > > > > > match everything, not caring for scores).
> > > > > > > > > >
> > > > > > > > > > Shai
> > > > > > > > > >
> > > > > > > > > > On Mon, Nov 30, 2009 at 5:47 PM, Michel Nadeau
> > > > > <[email protected]>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > we use Lucene to store around 300 millions of records.
> We
> > > use
> > > > > > the
> > > > > > > > index
> > > > > > > > > > > both
> > > > > > > > > > > for conventional searching, but also for all the
> system's
> > > > data
> > > > > -
> > > > > > we
> > > > > > > > > > > replaced
> > > > > > > > > > > MySQL with Lucene because it was simply not working at
> > all
> > > > > with
> > > > > > > MySQL
> > > > > > > > > > due
> > > > > > > > > > > to
> > > > > > > > > > > the amount or records. Our problem is that we have
> HUGE
> > > > > > performance
> > > > > > > > > > > problems... whenever we search, it takes forever to
> > return
> > > > > > results,
> > > > > > > > and
> > > > > > > > > > > Java
> > > > > > > > > > > uses 100% CPU/RAM.
> > > > > > > > > > >
> > > > > > > > > > > Our index fields are like this:
> > > > > > > > > > >
> > > > > > > > > > > TYPE
> > > > > > > > > > > PK
> > > > > > > > > > > FOREIGN_PK
> > > > > > > > > > > TAG
> > > > > > > > > > > ...other information depending on type...
> > > > > > > > > > >
> > > > > > > > > > > * All fields are Field.Index.UN_TOKENIZED
> > > > > > > > > > > * The field "TAG" always contains the value "TAG".
> > > > > > > > > > >
> > > > > > > > > > > Whenever we search in the index, our query is
> "TAG:TAG"
> > to
> > > > > match
> > > > > > > all
> > > > > > > > > > > documents, and we do the search like this:
> > > > > > > > > > >
> > > > > > > > > > >        // Search
> > > > > > > > > > >        Hits h = searcher.search(q, cluCF, cluSort);
> > > > > > > > > > >
> > > > > > > > > > > cluCF is a ChainedFilter containing all the other
> filters
> > > > > (like
> > > > > > > > > > > FOREIGN_PK=12345, TYPE=a, etc.).
> > > > > > > > > > >
> > > > > > > > > > > I know that the method is probably crazy because
> > "TAG:TAG"
> > > is
> > > > > > > > matching
> > > > > > > > > > all
> > > > > > > > > > > 300M documents and then it applies filters; so that's
> > > > probably
> > > > > > why
> > > > > > > > > every
> > > > > > > > > > > little query is taking 100% CPU/RAM.... but I don't
> know
> > > how
> > > > > to
> > > > > > do
> > > > > > > it
> > > > > > > > > > > properly.
> > > > > > > > > > >
> > > > > > > > > > > Help ! Any advice is welcome.
> > > > > > > > > > >
> > > > > > > > > > > - Mike
> > > > > > > > > > > [email protected]
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > ------------------------------------------------------------------
> > > > > --
> > > > > > -
> > > > > > > > > To unsubscribe, e-mail: java-user-
> > > [email protected]
> > > > > > > > > For additional commands, e-mail:
> > > > [email protected]
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > > > ------------------------------------------------------------------
> ---
> > > > > To unsubscribe, e-mail: [email protected]
> > > > > For additional commands, e-mail: [email protected]
> > > >
> > > >
> > > >
> > > > --------------------------------------------------------------------
> -
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
> > > >
> > > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

RE: Performance problems with Lucene 2.9

Reply via email to