Re: BufferedIndexInput.readByte performance

Paul Elschot Fri, 26 May 2006 09:27:03 -0700

On Friday 26 May 2006 16:14, Michael Chan wrote:
> Hi,
> 
> I have a 5gb index containing 2mil documents and am trying to run
> 1mil+ queries against it. Most of the queries are SpanQueries and it
> occurs to me that the search performance is quite slow when using 2, 3
> SpanOrQueries nested inside a SpanNearQuery, which in turn is nested
> inside another SpanNearQuery. The response time is around 3-5 seconds
> even when the index is stored as a RAMDirectory, and
> BufferedIndexInput.readByte() appears to be the bottleneck. Is this
> performance typical? As I don't need any sorting of the results and


That is indeed typical.

> only need the number of results returned, is there anything, besides
> Field.setOmitNorms(true), I can modify to improve performance?

A few things might help:
- Call getSpans() on the query, iterate the resulting Spans,
  and count the number of distinct doc values.
  This avoids both the scoring and the sorting by score value.
- Sort the queries alphabetically, to try and maximize cache usage.
- Increase the skip interval when creating the index. By default Lucene uses
  16, but Nutch uses a higher value. I've never done this myself, but
  you could ask specifically how to do this.
- If you use an ordered SpanNearQuery, the implementation here might be
  somewhat faster:
  http://issues.apache.org/jira/browse/LUCENE-413
- Specialize the PriorityQueue used in SpanNearQuery. This is somewhat more
  intricate than the specialization done for disjunctions here:
  http://issues.apache.org/jira/browse/LUCENE-365
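
The first suggestion above (counting distinct documents while iterating
spans) could look roughly like the sketch below. It is only an illustration:
DocSpans is a hypothetical stand-in for the next() and doc() methods of
Lucene's Spans (whose real methods also throw IOException), ArraySpans is a
made-up in-memory implementation used to exercise the loop, and the code
assumes that matches are enumerated in increasing document order. In real
code the Spans would come from something like SpanQuery.getSpans(IndexReader).

```java
final class SpanDocCounter {

    // Hypothetical stand-in for the two Spans methods the loop needs.
    interface DocSpans {
        boolean next(); // advance to the next match; false when exhausted
        int doc();      // document number of the current match
    }

    // Counts distinct documents, assuming matches arrive in increasing
    // document order, so a change in doc() marks a new document.
    static int countDistinctDocs(DocSpans spans) {
        int count = 0;
        int lastDoc = -1; // document numbers are non-negative
        while (spans.next()) {
            int doc = spans.doc();
            if (doc != lastDoc) {
                count++;
                lastDoc = doc;
            }
        }
        return count;
    }

    // Made-up in-memory spans: one array entry per match, holding its doc.
    static final class ArraySpans implements DocSpans {
        private final int[] docs;
        private int pos = -1;

        ArraySpans(int[] docs) {
            this.docs = docs;
        }

        public boolean next() {
            return ++pos < docs.length;
        }

        public int doc() {
            return docs[pos];
        }
    }

    public static void main(String[] args) {
        // Six matches spread over documents 1, 3 and 7.
        DocSpans spans = new ArraySpans(new int[] {1, 1, 3, 3, 3, 7});
        System.out.println(countDistinctDocs(spans)); // prints 3
    }
}
```

No score is computed and no hit queue is filled; the loop only compares
document numbers, which is why this should be cheaper than a normal search
when only the number of matching documents is needed.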

Regards,
Paul Elschot
