On Friday 26 May 2006 16:14, Michael Chan wrote: > Hi, > > I have a 5gb index containing 2mil documents and am trying to run > 1mil+ queries against it. Most of the queries are SpanQueries and it > occurs to me that the search performance is quite slow when using 2, 3 > SpanOrQueries nested inside a SpanNearQuery, which in turn is nested > inside another SpanNearQuery. The response time is around 3-5 seconds > even when the index is stored as a RAMDirectory, and > BufferedIndexInput.readByte() appears to be the bottleneck. Is this > performance typical? As I don't need any sorting of the results and
That is indeed typical. > only need the number of results returned, is there anything, besides > Field.setOmitNorm(true), I can modify to improve performance? A few things might help: - use getSpans() on the scorer of the query, iterate the resulting Spans and count the number of different doc values. This saves the scoring and the sorting on score value. - Sort the queries alphabetically, to try and maximize cache usage. - Increase the skip interval when creating the index, by default lucene uses 16, but nutch uses a higher value. I've never done this myself, but you could specifically ask on how to do this. - In case you use ordered SpanNearQuery, the implementation here might be somewhat faster: http://issues.apache.org/jira/browse/LUCENE-413 - Specialize the PriorityQueue used in SpanNearQuery, somewhat more intricate than this one for disjunctions: http://issues.apache.org/jira/browse/LUCENE-365 Regards, Paul Elschot --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]