If possible, you should try to use a larger corpus (eg Wikipedia)
rather than multiplying Reuters by N, which creates an unnatural
term-frequency distribution.
The graphs are hard to read because of the spline interpolation.
Maybe you could overlay X's where there is a real datapoint?
After the 6 rounds at each doc count, how do you then derive the
number to put on the graph?
It's best to use a real query log, if possible, to run the queries.
If you are expecting your production machines to have plenty of RAM to
hold the index, then you should definitely run the queries through
once and discard the results, so that everything gets loaded into RAM,
including the OS caching all required data in its IO cache.
Not opening/closing a reader per search should change the graphs quite
a bit (for the better) and hopefully change some of the odd things you
are seeing (in the questions below).
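Something like this sketch, against the 2.x API (the index contents are
invented, and I've used a RAMDirectory so it runs standalone; in your
case it would be the existing on-disk index):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class WarmSearch {
    // Build a small in-memory index so the sketch is self-contained.
    static RAMDirectory buildIndex(int numDocs) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        for (int i = 0; i < numDocs; i++) {
            Document doc = new Document();
            doc.add(new Field("body", "issue number " + i,
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.close();
        return dir;
    }

    public static void main(String[] args) throws Exception {
        RAMDirectory dir = buildIndex(1000);

        // Open the searcher ONCE and reuse it for every query; opening
        // and closing a reader per search throws away everything the
        // previous search loaded into memory.
        IndexSearcher searcher = new IndexSearcher(dir);
        TermQuery query = new TermQuery(new Term("body", "issue"));

        // Warm-up pass: run the query once and discard the result so the
        // index data is resident before timing starts.
        searcher.search(query, 10);

        // Timed pass: this is the number worth graphing.
        long t0 = System.nanoTime();
        TopDocs hits = searcher.search(query, 10);
        long micros = (System.nanoTime() - t0) / 1000;
        System.out.println(hits.totalHits + " hits in " + micros + " us");
        searcher.close();
    }
}
```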
Mike
Justus Pendleton wrote:
Howdy,
I have a couple of questions regarding some Lucene benchmarking and
what the results mean[3]. (Skip to the numbered list at the end if
you don't want to read the lengthy exegesis :)
I'm a developer for JIRA[1]. We are currently trying to get a better
understanding of Lucene, and our use of it, to cope with the needs
of our larger customers. These "large" indexes are only a couple
hundred thousand documents, but our problem is compounded by the fact
that they have a relatively high rate of modification (= delete +
insert of a new document), and our users expect these modifications to
show up in query results pretty much instantly.
Our current default behaviour is a merge factor of 4. We perform an
optimization on the index every 4000 additions. We also perform an
optimize at midnight. Our fundamental problem is that these
optimizations are locking the index for unacceptably long periods of
time, something that we want to resolve for our next major release,
hopefully without undermining search performance too badly.
In the Lucene javadoc there is a comment, and a link to a mailing
list discussion[2], that suggests applications such as JIRA should
never perform optimize but should instead set their merge factor
very low.
In an attempt to understand the impact of a) lowering the merge
factor from 4 to 2 and b) never, ever optimizing on an index (over
the course of years and millions of additions/updates) I wanted to
try to benchmark Lucene.
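Concretely, the knob in question is IndexWriter.setMergeFactor(). A
self-contained sketch against the 2.x API (index contents invented,
RAMDirectory used so it runs standalone):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class MergeFactorDemo {
    static int buildIndex(int numDocs, int mergeFactor) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        // A low merge factor makes Lucene merge segments more aggressively
        // as documents are added, keeping the segment count small so that
        // searches stay fast without an explicit optimize() call locking
        // the index. The minimum legal value is 2.
        writer.setMergeFactor(mergeFactor);

        for (int i = 0; i < numDocs; i++) {
            Document doc = new Document();
            doc.add(new Field("body", "document " + i,
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        // Note: no writer.optimize() anywhere.
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        int count = reader.numDocs();
        reader.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("docs=" + buildIndex(1000, 2));
    }
}
```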
I used the contrib/benchmark framework and wrote a small algorithm
that adds documents to an index (using the Reuters doc generator),
does a search, does an optimize, then does another search. All the
pretty pictures can be seen at:
http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
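Roughly, the .alg file looks like the skeleton below. This is a
simplified sketch based on the sample .alg files shipped with
contrib/benchmark, not my exact file, and the property and task names
may differ slightly between versions:

```text
# add docs, search, optimize, search again
merge.factor=4
directory=FSDirectory
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker

ResetSystemErase
CreateIndex
{ "AddDocs" AddDoc > : 10000
CloseIndex

OpenReader
{ "SearchPreOpt" Search > : 100
CloseReader

OpenIndex
Optimize
CloseIndex

OpenReader
{ "SearchPostOpt" Search > : 100
CloseReader

RepSumByName
```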
I have several questions, hopefully they aren't overwhelming in
their quantity :-/
1. Why does the merge factor of 4 appear to be faster than the merge
factor of 2?
2. Why does non-optimized searching appear to be faster than
optimized searching once the index hits ~500,000 documents?
3. There appears to be a fairly sizable performance drop across the
board around 450,000 documents. Why is that?
4. Searching performance appears to decrease towards a fairly
pessimistic 20 searches per second (for a relatively simple search).
Is this really what we should expect long-term from Lucene?
5. Does my benchmark even make sense? I am far from an expert on
benchmarking so it is possible I'm not measuring what I think I am
measuring.
Thanks in advance for any insight you can provide. This is an area
that we very much want to understand better as Lucene is a key part
of JIRA's success,
Cheers,
Justus
JIRA Developer
[1]: http://www.atlassian.com
[2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
[3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]