[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436934 ]

Mike Klaas commented on LUCENE-675:
-----------------------------------
A few notes on benchmarks:

First, it is important to realize that no benchmark will ever fully capture all aspects of Lucene performance, particularly since real-world data distributions are so varied. That said, benchmarks are useful tools, especially if they are componentized to measure individual aspects of Lucene performance (the narrower the goal of the benchmark, the better the benchmark that can be created).

It is rather unrealistic to expect to standardize hardware/OS; it is better to compare before/after numbers on a single configuration than to compare numbers across configurations. The test process _is_ important, but anything crucial should be built into the test itself (such as the number of iterations, how the average is taken, etc.).

Concerning the specifics of this proposal: requiring reboots is onerous and not an important criterion (at least for Unix systems--I'm not sufficiently familiar with Windows to comment). It is better to stipulate a relatively quiescent machine. Or perhaps not--it might be useful to see how machine load affects Lucene performance.

Also, the arithmetic mean is a terrible way of combining results, due to its sensitivity to outliers. Better is the average over the minimum times of small sets of runs. Of course, any scheme has its problems. In general, the most important thing when using benchmarks is being aware of the limitations of the benchmark and the methodology used.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing
> and querying, on a known corpus. This issue is intended to collect comments
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is
> the original Reuters collection, available from
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz
> or
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
> I propose to use this corpus as a base for benchmarks. The benchmarking
> suite could automatically retrieve it from known locations, and cache it
> locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
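As an aside, the outlier-robust aggregation suggested in the comment above (averaging the minimum times of small batches of runs, rather than taking a plain arithmetic mean) could be sketched roughly as follows. This is a minimal illustration, not part of Lucene or the attached LuceneBenchmark.java; all class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of an outlier-robust timing scheme: run the task in small
 * batches, keep the minimum time of each batch, and average those
 * minima. A single slow outlier run then cannot skew the result,
 * unlike with a plain arithmetic mean over all runs.
 */
public class BenchmarkStats {

    /** Times one run of a task, in milliseconds. */
    static long timeRun(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Average of per-batch minimum times. */
    static double averageOfBatchMinima(Runnable task, int batches, int runsPerBatch) {
        List<Long> minima = new ArrayList<>();
        for (int b = 0; b < batches; b++) {
            long best = Long.MAX_VALUE;
            for (int r = 0; r < runsPerBatch; r++) {
                best = Math.min(best, timeRun(task));
            }
            minima.add(best);
        }
        return minima.stream().mapToLong(Long::longValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Toy CPU-bound workload standing in for an indexing or query task.
        Runnable task = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
        };
        System.out.println("avg of batch minima (ms): "
                + averageOfBatchMinima(task, 3, 5));
    }
}
```

As noted, any scheme has its problems: per-batch minima understate steady-state variance, so the batch size and count would themselves need to be fixed as part of the test definition.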