[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436934 ] 
            
Mike Klaas commented on LUCENE-675:
-----------------------------------

A few notes on benchmarks:

First, it is important to realize that no benchmark will ever fully capture all 
aspects of Lucene performance, particularly since real-world data 
distributions vary so widely.  That said, benchmarks are useful tools, especially 
if they are componentized to measure individual aspects of Lucene performance 
(the narrower the goal of a benchmark, the better it can be designed).

It is rather unrealistic to expect to standardize hardware / OS ... better to 
compare before/after numbers on a single configuration than to compare 
numbers across configurations.  The test process _is_ important, but 
anything crucial should be built into the test itself (the number of 
iterations, how the results are averaged, etc).  Concerning the specifics: 
requiring reboots is onerous and not an important criterion (at least for unix 
systems--I'm not sufficiently familiar with windows to comment).  Better to 
stipulate a relatively quiescent machine.  Or perhaps not--it might be useful 
to see how machine load affects Lucene performance.  Also, the arithmetic mean 
is a terrible way of combining results because of its sensitivity to outliers.  
Better is the average over the minimum times of small sets of runs.
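
To make the last point concrete, here is a rough sketch of what "average over 
minimum times of small sets of runs" could look like.  The class name, method 
names and batch size are made up for illustration (they are not taken from the 
attached LuceneBenchmark.java), and the sample timings are invented numbers, 
not measurements.

```java
public class MinOfBatchesTimer {

    /** Plain arithmetic mean of all samples, included only for comparison. */
    public static double mean(long[] samples) {
        double sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    /**
     * Splits the samples into consecutive batches of batchSize (the last
     * batch may be shorter), takes the minimum of each batch, and returns
     * the average of those minima.
     */
    public static double meanOfBatchMinima(long[] samples, int batchSize) {
        if (samples.length == 0 || batchSize <= 0) {
            throw new IllegalArgumentException("need samples and a positive batch size");
        }
        double sumOfMinima = 0;
        int batches = 0;
        for (int start = 0; start < samples.length; start += batchSize) {
            int end = Math.min(start + batchSize, samples.length);
            long min = Long.MAX_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, samples[i]);
            }
            sumOfMinima += min;
            batches++;
        }
        return sumOfMinima / batches;
    }

    /** Runs the task the given number of times and records each elapsed time in nanoseconds. */
    public static long[] time(Runnable task, int runs) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - t0;
        }
        return samples;
    }

    public static void main(String[] args) {
        // Invented timings: one run (900) was disturbed by something else on the machine.
        long[] samples = {100, 102, 900, 101, 99, 103};
        System.out.println("mean              = " + mean(samples));
        System.out.println("mean of batch min = " + meanOfBatchMinima(samples, 3));
    }
}
```

With these invented samples the single disturbed run pulls the plain mean well 
above 200, while the batch minima are 100 and 99, so the combined figure stays 
at 99.5.  The batch size is a judgment call: larger batches discard more noise 
but also hide more of the real variation.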

Of course, any scheme has its problems.  In general, the most important thing 
when using benchmarks is being aware of the limitations of the benchmark and 
methodology used.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing 
> and querying, on a known corpus. This issue is intended to collect comments 
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is 
> the original Reuters collection, available from 
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
> or 
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
>  I propose to use this corpus as a base for benchmarks. The benchmarking 
> suite could automatically retrieve it from known locations, and cache it 
> locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
