[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Doron Cohen (JIRA) Tue, 07 Nov 2006 02:51:09 -0800

     [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]


Doron Cohen updated LUCENE-675:
-------------------------------

    Attachment: timedata.zip

I tried it and it works nicely:
The 1st run downloaded the documents from the Web before starting to index.
The 2nd run started right away, since the input docs were already in place. Great.

It seems the only output is what is printed to stdout, right?

I got something like this: 
----------------------------
     [echo] Working Directory: work
     [java] Testing 4 different permutations.
     [java] #-- ID: td-00_10_10, Sun Nov 05 22:40:49 PST 2006, heap=1065484288 --
     [java] # source=work\reuters-out, [EMAIL PROTECTED]:\devoss\lucene\java\trunk\contrib\benchmark\work\index
     [java] # maxBufferedDocs=10, mergeFactor=10, compound=true, optimize=true
     [java] # Query data: R-reopen, W-warmup, T-retrieve, N-no
     [java] # qd-0110 R W NT [body:salomon]
     [java] # qd-0111 R W T [body:salomon]
     [java] # qd-0100 R NW NT [body:salomon]
...
     [java] # qd-14011 NR W T [body:fo*]
     [java] # qd-14000 NR NW NT [body:fo*]
     [java] # qd-14001 NR NW T [body:fo*]

     [java] Start Time: Sun Nov 05 22:41:38 PST 2006
     [java]  - processed 500, run id=0
     [java]  - processed 1000, run id=0
     [java]  - processed 1500, run id=0
     [java]  - processed 2000, run id=0
     [java] End Time: Sun Nov 05 22:41:48 PST 2006
     [java] warm = Warm Index Reader
     [java] srch = Search Index
     [java] trav = Traverse Hits list, optionally retrieving document

     [java] # testData id       operation       runCnt  recCnt  rec/s   avgFreeMem      avgTotalMem
     [java] td-00_100_100       addDocument     1       2000    472.0321        4493681 22611558
     [java] td-00_100_100       optimize        1       1       2.857143        4229488 22716416
     [java] td-00_100_100       qd-0110-warm    1       2000    40000.0 4250992 22716416
     [java] td-00_100_100       qd-0110-srch    1       1       Infinity        4221288 22716416
...
     [java] td-00_100_100       qd-4110-srch    1       1       Infinity        3993624 22716416
     [java] td-00_100_100       qd-4110-trav    1       0       NaN     3993624 22716416
     [java] td-00_100_100       qd-4111-warm    1       2000    50000.0 3853192 22716416
...
BUILD SUCCESSFUL
Total time: 1 minute 0 seconds
----------------------------

I think the "Infinity" and "NaN" values are caused by an operation time too 
short for the per-second division (an elapsed time of 0).
This can be avoided by modifying getRate() in TimeData:
  public double getRate() {
    double rps = (double) count * 1000.0 / (double) (elapsed>0 ? elapsed : 1);
    return rps;
  }
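
To illustrate the effect of the guard, here is a minimal standalone sketch. The class RateSketch and the methods naiveRate and guardedRate are hypothetical names for illustration, not part of the benchmark code, and the unguarded formula is an assumption about the current getRate() that is consistent with the output above.

```java
// Hypothetical stand-in for the rate computation in TimeData; only the
// count and elapsed-milliseconds inputs are modeled here.
class RateSketch {
    // Assumed unguarded form: floating-point division by an elapsed time
    // that may be 0 when the operation finishes within the clock resolution.
    static double naiveRate(long count, long elapsedMillis) {
        return (double) count * 1000.0 / (double) elapsedMillis;
    }

    // Guarded form from the suggestion above: treat 0 ms as 1 ms.
    static double guardedRate(long count, long elapsedMillis) {
        return (double) count * 1000.0 / (double) (elapsedMillis > 0 ? elapsedMillis : 1);
    }

    public static void main(String[] args) {
        System.out.println(naiveRate(1, 0));        // Infinity (1000.0 / 0.0)
        System.out.println(naiveRate(0, 0));        // NaN (0.0 / 0.0)
        System.out.println(guardedRate(1, 0));      // 1000.0
        System.out.println(guardedRate(0, 0));      // 0.0
        System.out.println(guardedRate(2000, 50));  // 40000.0
    }
}
```

Note that the guard reports a 0 ms operation as if it took 1 ms, so the printed rate is a lower bound rather than a measurement.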

I very much like the logic of loading test data from the Web, and the scaleUp 
and maximumDocumentsToIndex params are handy. 

It seems that all the test logic and some of its data (queries) are coded in Java. 
I initially thought of a setup where we define parameterized tasks/jobs, 
like:

- createIndex(params)
- writeToIndex(params):
  - addDocs()
  - optimize()
- readFromIndex(params):
  - searchIndex()
  - fetchData()

...and compose a test with an XML file that says which of these simple jobs to 
run, with what params, in which order, serially or in parallel, how long/often, etc. 
Then creating a different test is as easy as creating a different XML file that 
configures that test. 

On the other hand, I know that chances are the most useful cases are those 
already defined here (standard and micro-standard), so one can ask "why bother 
changing this to define these building blocks?". I am not sure here, but thought 
I'd bring it up. 
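
As a rough sketch of the building-block idea, the jobs could share one small interface and be composed by generic "sequence" and "repeat" combinators, which an XML configuration would then map onto. Everything below (TaskSketch, Task, named, sequence, repeat, and the mergeFactor parameter map) is hypothetical and only records task names; it is not the benchmark's API and does no indexing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TaskSketch {
    // A parameterized unit of work. In this sketch a task only appends a
    // line to the log instead of touching an index.
    interface Task {
        void run(Map<String, String> params, List<String> log);
    }

    // Leaf task: a placeholder for jobs such as createIndex or addDocs.
    static Task named(String name) {
        return (params, log) -> log.add(name + " " + params);
    }

    // Composite: runs the children serially, in the given order.
    static Task sequence(Task... children) {
        return (params, log) -> {
            for (Task child : children) {
                child.run(params, log);
            }
        };
    }

    // Composite: runs one task a fixed number of times.
    static Task repeat(int times, Task task) {
        return (params, log) -> {
            for (int i = 0; i < times; i++) {
                task.run(params, log);
            }
        };
    }

    public static void main(String[] args) {
        // The structure an XML file might describe: create, write, then read twice.
        Task test = sequence(
                named("createIndex"),
                sequence(named("addDocs"), named("optimize")),
                repeat(2, sequence(named("searchIndex"), named("fetchData"))));

        List<String> log = new ArrayList<>();
        test.run(Map.of("mergeFactor", "10"), log);
        log.forEach(System.out::println);
    }
}
```

A parallel combinator and time-based repetition would follow the same pattern; they are left out to keep the sketch short.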

About using the driver: it seems nice and clean to me. I don't know the Digester, 
but it seems to read the config from the XML correctly.

Other comments:
1. I think there is a redundant call to params.showRunData(params.getId()) in 
runBenchmark(File,Options).
2. It seems that rec/s would be computed a bit more accurately by aggregating 
elapsed times (instead of rates) in showRunData().
3. If TimeData is not found (only memData), I think an additional 0.0 should be printed.
4. Column alignment with tabs and floats is imperfect. :-)
5. I think it would also be nice to get a summary of the results by "task", 
e.g. srch, optimize, something like:
     [java] # testData id     operation           runCnt     recCnt          rec/s       avgFreeMem      avgTotalMem
     [java]                   warm                    60       2000       42,628.8        8,235,758       23,048,192
     [java]                   srch                   120          1          571.4        8,300,613       23,048,192
     [java]                   optimize                 1          1            2.9        9,375,732       23,048,192
     [java]                   trav                   120        107       30,517.8        8,326,046       23,048,192
     [java]                   addDocument              1       2000          441.8        7,310,929       22,206,872
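
To make comment 2 concrete, the sketch below compares averaging per-run rates with dividing total records by total elapsed time. AggregateRateSketch and its methods are hypothetical names, the inputs are made-up numbers rather than benchmark results, and both methods assume non-empty arrays of equal length.

```java
class AggregateRateSketch {
    // Mean of per-run rates: every run weighs the same, however long it took.
    static double meanOfRates(long[] counts, long[] elapsedMillis) {
        double sum = 0.0;
        for (int i = 0; i < counts.length; i++) {
            sum += counts[i] * 1000.0 / Math.max(elapsedMillis[i], 1);
        }
        return sum / counts.length;
    }

    // Aggregated form: total records divided by total elapsed time.
    static double aggregateRate(long[] counts, long[] elapsedMillis) {
        long totalCount = 0;
        long totalElapsed = 0;
        for (int i = 0; i < counts.length; i++) {
            totalCount += counts[i];
            totalElapsed += elapsedMillis[i];
        }
        return totalCount * 1000.0 / Math.max(totalElapsed, 1);
    }

    public static void main(String[] args) {
        // Two runs of 1000 records each: one takes 1 s, the other 4 s.
        long[] counts = {1000, 1000};
        long[] elapsed = {1000, 4000};
        System.out.println(meanOfRates(counts, elapsed));   // 625.0
        System.out.println(aggregateRate(counts, elapsed)); // 400.0
    }
}
```

Here 2000 records took 5 seconds overall, so 400 rec/s is the actual throughput; the mean of rates overstates it because the fast run counts as much as the slow one.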

The attached timedata.zip has modified TimeData.java and TestData.java for [1 to 5] 
above, and for the NaN/Infinity issue. 
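
For readers without the attachment, a per-task summary like the one in comment 5 could be sketched roughly as follows. This is not the code in timedata.zip: TaskSummarySketch, taskOf, summarize, and formatRow are hypothetical names, the task is assumed to be the part of the operation name after the last '-', and the sketch sums records per task rather than reproducing the exact columns shown above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class TaskSummarySketch {
    // "qd-0110-srch" -> "srch"; names without '-' (e.g. "optimize") are kept whole.
    static String taskOf(String operation) {
        int dash = operation.lastIndexOf('-');
        return dash < 0 ? operation : operation.substring(dash + 1);
    }

    // Totals per task, in first-seen order: {runs, records, elapsedMillis}.
    static Map<String, long[]> summarize(String[] operations, long[] counts, long[] elapsedMillis) {
        Map<String, long[]> totals = new LinkedHashMap<>();
        for (int i = 0; i < operations.length; i++) {
            long[] t = totals.computeIfAbsent(taskOf(operations[i]), k -> new long[3]);
            t[0]++;
            t[1] += counts[i];
            t[2] += elapsedMillis[i];
        }
        return totals;
    }

    // Fixed-width columns instead of tabs, so rows stay aligned.
    static String formatRow(String task, long[] t) {
        double rate = t[1] * 1000.0 / Math.max(t[2], 1);
        return String.format(Locale.US, "%-16s %8d %10d %,12.1f", task, t[0], t[1], rate);
    }

    public static void main(String[] args) {
        // Made-up sample rows, not measured results.
        String[] ops = {"qd-0110-warm", "qd-0110-srch", "qd-0111-warm", "optimize"};
        long[] counts = {2000, 1, 2000, 1};
        long[] elapsed = {50, 0, 40, 350};
        for (Map.Entry<String, long[]> e : summarize(ops, counts, elapsed).entrySet()) {
            System.out.println(formatRow(e.getKey(), e.getValue()));
        }
    }
}
```

Using String.format with explicit widths also addresses comment 4, since the alignment no longer depends on tab stops or on how many digits a float prints.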

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, 
> extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip
>
>
> We need an objective way to measure the performance of Lucene, both indexing 
> and querying, on a known corpus. This issue is intended to collect comments 
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is 
> the original Reuters collection, available from 
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
> or 
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
>  I propose to use this corpus as a base for benchmarks. The benchmarking 
> suite could automatically retrieve it from known locations, and cache it 
> locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
