[ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]
Doron Cohen updated LUCENE-675:
-------------------------------
Attachment: timedata.zip
I tried it and it works nicely.
The first run downloaded the documents from the Web before starting to index.
The second run started right off, as the input docs were already in place. Great.
It seems the only output is what is printed to stdout, right?
I got something like this:
----------------------------
[echo] Working Directory: work
[java] Testing 4 different permutations.
[java] #-- ID: td-00_10_10, Sun Nov 05 22:40:49 PST 2006, heap=1065484288 --
[java] # source=work\reuters-out, [EMAIL PROTECTED]:\devoss\lucene\java\trunk\contrib\benchmark\work\index
[java] # maxBufferedDocs=10, mergeFactor=10, compound=true, optimize=true
[java] # Query data: R-reopen, W-warmup, T-retrieve, N-no
[java] # qd-0110 R W NT [body:salomon]
[java] # qd-0111 R W T [body:salomon]
[java] # qd-0100 R NW NT [body:salomon]
...
[java] # qd-14011 NR W T [body:fo*]
[java] # qd-14000 NR NW NT [body:fo*]
[java] # qd-14001 NR NW T [body:fo*]
[java] Start Time: Sun Nov 05 22:41:38 PST 2006
[java] - processed 500, run id=0
[java] - processed 1000, run id=0
[java] - processed 1500, run id=0
[java] - processed 2000, run id=0
[java] End Time: Sun Nov 05 22:41:48 PST 2006
[java] warm = Warm Index Reader
[java] srch = Search Index
[java] trav = Traverse Hits list, optionally retrieving document
[java] # testData id  operation     runCnt  recCnt  rec/s     avgFreeMem  avgTotalMem
[java] td-00_100_100  addDocument   1       2000    472.0321  4493681     22611558
[java] td-00_100_100  optimize      1       1       2.857143  4229488     22716416
[java] td-00_100_100  qd-0110-warm  1       2000    40000.0   4250992     22716416
[java] td-00_100_100  qd-0110-srch  1       1       Infinity  4221288     22716416
...
[java] td-00_100_100  qd-4110-srch  1       1       Infinity  3993624     22716416
[java] td-00_100_100  qd-4110-trav  1       0       NaN       3993624     22716416
[java] td-00_100_100  qd-4111-warm  1       2000    50000.0   3853192     22716416
...
BUILD SUCCESSFUL
Total time: 1 minute 0 seconds
----------------------------
I think the "Infinity" and "NaN" values appear when an operation finishes in
less than 1 ms, so the elapsed time is 0 and the rate computation divides by zero.
This can be avoided by modifying getRate() in TimeData:
    public double getRate() {
      // Treat an elapsed time of 0 ms as 1 ms to avoid dividing by zero.
      double rps = (double) count * 1000.0 / (double) (elapsed > 0 ? elapsed : 1);
      return rps;
    }
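To illustrate where the two values come from, here is a minimal sketch. RateDemo and its method names are made up for this example and are not part of the patch; the unguarded variant is only my assumption of what the current computation looks like, while the guarded variant uses the expression from the getRate() above.

```java
// Hypothetical stand-in for the rate computation in TimeData.
final class RateDemo {
    // Assumed current behavior: divides by elapsed even when it is 0 ms.
    static double unguardedRate(long count, long elapsed) {
        return (double) count * 1000.0 / (double) elapsed;
    }

    // Proposed behavior: an elapsed time of 0 ms is treated as 1 ms.
    static double guardedRate(long count, long elapsed) {
        return (double) count * 1000.0 / (double) (elapsed > 0 ? elapsed : 1);
    }

    public static void main(String[] args) {
        System.out.println(unguardedRate(1, 0)); // Infinity (1000.0 / 0.0)
        System.out.println(unguardedRate(0, 0)); // NaN (0.0 / 0.0)
        System.out.println(guardedRate(1, 0));   // 1000.0
        System.out.println(guardedRate(0, 0));   // 0.0
    }
}
```

This matches the output above: the srch rows with recCnt=1 show Infinity and the trav row with recCnt=0 shows NaN.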
I really like the logic of loading test data from the Web, and the scaleUp and
maximumDocumentsToIndex params are handy.
It seems that all the test logic and some of its data (queries) are coded in Java.
I initially thought of a setup where we define tasks/jobs that are
parameterized, like:
- createIndex(params)
- writeToIndex(params):
  - addDocs()
  - optimize()
- readFromIndex(params):
  - searchIndex()
  - fetchData()
and then compose a test with an XML file that says which of these simple jobs
to run, with what params, in which order, serially or in parallel, how long or
how often, etc.
Then creating a different test is as easy as creating a different XML file that
configures that test.
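A rough sketch of what I mean by composable jobs, in Java. Every type and name here (BenchTask, NamedTask, SerialTask, TaskDemo) is hypothetical and does not exist in the patch; the composition is written directly in code, where the idea above would read it from XML, and each job only records its name instead of touching an index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical building block: one parameterized job.
interface BenchTask {
    void run(List<String> log);
}

// A leaf job; it only logs its name, standing in for real index work.
final class NamedTask implements BenchTask {
    private final String name;

    NamedTask(String name) {
        this.name = name;
    }

    public void run(List<String> log) {
        log.add(name);
    }
}

// Runs its child jobs in order, repeating the whole sequence 'repeat' times.
final class SerialTask implements BenchTask {
    private final int repeat;
    private final List<BenchTask> children;

    SerialTask(int repeat, BenchTask... children) {
        this.repeat = repeat;
        this.children = Arrays.asList(children);
    }

    public void run(List<String> log) {
        for (int i = 0; i < repeat; i++) {
            for (BenchTask child : children) {
                child.run(log);
            }
        }
    }
}

final class TaskDemo {
    // Composes createIndex, then writeToIndex once, then readFromIndex twice.
    static List<String> runSample() {
        BenchTask write = new SerialTask(1, new NamedTask("addDocs"), new NamedTask("optimize"));
        BenchTask read = new SerialTask(2, new NamedTask("searchIndex"), new NamedTask("fetchData"));
        BenchTask test = new SerialTask(1, new NamedTask("createIndex"), write, read);
        List<String> log = new ArrayList<>();
        test.run(log);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(runSample());
    }
}
```

A parallel variant and timing per job would be further building blocks of the same shape; they are left out to keep the sketch short.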
On the other hand, I know that the most useful cases are probably those
already defined here, standard and micro-standard, so one can ask why bother
changing things to define these building blocks. I am not sure here, but
thought I'd bring it up.
About using the driver: it seems nice and clean to me. I don't know the Digester,
but it seems to read the config from the XML correctly.
Other comments:
1. I think there is a redundant call to params.showRunData(params.getId()) in
runBenchmark(File,Options).
2. It seems that rec/sec would be computed a bit more accurately by aggregating
elapsed times (instead of rates) in showRunData().
3. If TimeData is not found (only memData), I think an additional 0.0 should be
printed.
4. Column alignment with tabs and floats is imperfect. :-)
5. It would be nice, I think, to also get a summary of the results by "task",
e.g. srch, optimize, something like:
[java] # testData id  operation    runCnt  recCnt  rec/s     avgFreeMem  avgTotalMem
[java]                warm         60      2000    42,628.8  8,235,758   23,048,192
[java]                srch         120     1       571.4     8,300,613   23,048,192
[java]                optimize     1       1       2.9       9,375,732   23,048,192
[java]                trav         120     107     30,517.8  8,326,046   23,048,192
[java]                addDocument  1       2000    441.8     7,310,929   22,206,872
The attached timedata.zip has modified versions of TimeData.java and
TestData.java for items 1 to 5 above, and for the NaN/Infinity fix.
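To show why item 2 matters, here is a small worked example with made-up numbers. RateAggregation and its methods are hypothetical names for illustration and are not the code in timedata.zip; the sketch only compares averaging per-run rates against dividing the total count by the total elapsed time.

```java
// Hypothetical comparison of two ways to summarize rec/s over several runs.
final class RateAggregation {
    // Averages the per-run rates; short runs get as much weight as long ones.
    static double meanOfRates(long[] counts, long[] elapsedMs) {
        double sum = 0.0;
        for (int i = 0; i < counts.length; i++) {
            sum += counts[i] * 1000.0 / Math.max(elapsedMs[i], 1);
        }
        return sum / counts.length;
    }

    // Sums counts and elapsed times first, then computes a single rate.
    static double aggregatedRate(long[] counts, long[] elapsedMs) {
        long totalCount = 0;
        long totalElapsed = 0;
        for (int i = 0; i < counts.length; i++) {
            totalCount += counts[i];
            totalElapsed += elapsedMs[i];
        }
        return totalCount * 1000.0 / Math.max(totalElapsed, 1);
    }

    public static void main(String[] args) {
        // Two runs of 1000 records each: one took 1000 ms, the other 250 ms.
        long[] counts = {1000, 1000};
        long[] elapsedMs = {1000, 250};
        System.out.println(meanOfRates(counts, elapsedMs));    // 2500.0
        System.out.println(aggregatedRate(counts, elapsedMs)); // 1600.0
    }
}
```

In this example 2000 records took 1250 ms in total, which is 1600 rec/s; the mean of the two rates overstates it as 2500 rec/s.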
> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
> Key: LUCENE-675
> URL: http://issues.apache.org/jira/browse/LUCENE-675
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Andrzej Bialecki
> Assigned To: Grant Ingersoll
> Attachments: benchmark.patch, BenchmarkingIndexer.pm,
> extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip
>
>
> We need an objective way to measure the performance of Lucene, both indexing
> and querying, on a known corpus. This issue is intended to collect comments
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is
> the original Reuters collection, available from
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz
> or
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
> I propose to use this corpus as a base for benchmarks. The benchmarking
> suite could automatically retrieve it from known locations, and cache it
> locally.