If possible, you should try to use a larger corpus (eg Wikipedia)
rather than multiplying Reuters by N, which creates an unnatural
term-frequency distribution.
The graphs are hard to read because of the spline interpolation.
Maybe you could overlay X's where there is a real datapoint?
After the 6 rounds at each doc count, how do you then derive the
number to put on the graph?
It's best to use a real query log, if possible, to run the queries.
If you are expecting your production machines to have plenty of RAM to
hold the index, then you should definitely run the queries through
once and discard the results, so that everything gets loaded into RAM,
including the OS caching all required data in its IO cache.
Not opening/closing a reader per search should change the graphs quite
a bit (for the better) and hopefully change some of the odd things you
are seeing (in the questions below).
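Something like this sketch, against the 2.x API (the index contents are
invented, and I've used a RAMDirectory so it runs standalone; in your
case it would be the existing on-disk index):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class WarmSearch {
    // Build a small in-memory index so the sketch is self-contained.
    static RAMDirectory buildIndex(int numDocs) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        for (int i = 0; i < numDocs; i++) {
            Document doc = new Document();
            doc.add(new Field("body", "issue number " + i,
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.close();
        return dir;
    }

    public static void main(String[] args) throws Exception {
        RAMDirectory dir = buildIndex(1000);

        // Open the searcher ONCE and reuse it for every query; opening
        // and closing a reader per search throws away everything the
        // previous search loaded into memory.
        IndexSearcher searcher = new IndexSearcher(dir);
        TermQuery query = new TermQuery(new Term("body", "issue"));

        // Warm-up pass: run the query once and discard the result so the
        // index data is resident before timing starts.
        searcher.search(query, 10);

        // Timed pass: this is the number worth graphing.
        long t0 = System.nanoTime();
        TopDocs hits = searcher.search(query, 10);
        long micros = (System.nanoTime() - t0) / 1000;
        System.out.println(hits.totalHits + " hits in " + micros + " us");
        searcher.close();
    }
}
```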
Mike
Justus Pendleton wrote:
Howdy,
I have a couple of questions regarding some Lucene benchmarking and
what the results mean[3]. (Skip to the numbered list at the end if
you don't want to read the lengthy exegesis :)
I'm a developer for JIRA[1]. We are currently trying to get a better
understanding of Lucene, and our use of it, to cope with the needs
of our larger customers. These "large" indexes are only a couple
hundred thousand documents, but our problem is compounded by the fact
that they have a relatively high rate of modification (= delete +
insert of a new document), and our users expect these modifications to
show up in query results pretty much instantly.
Our current default behaviour is a merge factor of 4. We perform an
optimization on the index every 4000 additions. We also perform an
optimize at midnight. Our fundamental problem is that these
optimizations are locking the index for unacceptably long periods of
time, something that we want to resolve for our next major release,
hopefully without undermining search performance too badly.
In the Lucene javadoc there is a comment, and a link to a mailing
list discussion[2], that suggests applications such as JIRA should
never perform optimize but should instead set their merge factor
very low.
In an attempt to understand the impact of a) lowering the merge
factor from 4 to 2 and b) never, ever optimizing on an index (over
the course of years and millions of additions/updates) I wanted to
try to benchmark Lucene.
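Concretely, the knob in question is IndexWriter.setMergeFactor(). A
self-contained sketch against the 2.x API (index contents invented,
RAMDirectory used so it runs standalone):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class MergeFactorDemo {
    static int buildIndex(int numDocs, int mergeFactor) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        // A low merge factor makes Lucene merge segments more aggressively
        // as documents are added, keeping the segment count small so that
        // searches stay fast without an explicit optimize() call locking
        // the index. The minimum legal value is 2.
        writer.setMergeFactor(mergeFactor);

        for (int i = 0; i < numDocs; i++) {
            Document doc = new Document();
            doc.add(new Field("body", "document " + i,
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        // Note: no writer.optimize() anywhere.
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        int count = reader.numDocs();
        reader.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("docs=" + buildIndex(1000, 2));
    }
}
```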
I used the contrib/benchmark framework and wrote a small algorithm
that adds documents to an index (using the Reuters doc generator),
does a search, does an optimize, then does another search. All the
pretty pictures can be seen at:
http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
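Roughly, the .alg file looks like the skeleton below. This is a
simplified sketch based on the sample .alg files shipped with
contrib/benchmark, not my exact file, and the property and task names
may differ slightly between versions:

```text
# add docs, search, optimize, search again
merge.factor=4
directory=FSDirectory
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker

ResetSystemErase
CreateIndex
{ "AddDocs" AddDoc > : 10000
CloseIndex

OpenReader
{ "SearchPreOpt" Search > : 100
CloseReader

OpenIndex
Optimize
CloseIndex

OpenReader
{ "SearchPostOpt" Search > : 100
CloseReader

RepSumByName
```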
I have several questions, hopefully they aren't overwhelming in
their quantity :-/
1. Why does the merge factor of 4 appear to be faster than the merge
factor of 2?
2. Why does non-optimized searching appear to be faster than
optimized searching once the index hits ~500,000 documents?
3. There appears to be a fairly sizable performance drop across the
board around 450,000 documents. Why is that?
4. Searching performance appears to decrease towards a fairly
pessimistic 20 searches per second (for a relatively simple search).
Is this really what we should expect long-term from Lucene?
5. Does my benchmark even make sense? I am far from an expert on
benchmarking so it is possible I'm not measuring what I think I am
measuring.
Thanks in advance for any insight you can provide. This is an area
that we very much want to understand better as Lucene is a key part
of JIRA's success,
Cheers,
Justus
JIRA Developer
[1]: http://www.atlassian.com
[2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
[3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]