On Apr 3, 2006, at 6:57 AM, Yonik Seeley wrote:
A couple of points:
- Are all the lucene variations using the same index parameters?
max buffered docs, index format (compound or not), mergeFactor, etc
I personally use non-compound index format, max buffered docs=1000,
mergeFactor=10
IndexWriter writer = new IndexWriter(indexDir,
new WhitespaceAnalyzer(), true);
+ writer.setMaxBufferedDocs(1000);
+ writer.setUseCompoundFile(false);
I'll set Lucene to use the non-compound format. KinoSearch only
supports the compound index format, but since it only writes one
segment per indexing session, each file only gets rewritten once and
that's not going to be much of a handicap. Plucene only uses the non-
compound format.
KinoSearch doesn't have max_buffered_docs or merge_factor settings,
since it uses a different merge model based on external sorting and
serialized postings. Currently, it keeps track of the amount of
memory consumed by the in-memory sort pool, and writes a run when
that number hits 20 MB. Version 0.09_02 uses its own external
sorting routine for the first time, so I can and probably should
adapt it to use a max_buffered_docs variable, which it will need to poll
a lot less frequently. But that's an optimization for another day.
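To make the flush-on-threshold idea concrete, here is a minimal Java sketch. It is not KinoSearch's actual code: the SortPool class, the thresholdBytes parameter, the two-bytes-per-char memory estimate, and the in-memory list standing in for run files on disk are all assumptions made for illustration. The final merge of the sorted runs is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical sketch of an in-memory sort pool that writes a sorted
 * "run" whenever its estimated memory use reaches a threshold.
 */
public class SortPool {
    private final long thresholdBytes;
    private final List<String> pool = new ArrayList<String>();
    // Stand-in for run files on disk; a real external sort would write these out.
    private final List<List<String>> runs = new ArrayList<List<String>>();
    private long bytesUsed = 0;

    public SortPool(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    /** Add one serialized posting and flush if the pool has grown too large. */
    public void add(String posting) {
        pool.add(posting);
        // Rough estimate only: 2 bytes per char, ignoring object overhead.
        bytesUsed += 2L * posting.length();
        if (bytesUsed >= thresholdBytes) {
            flushRun();
        }
    }

    /** Sort the pool, record it as a run, and reset the memory counter. */
    public void flushRun() {
        if (pool.isEmpty()) {
            return;
        }
        Collections.sort(pool);
        runs.add(new ArrayList<String>(pool));
        pool.clear();
        bytesUsed = 0;
    }

    public int runCount() {
        return runs.size();
    }

    public List<String> run(int i) {
        return runs.get(i);
    }
}
```

With a 20 MB limit this would be constructed as new SortPool(20L * 1024 * 1024). Note that the size check happens on every add(), which is the frequent polling that a document-count trigger would avoid.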
Plucene is a Lucene 1.3 port, so it doesn't have max_buffered_docs --
but I can set merge_factor to 1000.
- reading in the file line by line probably isn't the fastest (esp
when you just construct another big string out of it).
I'm addressing this issue in my reply to Doug.
- Java settings:
- use the 1.5 JVM if possible, it's much faster than 1.4 in my
experience
Interestingly, 1.5 produces slightly inferior results on my G4. (I
know about the command line alias snafu, BTW:
<http://www.cs.princeton.edu/introcs/11hello/mac.html>).
I'll include results from both 1.4 and 1.5. I'll also include
results for a vanilla compile of Perl 5.8.8, which is definitely
faster than the Perl 5.8.6 Apple ships with OS X Tiger.
- use "-server", it's much faster than "-client"
- use enough heap so too much time isn't taken in GC
Okeedoke.
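One way to confirm that a benchmark run really picked up those settings is to have the JVM report them. The sketch below is only an illustration: the JvmCheck class name is hypothetical, and the exact text of the VM name varies by vendor and version.

```java
/** Hypothetical helper that reports which VM and heap limit are in effect. */
public class JvmCheck {
    public static String describe() {
        // Standard system properties; on HotSpot the VM name usually says "Server" or "Client".
        String vm = System.getProperty("java.vm.name");
        String version = System.getProperty("java.version");
        // Maximum heap the JVM will attempt to use, in megabytes.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        return "vm=" + vm + " version=" + version + " maxHeapMB=" + maxHeapMb;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it with the same flags as the benchmark, for example java -server -Xmx256m JvmCheck (the 256m figure is an arbitrary placeholder), shows whether the server VM and the larger heap are actually being used.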
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/