On Apr 3, 2006, at 6:57 AM, Yonik Seeley wrote:

A couple of points:
 - Are all the lucene variations using the same index parameters?
   max buffered docs, index format (compound or not), mergeFactor, etc
   I personally use non-compound index format, max buffered docs=1000,
   mergeFactor=10

     IndexWriter writer = new IndexWriter(indexDir,
       new WhitespaceAnalyzer(), true);
+    writer.setMaxBufferedDocs(1000);
+    writer.setUseCompoundFile(false);

I'll set Lucene to use the non-compound format. KinoSearch only supports the compound index format, but since it only writes one segment per indexing session, each file only gets rewritten once and that's not going to be much of a handicap. Plucene only uses the non-compound format.

KinoSearch doesn't have max_buffered_docs or merge_factor settings, since it uses a different merge model based on external sorting and serialized postings. Currently, it keeps track of the amount of memory consumed by the in-memory sort pool, and writes a run when that number hits 20 MB. Version 0.09_02 uses its own external sorting routine for the first time, so I can and probably should adapt it to use a max_buffered_docs variable, which it would need to poll a lot less frequently than the memory count. But that's an optimization for another day.
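
A minimal sketch of the flush-on-memory-threshold idea described in the paragraph above, written in Java to match the Lucene snippet. This is not KinoSearch's actual code: the class name SortPool, its methods, the use of plain strings as postings, and the 2-bytes-per-char size estimate are all assumptions made for illustration, and the "runs" are kept in a list rather than written to disk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical sketch: buffer postings in memory and emit a sorted run
 * once an estimated byte count reaches a threshold.
 */
public class SortPool {
    private final long maxBytes;   // flush threshold (20 MB in the text above)
    private long bytesUsed = 0;    // rough running estimate of the pool's size
    private final List<String> pool = new ArrayList<>();
    // Stands in for sorted runs written to disk.
    private final List<List<String>> runs = new ArrayList<>();

    public SortPool(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Add one posting; flush a sorted run if the estimate hits the threshold. */
    public void add(String posting) {
        pool.add(posting);
        // Crude estimate (assumption): 2 bytes per char, ignoring object overhead.
        bytesUsed += 2L * posting.length();
        if (bytesUsed >= maxBytes) {
            flushRun();
        }
    }

    /** Sort the in-memory pool, record it as a run, and reset the counters. */
    public void flushRun() {
        if (pool.isEmpty()) {
            return;
        }
        Collections.sort(pool);
        runs.add(new ArrayList<>(pool));
        pool.clear();
        bytesUsed = 0;
    }

    public int runCount() {
        return runs.size();
    }

    public List<String> run(int i) {
        return runs.get(i);
    }
}
```

A real indexer would finish with a final flushRun() and then merge the sorted runs into one segment. Switching the trigger from a byte count to a document count would only change the condition checked in add().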

Plucene is a Lucene 1.3 port, so it doesn't have max_buffered_docs -- but I can set merge_factor to 1000.

 - reading in the file line by line probably isn't the fastest (esp
   when you just construct another big string out of it).

I'm addressing this issue in my reply to Doug.

 - Java settings:
   - use the 1.5 JVM if possible, it's much faster than 1.4 in my experience

Interestingly, 1.5 produces slightly inferior results on my G4. (I know about the command line alias snafu, BTW: <http://www.cs.princeton.edu/introcs/11hello/mac.html>).

I'll include results from both 1.4 and 1.5. I'll also include results for a vanilla compile of Perl 5.8.8, which is definitely faster than the Perl 5.8.6 Apple ships with OS X Tiger.

   - use "-server", it's much faster than "-client"
   - use enough heap so too much time isn't taken in GC

Okeedoke.
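
One way to confirm which JVM and how much heap a benchmark run actually got is to print them at startup. The following is only a sketch: the class name JvmInfo is hypothetical, and the VM name string differs between vendors and versions, so it should be read by eye rather than parsed.

```java
/** Hypothetical helper: report the JVM version, VM name, and max heap. */
public class JvmInfo {
    public static String describe() {
        Runtime rt = Runtime.getRuntime();
        long maxHeapMb = rt.maxMemory() / (1024 * 1024);
        return System.getProperty("java.version") + " | "
            + System.getProperty("java.vm.name") + " | max heap "
            + maxHeapMb + " MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it as "java -server -Xmx512m JvmInfo" (the 512m figure is an arbitrary example, not a recommendation from this thread) shows whether the server VM and the requested heap size took effect.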

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

