On Apr 3, 2006, at 10:36 AM, Doug Cutting wrote:
Marvin Humphrey wrote:
IndexWriter writer = new IndexWriter(indexDir,
new WhitespaceAnalyzer(), true);
Please make sure that analyzers are comparable between the various
engines you benchmark. WhitespaceAnalyzer is efficient, but
results in far more tokens and terms than, e.g., StopAnalyzer
(alphabetic character sequences, lowercased, with a 35-word English
stop list).
They're all using WhitespaceAnalyzer or the equivalent. KinoSearch
doesn't offer that class per se, but its Tokenizer class allows you
to specify an arbitrary regex matching one token.
# a WhitespaceAnalyzer in KinoSearch
my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
token_re => qr/\S+/,
);
Since tokens and terms are the atoms and elements of indexing,
their counts are a dominant factor in performance.
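
To make the token-count point concrete, here is a minimal sketch in plain Java (no Lucene dependency) that counts tokens for the same text under a whitespace-style rule and under a stop-analyzer-style rule. The class name TokenCountSketch, the six-word stop list, and the [A-Za-z]+ pattern are assumptions for illustration only; they approximate, but are not, Lucene's WhitespaceAnalyzer and StopAnalyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenCountSketch {
    // Hypothetical abbreviated stop list; NOT Lucene's actual English stop list.
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("a", "and", "the", "of", "to", "is"));

    // Whitespace-style rule: every run of non-whitespace is one token,
    // the same idea as the qr/\S+/ regex in the KinoSearch snippet.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Stop-style rule: alphabetic runs, lowercased, with stop words dropped.
    static List<String> stopTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = Pattern.compile("[A-Za-z]+").matcher(text);
        while (m.find()) {
            String t = m.group().toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The cat and the Cat sat.";
        System.out.println("whitespace tokens: " + whitespaceTokens(text).size());
        System.out.println("stop-style tokens: " + stopTokens(text).size());
    }
}
```

For the sample sentence, the whitespace rule keeps six tokens (including "Cat" and "sat." as distinct terms), while the stop-style rule keeps three tokens and two distinct terms. A benchmark that mixes the two rules across engines would be measuring different workloads.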
Increasing IndexWriter.setMaxBufferedDocs() to 100 or more will
increase indexing speed by using more Java heap. 10 is the
default. IndexWriter.setUseCompoundFile(false) will also increase
indexing speed. I don't think increasing IndexWriter.setMergeFactor()
should help much, and I advise staying with the default (10).
Folks used to set this as a surrogate for setMaxBufferedDocs before
that was a separate parameter.
I'm addressing these issues in my reply to Yonik.
You may need to specify a larger Java heap, with something like
-Xmx500M. The default is around 64MB.
Great, I'll use -Xmx500M.
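
One way to confirm that the heap setting actually reached the benchmark process is to print the JVM's reported maximum heap at startup. This is a small sketch; the class name HeapCheck is an assumption for illustration, and the reported figure may come out somewhat below the -Xmx value depending on the JVM.

```java
class HeapCheck {
    // Maximum heap the JVM will attempt to use, in whole megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as `java -Xmx500M HeapCheck` and again with no flag shows whether the option is being picked up.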
Also, the -server option is almost always faster with Sun's JVM.
Sun's 1.5 JVM is faster than their 1.4 JVM. I think IBM's JVM may
be generally faster for indexing. The last I checked, one was
faster for indexing and the other for searching, but I'm not
certain which was which.
I'm running these on my G4 laptop.
private Document nextDoc(File f) throws Exception {
    // the title is the first line, the body is the rest
    BufferedReader br = new BufferedReader(new FileReader(f));
    String title;
    if ( (title = br.readLine()) == null)
        throw new Exception("Failed to read title");
    StringBuffer buf = new StringBuffer();
    String str;
    while ( (str = br.readLine()) != null )
        buf.append( str );
    br.close();
    String body = buf.toString();
    // add title and body to doc
    Document doc = new Document();
    Field titleField = new Field("title", title,
        Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
    Field bodyField = new Field("body", body,
        Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
    doc.add(titleField);
    doc.add(bodyField);
You can avoid some buffering by passing a Reader for the body text:
Field bodyField = new Field("body", br,
Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
The only rub is that you'll have to make sure that the FileReader
is closed. So you could rewrite this method to be something like:
private void indexFile(File f, IndexWriter writer) {
    BufferedReader br = new BufferedReader(new FileReader(f));
    try {
        Document doc = new Document();
        ... read title from br and add it to doc ...
        Field bodyField = new Field("body", br,
            Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
        doc.add(bodyField);
        writer.addDocument(doc);
    } finally {
        br.close();
    }
}
Does that make sense?
It does. However, there's no constructor for Field which allows
Field.Store.YES, but uses a Reader instead of a String.
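
The Reader hand-off itself can be tried without Lucene. The sketch below reads the title line from a BufferedReader, then passes the same reader on so that only the remaining body text is consumed, and closes it in a finally block. The class name TitleThenBody and the drain() helper are hypothetical; drain() merely stands in for whatever would consume the Reader and is not a Lucene call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class TitleThenBody {
    // Hypothetical stand-in for a consumer of the Reader (not a Lucene API).
    static String drain(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Returns {title, body}. The title is the first line; the body is
    // whatever the reader still holds after that line has been read.
    static String[] split(Reader source) throws IOException {
        BufferedReader br = new BufferedReader(source);
        try {
            String title = br.readLine();
            if (title == null) {
                throw new IOException("Failed to read title");
            }
            String body = drain(br);
            return new String[] { title, body };
        } finally {
            br.close(); // always release the underlying reader
        }
    }

    // Convenience wrapper for in-memory text; rethrows unchecked.
    static String[] splitString(String text) {
        try {
            return split(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the body is consumed directly from the reader, line terminators are preserved, unlike the readLine() loop in nextDoc, which drops them when concatenating.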
Finally, I question your use of Field.Store.YES. Do you really need to
use Lucene to store the full content of your documents?
Yes. By default, all fields in KinoSearch are analyzed, stored, and
vectorized (with positions and offsets). This allows use of the
Highlighter with minimum fuss. Savvy users looking to shrink the
size of their indexes can override those defaults.
I'd originally omitted TermVectors from the benchmarking apps because
Plucene doesn't have them. But having KinoSearch and Lucene generate
them isn't going to slow them down enough that Plucene will become
competitive. It makes sense to generate two result sets, one with
the body stored and vectored, and one with the body neither stored
nor vectored. I'll use the Reader constructor for Lucene's unstored
version.
Are you asking this of the other engines?
They were all even before. Now Plucene will have a slight advantage
in one config.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/