Re: Benchmarkers

Marvin Humphrey Mon, 03 Apr 2006 17:02:37 -0700


On Apr 3, 2006, at 10:36 AM, Doug Cutting wrote:

Marvin Humphrey wrote:
    IndexWriter writer = new IndexWriter(indexDir,
      new WhitespaceAnalyzer(), true);
Please make sure that analyzers are comparable between the various engines you benchmark. WhitespaceAnalyzer is efficient, but results in far more tokens and terms than, e.g., StopAnalyzer (alphabetic character sequences, lowercased, with a 35-word English stop list).

They're all using WhitespaceAnalyzer or the equivalent. KinoSearch doesn't offer that class per se, but its Tokenizer class allows you to specify an arbitrary regex matching one token.


    # a WhitespaceAnalyzer in KinoSearch
    my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
        token_re => qr/\S+/,
    );
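
For a rough sense of how much the analyzer choice can change token counts, here is a library-free Java sketch. It does not use Lucene or KinoSearch: the class name, the method names, and the abbreviated stop list are assumptions made up for illustration, and the stop list is not the one that ships with StopAnalyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenCountDemo {
    // Hypothetical, abbreviated stop list for illustration only;
    // NOT the list that ships with Lucene's StopAnalyzer.
    static final Set<String> STOP_WORDS = new HashSet<String>(
        Arrays.asList("a", "an", "and", "the", "of", "to", "in", "is"));

    // Whitespace-style tokenizing: every run of non-whitespace is a token.
    static int countWhitespaceTokens(String text) {
        Matcher m = Pattern.compile("\\S+").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Stop-style tokenizing: runs of letters, lowercased, stop words dropped.
    static int countStopFilteredTokens(String text) {
        Matcher m = Pattern.compile("\\p{L}+").matcher(text);
        int count = 0;
        while (m.find()) {
            if (!STOP_WORDS.contains(m.group().toLowerCase())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "The size of the index is 42 MB in 2006";
        System.out.println(countWhitespaceTokens(text));   // prints 10
        System.out.println(countStopFilteredTokens(text)); // prints 3
    }
}
```

The gap will vary with the text. Input with a lot of embedded punctuation can even split into more letter-only tokens than whitespace tokens, so this is only an indication of why the analyzers need to match across engines.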

Since tokens and terms are the atoms and elements of indexing, their counts are a dominant factor in performance.
Raising IndexWriter.setMaxBufferedDocs() to 100 or more will increase indexing speed by using more Java heap. 10 is the default. IndexWriter.setUseCompoundFile(false) will also increase indexing speed. I don't think increasing IndexWriter.setMergeFactor() should help much, and I advise staying with the default (10). Folks used to set this as a surrogate for setMaxBufferedDocs before that was a separate parameter.


I'm addressing these issues in my reply to Yonik.

You may need to specify a larger Java heap, with something like -Xmx500M. The default is around 64MB.


Great, I'll use -Xmx500M.
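
A quick way to confirm that the flag took effect is to ask the JVM for its heap ceiling. This is a minimal sketch; the class and method names are my own, and the only API it relies on is Runtime.maxMemory().

```java
class HeapCheck {
    // Maximum heap the JVM will attempt to use, in whole megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it with the same options as the benchmark, e.g. "java -Xmx500M HeapCheck". The reported figure may come out somewhat below 500, depending on the JVM.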

Also, the -server option is almost always faster with Sun's JVM. Sun's 1.5 JVM is faster than their 1.4 JVM. I think IBM's JVM may be generally faster for indexing. The last I checked, one was faster for indexing and the other for searching, but I'm not certain which was which.


I'm running these on my G4 laptop.

  private Document nextDoc(File f) throws Exception {
    // the title is the first line, the body is the rest
    BufferedReader br = new BufferedReader(new FileReader(f));
    String title;
    if ( (title = br.readLine()) == null)
      throw new Exception("Failed to read title");
    StringBuffer buf = new StringBuffer();
    String str;
    while ( (str = br.readLine()) != null )
      buf.append( str );
    br.close();
    String body = buf.toString();
    // add title and body to doc
    Document doc = new Document();
    Field titleField = new Field("title", title,
      Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
    Field bodyField = new Field("body", body,
      Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
    doc.add(titleField);
    doc.add(bodyField);


You can avoid some buffering by passing a Reader for the body text:

  Field bodyField = new Field("body", br,
    Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);

The only rub is that you'll have to make sure that the FileReader is closed. So you could rewrite this method to be something like:


  private void indexFile(File f, IndexWriter writer) {
    BufferedReader br = new BufferedReader(new FileReader(f));
    try {
      Document doc = new Document();

      ... read title from br and add it to doc ...

      Field bodyField = new Field("body", br,
        Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
      doc.add(bodyField);

      writer.addDocument(doc);
    } finally {
      br.close();
    }
  }

Does that make sense?

It does. However, there's no constructor for Field which allows Field.Store.YES, but uses a Reader instead of a String.
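
The reading pattern itself can be sketched without Lucene: pull the title off the first line, hand the same reader on for the body, and close it in a finally block. In this sketch consumeBody() is a hypothetical stand-in for whatever consumes the Reader (the indexer, in the real app), and the class and method names are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class TitleThenBodyDemo {
    // Hypothetical stand-in for the indexer: drains the reader in chunks
    // and returns the number of characters seen, without building a String.
    static int consumeBody(Reader body) throws IOException {
        char[] chunk = new char[1024];
        int total = 0;
        int n;
        while ((n = body.read(chunk)) != -1) {
            total += n;
        }
        return total;
    }

    // Reads the title from the first line, streams the rest as the body,
    // and closes the reader whether or not an exception is thrown.
    static String describe(Reader source) throws IOException {
        BufferedReader br = new BufferedReader(source);
        try {
            String title = br.readLine();
            if (title == null) {
                throw new IOException("Failed to read title");
            }
            int bodyChars = consumeBody(br);
            return title + ":" + bodyChars;
        } finally {
            br.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Reader doc = new StringReader("My Title\nline one\nline two\n");
        System.out.println(describe(doc)); // prints My Title:18
    }
}
```

The reader has to stay open until the consumer has finished with it, which is why the close sits in the finally block after the body has been handed off.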

Finally, I question your use of Field.Store.YES. Do you really want to use Lucene to store the full content of your documents?

Yes. By default, all fields in KinoSearch are analyzed, stored, and vectorized (with positions and offsets). This allows use of the Highlighter with minimum fuss. Savvy users looking to shrink the size of their indexes can override those defaults.

I'd originally omitted TermVectors from the benchmarking apps because Plucene doesn't have them. But having KinoSearch and Lucene generate them isn't going to slow them down enough that Plucene will become competitive. It makes sense to generate two result sets, one with the body stored and vectored, and one with the body neither stored nor vectored. I'll use the Reader constructor for Lucene's unstored version.

  Are you asking this of the other engines?

They were all even before. Now Plucene will have a slight advantage in one config.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

