Greets,

As mentioned in my previous post, the most significant architectural difference between the Lucene/Plucene indexer and KinoSearch indexer is the merge model. KinoSearch's merge model is considerably more efficient in Perl; I suspect that it may also be incrementally more efficient in Java, though it would take a fair amount of work to find out.

When this new indexer based on KinoSearch indexes a document, it writes stored fields and norms just as Lucene does. However, when the document is inverted, instead of writing a mini-inverted-index, the postings are serialized and dumped into a sort pool.

The serialization algorithm is designed so that after a simple lexical sort, the postings emerge from the sort pool in the ideal order for writing an index. The concatenated components are:

    1) Field number* [unsigned big-endian 16-bit integer]
    2) term
    3) document number [unsigned big-endian 32-bit integer]
    4) positions [array (C, not Perl) of 32-bit integers]
    5) term length [unsigned big-endian 16-bit integer]

    * It is possible to use the field number because of
      KinoSearch's requirement that all fields be defined
      in advance; fields are sorted lexically prior to the
      assignment of field numbers, so sorting by name and
      number produce identical results.
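For illustration, here is a sketch in Python (rather than KinoSearch's Perl) of how a posting might be packed into that layout; the function name and sample values are hypothetical. Because the integers are big-endian, a plain bytewise sort orders the strings by field number, then term, then document number:

```python
import struct

def serialize_posting(field_num, term, doc_num, positions):
    """Pack one posting into a single bytestring whose natural
    (bytewise) sort order is field number, then term, then doc number."""
    term_bytes = term.encode("utf-8")
    return (
        struct.pack(">H", field_num)                          # 1) field number
        + term_bytes                                          # 2) term
        + struct.pack(">I", doc_num)                          # 3) doc number
        + b"".join(struct.pack(">I", p) for p in positions)   # 4) positions
        + struct.pack(">H", len(term_bytes))                  # 5) term length
    )

# A bytewise sort groups identical (field, term) runs together
# and orders the documents within each run:
postings = [
    serialize_posting(1, "banana", 7, [3, 9]),
    serialize_posting(0, "apple", 2, [0]),
    serialize_posting(0, "apple", 1, [4, 5, 6]),
]
postings.sort()
```

The trailing term length is what makes the string decodable again after the sort, since the term itself is variable-width.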

After the sort is executed, the strings are fed into a while loop, where the components are pulled apart (except for field number and term, which remain united for now).

freq (the number of times the term appears in the document) is determined by counting the number of elements in the positions array.

doc_freq is derived by incrementing a count over subsequent loops until the field-number-plus-term string changes. Since each iteration represents one document, the doc_freq is the number of loop iterations that pass by.

We now have all the elements needed for writing the tii, tis, frq, and prx files.
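The consuming loop described above can be sketched in Python (not the Perl original), assuming postings serialized as listed earlier: field number, term, doc number, positions, trailing term length. All names here are illustrative:

```python
import struct

def decode_posting(s):
    """Split one serialized posting back into its components.
    Layout: field(2) | term | doc(4) | positions(4 each) | term_len(2)."""
    term_len = struct.unpack(">H", s[-2:])[0]
    field_and_term = s[:2 + term_len]      # field number + term stay united
    body = s[2 + term_len:-2]
    doc_num = struct.unpack(">I", body[:4])[0]
    positions = [struct.unpack(">I", body[i:i + 4])[0]
                 for i in range(4, len(body), 4)]
    return field_and_term, doc_num, positions

def aggregate(sorted_postings):
    """Walk a lexically sorted posting stream, yielding one
    (field_and_term, doc_freq, [(doc, freq, positions), ...]) per term."""
    run_key, run = None, []
    for s in sorted_postings:
        key, doc, positions = decode_posting(s)
        if key != run_key:
            if run_key is not None:
                yield run_key, len(run), run   # doc_freq = iterations seen
            run_key, run = key, []
        run.append((doc, len(positions), positions))  # freq = # of positions
    if run_key is not None:
        yield run_key, len(run), run

# Two postings for the term "cat" in field 0, docs 1 and 2:
p1 = (struct.pack(">H", 0) + b"cat" + struct.pack(">I", 1)
      + struct.pack(">I", 4) + struct.pack(">I", 9) + struct.pack(">H", 3))
p2 = (struct.pack(">H", 0) + b"cat" + struct.pack(">I", 2)
      + struct.pack(">I", 0) + struct.pack(">H", 3))
result = list(aggregate(sorted([p1, p2])))
```

Each yielded run carries exactly what the tii/tis entry (term, doc_freq) and the frq/prx entries (per-doc freq and positions) need.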

Of course, there is a major obstacle which must be overcome with this approach: you can only dump the serialized postings for a small number of documents into the sort pool before you run out of RAM. The answer is to implement an external sorting algorithm. I wrote one, and contributed it to CPAN: <http://search.cpan.org/search?query=sort+external>
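The external-sort idea can be sketched in a few lines of Python (Sort::External itself is Perl; the names and the length-prefixed temp-file format here are illustrative): sort fixed-size runs in memory, spill each run to a temp file, then lazily k-way merge the runs.

```python
import heapq
import os
import tempfile

def _spill(sorted_run):
    """Write one sorted run of bytestrings to a temp file,
    each record prefixed with its 4-byte big-endian length."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for item in sorted_run:
            f.write(len(item).to_bytes(4, "big") + item)
    return path

def _read_run(f):
    """Lazily stream length-prefixed records back from a run file."""
    while True:
        header = f.read(4)
        if not header:
            return
        yield f.read(int.from_bytes(header, "big"))

def external_sort(items, max_in_memory=100_000):
    """Sort an arbitrarily large stream of bytestrings in bounded RAM."""
    runs, buf = [], []
    for item in items:
        buf.append(item)
        if len(buf) >= max_in_memory:
            runs.append(_spill(sorted(buf)))
            buf = []
    if buf:
        runs.append(_spill(sorted(buf)))
    files = [open(path, "rb") for path in runs]
    try:
        yield from heapq.merge(*(_read_run(f) for f in files))
    finally:
        for f, path in zip(files, runs):
            f.close()
            os.unlink(path)

# Tiny cache size forces multiple runs and an on-disk merge:
out = list(external_sort(iter([b"pear", b"apple", b"fig", b"cherry", b"date"]),
                         max_in_memory=2))
```

Memory use is bounded by max_in_memory plus one buffered record per run during the merge, so the sort pool can absorb postings for an entire segment's worth of documents.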

The KinoSearch merge model is much better for Perl because there's no way to implement the Lucene merge model without zillions of objects and method calls. The OO overhead for comparing serialized postings is much lower, as the information encoded into the strings can be compared lexically, so objects need not be rebuilt and sorted by member variable.

In Java, OO overhead is less of a factor, but I suspect there are still some gains to be had. There are other advantages as well: for starters, norms and stored fields are written only once per segment (and segments are written less often).

The crucial code resides at present in the write_postings() method within the PostingsWriter module.

http://www.rectangular.com/cgi-bin/viewcvs.cgi/ksearch/lib/KinoSearch/Index/PostingsWriter.pm?rev=1.9&content-type=text/vnd.viewcvs-markup

I don't know whether this merge model will ever be of use to Java Lucene, but it was suggested that I bring it up in this forum after I described it on the Plucene list. I imagine that reproducing Sort::External in Java would be straightforward, if its equivalent does not already exist.

Food for thought, in any case.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

