Greets,
As mentioned in my previous post, the most significant architectural
difference between the Lucene/Plucene indexer and KinoSearch indexer
is the merge model. KinoSearch's merge model is considerably more
efficient in Perl; I suspect that it may also be incrementally more
efficient in Java, though it would take a fair amount of work to find
out.
When this new indexer based on KinoSearch indexes a document, it
writes stored fields and norms just as Lucene does. However, when
the document is inverted, instead of writing a mini-inverted-index,
the postings are serialized and dumped into a sort pool.
The serialization algorithm is designed so that, after a simple
lexical sort, the postings emerge from the sort pool in the ideal
order for writing an index. The concatenated components are:
1) Field number* [unsigned big-endian 16-bit integer]
2) term
3) document number [unsigned big-endian 32-bit integer]
4) positions [array (C, not Perl) of 32-bit integers]
5) term length [unsigned big-endian 16-bit integer]
* It is possible to use the field number because of
KinoSearch's requirement that all fields be defined
in advance; fields are sorted lexically prior to the
assignment of field numbers, so sorting by name and
number produce identical results.
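To illustrate the layout, here's a Python sketch (not KinoSearch's actual code -- it's Perl/C -- and the function name and sample data are mine). The point is that a plain bytewise sort of the concatenated strings yields field-then-term-then-document order:

```python
import struct

def serialize_posting(field_num, term, doc_num, positions):
    """Pack one posting into the sortable layout described above."""
    term_bytes = term.encode("utf-8")
    return (
        struct.pack(">H", field_num)             # 1) field number, big-endian u16
        + term_bytes                             # 2) term
        + struct.pack(">I", doc_num)             # 3) document number, big-endian u32
        + struct.pack(">%dI" % len(positions), *positions)  # 4) positions
        + struct.pack(">H", len(term_bytes))     # 5) term length, big-endian u16
    )

# Because field number, term, and doc number lead the string, a simple
# bytewise sort groups postings by field, then term, then document.
pool = [
    serialize_posting(1, "banana", 7, [3, 9]),
    serialize_posting(0, "apple", 2, [1]),
    serialize_posting(0, "apple", 1, [4, 8, 15]),
]
pool.sort()
```

The term length goes at the end rather than in front of the term so that it doesn't participate in the sort, yet still lets the decoder find the boundary between term and document number.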
After the sort is executed, the strings are fed into a while loop,
where the components are pulled apart (except for field number and
term, which remain united for now).
freq (the number of times the term appears in the document) is
determined by counting the number of elements in the positions array.
doc_freq is derived by incrementing a count over subsequent loop
iterations until the field-number-plus-term string changes. Since
each iteration represents one document, doc_freq is simply the number
of iterations that pass before the string changes.
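That decode loop can be sketched in Python as follows (again, the names and helper functions are mine, not KinoSearch's; `pack` just rebuilds the serialized layout from earlier for the demo):

```python
import struct

def pack(field_num, term, doc_num, positions):
    """Rebuild the serialized posting layout (demo helper)."""
    t = term.encode("utf-8")
    return (struct.pack(">H", field_num) + t
            + struct.pack(">I", doc_num)
            + struct.pack(">%dI" % len(positions), *positions)
            + struct.pack(">H", len(t)))

def decode(posting):
    """Pull the components apart; field number and term remain united."""
    (term_len,) = struct.unpack(">H", posting[-2:])
    field_and_term = posting[:2 + term_len]
    middle = posting[2 + term_len:-2]
    (doc_num,) = struct.unpack(">I", middle[:4])
    positions = struct.unpack(">%dI" % ((len(middle) - 4) // 4), middle[4:])
    return field_and_term, doc_num, positions

def iterate_terms(sorted_postings):
    """Yield (field_and_term, doc_freq, per-doc entries) per distinct term."""
    results, prev_key, docs = [], None, []
    for posting in sorted_postings:
        key, doc_num, positions = decode(posting)
        if prev_key is not None and key != prev_key:
            # doc_freq is the number of iterations before the key changed
            results.append((prev_key, len(docs), docs))
            docs = []
        prev_key = key
        # freq is just the number of elements in the positions array
        docs.append((doc_num, len(positions), positions))
    if prev_key is not None:
        results.append((prev_key, len(docs), docs))
    return results

pool = sorted([
    pack(0, "apple", 2, [1]),
    pack(0, "apple", 1, [4, 8]),
    pack(0, "zebra", 3, [2, 5, 9]),
])
terms = iterate_terms(pool)
```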
We now have all the elements needed for writing the tii, tis, frq,
and prx files.
Of course, there is a major obstacle which must be overcome with this
approach: you can only dump the serialized postings for a small
number of documents into the sort pool before you run out of RAM.
The answer is to implement an external sorting algorithm. I wrote
one, and contributed it to CPAN:
<http://search.cpan.org/search?query=sort+external>
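The idea behind external sorting -- not Sort::External's actual API, which you should consult directly -- can be sketched in Python: buffer items in RAM, flush each full buffer to disk as a sorted run, then lazily merge the runs.

```python
import heapq
import os
import struct
import tempfile

def flush_run(buf):
    """Sort one in-memory buffer and write it to disk as a run of
    length-prefixed records (postings are binary, so no line framing)."""
    buf.sort()
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for item in buf:
            f.write(struct.pack(">I", len(item)) + item)
    return path

def read_run(f):
    """Stream the records of one run back off disk."""
    while True:
        header = f.read(4)
        if not header:
            return
        (n,) = struct.unpack(">I", header)
        yield f.read(n)

def external_sort(items, max_in_memory=1000):
    """Accept an arbitrarily large stream, keep at most max_in_memory
    items in RAM at once, and yield everything back in sorted order."""
    runs, buf = [], []
    for item in items:
        buf.append(item)
        if len(buf) >= max_in_memory:
            runs.append(flush_run(buf))
            buf = []
    if buf:
        runs.append(flush_run(buf))
    files = [open(path, "rb") for path in runs]
    try:
        # heapq.merge reads one record at a time from each run, so peak
        # memory stays proportional to the number of runs, not the data.
        yield from heapq.merge(*(read_run(f) for f in files))
    finally:
        for f in files:
            f.close()
        for path in runs:
            os.remove(path)
```

Because the merge compares the serialized strings directly, the bounded-RAM sort preserves exactly the ordering that the posting layout was designed for.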
The KinoSearch merge model is much better for Perl because there's no
way to implement the Lucene merge model without zillions of objects
and method calls. The OO overhead for comparing serialized postings
is much lower, as the information encoded into the strings can be
compared lexically, so objects need not be rebuilt and sorted by
member variable.
In Java, OO overhead is less of a factor, but I suspect there are
still some gains to be had. There are other advantages as well: for
starters, norms and fields are written only once per segment (and
segments are written less often).
The crucial code resides at present in the write_postings() method
within the PostingsWriter module.
http://www.rectangular.com/cgi-bin/viewcvs.cgi/ksearch/lib/KinoSearch/Index/PostingsWriter.pm?rev=1.9&content-type=text/vnd.viewcvs-markup
I don't know whether this merge model will ever be of use to Java
Lucene, but it was suggested that I bring it up in this forum after I
described it on the Plucene list. I imagine that reproducing
Sort::External in Java would be straightforward, if its equivalent
does not already exist.
Food for thought, in any case.
Cheers,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/