Greets,
As mentioned in my previous post, the most significant architectural
difference between the Lucene/Plucene indexer and KinoSearch indexer
is the merge model. KinoSearch's merge model is considerably more
efficient in Perl; I suspect that it may also be incrementally more
efficient in Java, though it would take a fair amount of work to find
out.
When this new indexer based on KinoSearch indexes a document, it
writes stored fields and norms just as Lucene does. However, when
the document is inverted, instead of writing a mini-inverted-index,
the postings are serialized and dumped into a sort pool.
The serialization algorithm is designed so that, after a simple
lexical sort, the postings emerge from the sort pool in the ideal
order for writing an index. The concatenated components are:
1) Field number* [unsigned big-endian 16-bit integer]
2) term
3) document number [unsigned big-endian 32-bit integer]
4) positions [array (C, not Perl) of 32-bit integers]
5) term length [unsigned big-endian 16-bit integer]
* It is possible to use the field number because of
KinoSearch's requirement that all fields be defined
in advance; fields are sorted lexically prior to the
assignment of field numbers, so sorting by name and
number produce identical results.
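To illustrate the layout, here's a Python sketch (not KinoSearch's actual code -- it's Perl/C -- and the function name and sample data are mine). The point is that a plain bytewise sort of the concatenated strings yields field-then-term-then-document order:

```python
import struct

def serialize_posting(field_num, term, doc_num, positions):
    """Pack one posting into the sortable layout described above."""
    term_bytes = term.encode("utf-8")
    return (
        struct.pack(">H", field_num)             # 1) field number, big-endian u16
        + term_bytes                             # 2) term
        + struct.pack(">I", doc_num)             # 3) document number, big-endian u32
        + struct.pack(">%dI" % len(positions), *positions)  # 4) positions
        + struct.pack(">H", len(term_bytes))     # 5) term length, big-endian u16
    )

# Because field number, term, and doc number lead the string, a simple
# bytewise sort groups postings by field, then term, then document.
pool = [
    serialize_posting(1, "banana", 7, [3, 9]),
    serialize_posting(0, "apple", 2, [1]),
    serialize_posting(0, "apple", 1, [4, 8, 15]),
]
pool.sort()
```

The term length goes at the end rather than in front of the term so that it doesn't participate in the sort, yet still lets the decoder find the boundary between term and document number.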
After the sort is executed, the strings are fed into a while loop,
where the components are pulled apart (except for field number and
term, which remain united for now).
freq (the number of times the term appears in the document) is
determined by counting the number of elements in the positions array.
doc_freq is derived by incrementing a count over subsequent loop
iterations until the field-number-plus-term string changes. Since
each iteration represents one document, doc_freq is simply the number
of iterations that pass before the string changes.
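That decode loop can be sketched in Python as follows (again, the names and helper functions are mine, not KinoSearch's; `pack` just rebuilds the serialized layout from earlier for the demo):

```python
import struct

def pack(field_num, term, doc_num, positions):
    """Rebuild the serialized posting layout (demo helper)."""
    t = term.encode("utf-8")
    return (struct.pack(">H", field_num) + t
            + struct.pack(">I", doc_num)
            + struct.pack(">%dI" % len(positions), *positions)
            + struct.pack(">H", len(t)))

def decode(posting):
    """Pull the components apart; field number and term remain united."""
    (term_len,) = struct.unpack(">H", posting[-2:])
    field_and_term = posting[:2 + term_len]
    middle = posting[2 + term_len:-2]
    (doc_num,) = struct.unpack(">I", middle[:4])
    positions = struct.unpack(">%dI" % ((len(middle) - 4) // 4), middle[4:])
    return field_and_term, doc_num, positions

def iterate_terms(sorted_postings):
    """Yield (field_and_term, doc_freq, per-doc entries) per distinct term."""
    results, prev_key, docs = [], None, []
    for posting in sorted_postings:
        key, doc_num, positions = decode(posting)
        if prev_key is not None and key != prev_key:
            # doc_freq is the number of iterations before the key changed
            results.append((prev_key, len(docs), docs))
            docs = []
        prev_key = key
        # freq is just the number of elements in the positions array
        docs.append((doc_num, len(positions), positions))
    if prev_key is not None:
        results.append((prev_key, len(docs), docs))
    return results

pool = sorted([
    pack(0, "apple", 2, [1]),
    pack(0, "apple", 1, [4, 8]),
    pack(0, "zebra", 3, [2, 5, 9]),
])
terms = iterate_terms(pool)
```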
We now have all the elements needed for writing the tii, tis, frq,
and prx files.
Of course, there is a major obstacle which must be overcome with this
approach: you can only dump the serialized postings for a small
number of documents into the sort pool before you run out of RAM.
The answer is to implement an external sorting algorithm. I wrote
one, and contributed it to CPAN:
<http://search.cpan.org/search?query=sort+external>
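The idea behind external sorting -- not Sort::External's actual API, which you should consult directly -- can be sketched in Python: buffer items in RAM, flush each full buffer to disk as a sorted run, then lazily merge the runs.

```python
import heapq
import os
import struct
import tempfile

def flush_run(buf):
    """Sort one in-memory buffer and write it to disk as a run of
    length-prefixed records (postings are binary, so no line framing)."""
    buf.sort()
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for item in buf:
            f.write(struct.pack(">I", len(item)) + item)
    return path

def read_run(f):
    """Stream the records of one run back off disk."""
    while True:
        header = f.read(4)
        if not header:
            return
        (n,) = struct.unpack(">I", header)
        yield f.read(n)

def external_sort(items, max_in_memory=1000):
    """Accept an arbitrarily large stream, keep at most max_in_memory
    items in RAM at once, and yield everything back in sorted order."""
    runs, buf = [], []
    for item in items:
        buf.append(item)
        if len(buf) >= max_in_memory:
            runs.append(flush_run(buf))
            buf = []
    if buf:
        runs.append(flush_run(buf))
    files = [open(path, "rb") for path in runs]
    try:
        # heapq.merge reads one record at a time from each run, so peak
        # memory stays proportional to the number of runs, not the data.
        yield from heapq.merge(*(read_run(f) for f in files))
    finally:
        for f in files:
            f.close()
        for path in runs:
            os.remove(path)
```

Because the merge compares the serialized strings directly, the bounded-RAM sort preserves exactly the ordering that the posting layout was designed for.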
The KinoSearch merge model is much better for Perl because there's no
way to implement the Lucene merge model without zillions of objects
and method calls. The OO overhead for comparing serialized postings
is much lower, as the information encoded into the strings can be
compared lexically, so objects need not be rebuilt and sorted by
member variable.
In Java, OO overhead is less of a factor, but I suspect there are
still some gains to be had. There are other advantages as well: for
starters, norms and fields are written only once per segment (and
segments are written less often).
The crucial code resides at present in the write_postings() method
within the PostingsWriter module.
http://www.rectangular.com/cgi-bin/viewcvs.cgi/ksearch/lib/KinoSearch/Index/PostingsWriter.pm?rev=1.9&content-type=text/vnd.viewcvs-markup
I don't know whether this merge model will ever be of use to Java
Lucene, but it was suggested that I bring it up in this forum after I
described it on the Plucene list. I imagine that reproducing
Sort::External in Java would be straightforward, if its equivalent
does not already exist.
Food for thought, in any case.
Cheers,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/