On Jan 15, 2006, at 3:34 PM, Robert Kirchgessner wrote:
> There was even a patch to that problem:
> http://issues.apache.org/jira/browse/LUCENE-211
This is a large and somewhat hard-to-read patch, but some of it looks
familiar. It looks like he's concatenating the field name with the
token text, which is roughly the right idea, though you need to take
some precautions for field names of differing lengths, and I didn't
immediately spot those in the patch. (KinoSearch instead uses the
field number, which corresponds to the lexically sorted field name at
index time, encoded as a big-endian 16-bit int.)
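For concreteness, here's a minimal sketch in Java (hypothetical names,
not KinoSearch's actual code, and the UTF-8 choice for the text bytes
is my assumption) of prefixing a term's text with a big-endian 16-bit
field number, so terms sort by field first and then by text regardless
of how long the field names are:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class FieldPrefixedTerm {
        // Prepend the field number as a big-endian 16-bit int, then the
        // term text, so a plain byte-wise sort groups terms by field.
        static byte[] encode(int fieldNum, String termText) {
            byte[] text = termText.getBytes(StandardCharsets.UTF_8);
            ByteBuffer buf = ByteBuffer.allocate(2 + text.length);
            buf.putShort((short) fieldNum); // ByteBuffer is big-endian by default
            buf.put(text);
            return buf.array();
        }
    }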
The interesting thing to me is that it doesn't seem to feed an
external sorter. If I understand the concept correctly, he's feeding
a sortpool for minMergeDocuments documents, creating a small mini-
index (minMergeDocuments in size), and then falling back to the
primary merge model. Even if that isn't what the patch actually does,
the concept would still work, and it would be nice not to need an
external sorter.
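Roughly, I picture that buffer-then-flush idea looking something like
this (a hypothetical Java sketch, not the patch's actual code; the
class and method names are made up):

    import java.util.ArrayList;
    import java.util.List;

    // Buffer documents in RAM; every minMergeDocuments docs, sort the
    // buffered postings and write one small mini-index, then let the
    // primary merge model combine segments as usual -- no external sorter.
    class BufferedIndexer {
        private final int minMergeDocuments;
        private final List<String> buffered = new ArrayList<String>();

        BufferedIndexer(int minMergeDocuments) {
            this.minMergeDocuments = minMergeDocuments;
        }

        void addDocument(String doc) {
            buffered.add(doc);
            if (buffered.size() >= minMergeDocuments) {
                flushMiniIndex();
            }
        }

        private void flushMiniIndex() {
            // Sort in memory and write one small segment here.
            buffered.clear();
        }
    }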
> Yes, the binary format is fully compatible with that of Lucene, as
> is the read/write/search logic.
So...
* You use Sun's "Modified UTF-8" (not true UTF-8) to
encode character data.
* The VInt counts at the head of strings represent Java
chars, not Unicode code points or bytes (see the short
example after this list).
* You've run tests with source material containing
null bytes, Unicode characters outside the Basic
Multilingual Plane, and corrupt character data (e.g.,
broken UTF-8), and you are confident that indexes produced
by Lucene and PHPLucene from such data are mutually compatible.
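For what it's worth, here is a small self-contained Java illustration
of the first two points, using only the JDK (nothing Lucene-specific):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class EncodingDemo {
        public static void main(String[] args) throws IOException {
            // U+1D11E (outside the BMP) is a surrogate pair in Java.
            String s = "a\uD834\uDD1E";
            System.out.println(s.length());                                 // 3 Java chars
            System.out.println(s.codePointCount(0, s.length()));            // 2 code points
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 5 bytes of true UTF-8

            // Modified UTF-8 encodes U+0000 as the two bytes 0xC0 0x80
            // rather than a single 0x00; writeUTF also prepends a 2-byte
            // length (0x00 0x02 here).
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF("\u0000");
            for (byte b : bos.toByteArray()) {
                System.out.printf("%02X ", b);   // prints: 00 02 C0 80
            }
            System.out.println();
        }
    }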
> By the way, though the project emerged as a Lucene implementation
> in PHP, I soon switched to writing a pure C library with a binding
> to PHP. Now it's mostly a C project.
KinoSearch has taken a similar path of late, adding more and more XS
(Perl's C API).
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/