Re: possible segment merge improvement?

Marvin Humphrey Thu, 01 Nov 2007 08:58:34 -0800


On Nov 1, 2007, at 3:04 AM, Michael McCandless wrote:

> In KinoSearch, merging of stored fields & term vectors is always a
> fast concatenation of the entry for that document, whereas Lucene must
> re-interpret/re-number all fields on the doc, in general.  In fact I
> think that KinoSearch stores field names directly in the index (ie,
> not numbers).

Yes, that's right. <http://xrl.us/73dx> (Link to mail-archives.apache.org)

Ferret and KS had both previously implemented Robert's suggested mod, where no remaps take place if field numbers can be matched up. KS also expended extra effort to keep field numbers consistent (and I think Ferret did too) -- but the possibility that we would have to remap couldn't ever be eliminated.

Going with field names rather than numbers allowed KS to eliminate a big chunk of code. For the price of a small increase in index size, the segment merging process for stored fields and term vectors got much simpler. No more parsing, no more remapping -- it became possible to read the record naively as one chunk and copy it, no matter what.
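
To make the "read the record as one chunk and copy it" idea concrete, here is a minimal sketch in Python. The record layout (length-prefixed name/value pairs) and all function names are assumptions made for illustration; this is not the actual KinoSearch or Lucene file format.

```python
import struct

# Hypothetical record layout: for each field, a 2-byte name length and a
# 4-byte value length (big-endian), followed by the name and value bytes.
# Because the field NAME is stored inline, a record is self-describing.
HEADER = ">HI"
HEADER_SIZE = struct.calcsize(HEADER)

def encode_doc(fields):
    """Serialize a {name: value} dict into one self-describing record."""
    out = bytearray()
    for name, value in fields.items():
        n, v = name.encode("utf-8"), value.encode("utf-8")
        out += struct.pack(HEADER, len(n), len(v)) + n + v
    return bytes(out)

def decode_doc(record):
    """Parse one record back into a {name: value} dict."""
    fields, pos = {}, 0
    while pos < len(record):
        nlen, vlen = struct.unpack_from(HEADER, record, pos)
        pos += HEADER_SIZE
        name = record[pos:pos + nlen].decode("utf-8")
        pos += nlen
        fields[name] = record[pos:pos + vlen].decode("utf-8")
        pos += vlen
    return fields

def build_segment(docs):
    """Return (data, offsets): record bytes plus each record's start offset."""
    data, offsets = bytearray(), []
    for doc in docs:
        offsets.append(len(data))
        data += encode_doc(doc)
    return bytes(data), offsets

def raw_record(segment, doc_id):
    """Slice out one record's bytes without parsing them."""
    data, offsets = segment
    end = offsets[doc_id + 1] if doc_id + 1 < len(offsets) else len(data)
    return data[offsets[doc_id]:end]

def merge_segments(segments):
    """Merge by copying each record as an opaque chunk; no field remapping."""
    data, offsets = bytearray(), []
    for seg in segments:
        for doc_id in range(len(seg[1])):
            offsets.append(len(data))
            data += raw_record(seg, doc_id)  # never decoded during the merge
    return bytes(data), offsets

def read_doc(segment, doc_id):
    """Fetch and decode one document from a segment."""
    return decode_doc(raw_record(segment, doc_id))
```

Note that the two segments being merged may have entirely different sets of fields, and `merge_segments` never needs to know. With field numbers, the merger would have to decode each record and rewrite the numbers whenever the segments' number-to-name tables disagreed.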

If Lucene were to go this route, my suggestion would be to start a new subclass of FieldsWriter that uses different index extensions. (KS uses .ds and .dsx: "Document Storage".) Individual SegmentReaders can then decide which subclass to use based on which files are detected.
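
The file-detection step might look roughly like the sketch below. It is written in Python for brevity; the class names and the `choose_fields_reader` helper are hypothetical, and the only details taken from real systems are the extensions: .ds/.dsx as used by KS, and .fdt/.fdx as used by Lucene's existing stored-fields files.

```python
class NumberedFieldsReader:
    """Stand-in for a reader of the existing format (field numbers)."""
    extensions = (".fdt", ".fdx")

class NamedFieldsReader:
    """Stand-in for a reader of the proposed format (inline field names)."""
    extensions = (".ds", ".dsx")

def choose_fields_reader(file_names, segment_name):
    """Pick a reader class based on which files exist for this segment.

    file_names: names of the files present in the index directory.
    segment_name: the segment's file-name prefix, e.g. "_1".
    """
    present = set(file_names)
    # Prefer the newer format if both happen to be present.
    for reader_cls in (NamedFieldsReader, NumberedFieldsReader):
        needed = {segment_name + ext for ext in reader_cls.extensions}
        if needed <= present:
            return reader_cls
    raise FileNotFoundError(
        "no stored-fields files found for segment " + segment_name)
```

Since the decision is made per segment, an index could hold old-format and new-format segments side by side, with older segments picking up the new format as they get merged.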


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



