On Nov 1, 2007, at 3:04 AM, Michael McCandless wrote:
> In KinoSearch, merging of stored fields & term vectors is always a fast concatenation of the entry for that document, whereas Lucene must re-interpret/re-number all fields on the doc, in general. In fact I think that KinoSearch stores field names directly in the index (ie, not numbers).
Yes, that's right. <http://xrl.us/73dx> (Link to mail-archives.apache.org)
Ferret and KS had both previously implemented Robert's suggested mod, where no remaps take place if field numbers can be matched up. KS also expended extra effort to keep field numbers consistent (and I think Ferret did too) -- but the possibility that we would have to remap couldn't ever be eliminated.
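To make the "no remap if the numbers match up" idea concrete, here is a minimal Python sketch. It is not code from Lucene, Ferret, or KS; the function name and the dict-based name-to-number tables are assumptions made purely for illustration. An empty result means the segment's records could be copied without renumbering.

```python
def merge_field_numbers(merged, segment):
    """Fold a segment's {field name: number} table into the merged table.

    Returns a dict mapping old field numbers to new ones. An empty dict
    means every field kept its number, so no remap is needed.
    (Hypothetical helper; the data layout is an assumption.)
    """
    remap = {}
    # Visit fields in order of their old numbers so results are deterministic.
    for name, old in sorted(segment.items(), key=lambda kv: kv[1]):
        if name not in merged:
            taken = set(merged.values())
            # Keep the old number if it is still free; otherwise take a new one.
            merged[name] = old if old not in taken else max(taken) + 1
        new = merged[name]
        if new != old:
            remap[old] = new
    return remap


merged = {"title": 0, "body": 1}
print(merge_field_numbers(merged, {"title": 0, "body": 1}))  # numbers agree
print(merge_field_numbers(merged, {"body": 0, "title": 1}))  # numbers clash
```

The second call shows why the remap path can never be dropped entirely: as soon as two segments disagree about a single number, every record in one of them has to be rewritten.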
Going with field names rather than numbers allowed KS to eliminate a big chunk of code. For the price of a small increase in index size, the segment merging process for stored fields and term vectors got much simpler. No more parsing, no more remapping -- it became possible to read the record naively as one chunk and copy it, no matter what.
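A rough Python sketch of what "read the record as one chunk and copy it" can look like. The record layout below (length-prefixed name/value pairs) is invented for illustration and is not the actual KS file format; `encode_doc`, `decode_doc`, and `merge_segments` are hypothetical names. The point is only that the merge loop never looks inside a record.

```python
import struct

HEADER = ">HI"  # assumed layout: 2-byte name length, 4-byte value length
HEADER_SIZE = struct.calcsize(HEADER)


def encode_doc(fields):
    """Serialize {name: value} with the field names stored inline."""
    out = bytearray()
    for name, value in fields.items():
        n, v = name.encode("utf-8"), value.encode("utf-8")
        out += struct.pack(HEADER, len(n), len(v)) + n + v
    return bytes(out)


def decode_doc(record):
    """Parse one record back into {name: value}."""
    fields, pos = {}, 0
    while pos < len(record):
        n_len, v_len = struct.unpack_from(HEADER, record, pos)
        pos += HEADER_SIZE
        name = record[pos:pos + n_len].decode("utf-8")
        pos += n_len
        fields[name] = record[pos:pos + v_len].decode("utf-8")
        pos += v_len
    return fields


def merge_segments(segments):
    """Concatenate opaque doc records; return (data, start offset per doc)."""
    data, offsets = bytearray(), []
    for records in segments:
        for record in records:
            offsets.append(len(data))
            data += record  # copied as-is: no parsing, no remapping
    return bytes(data), offsets
```

Because each record carries its own field names, two segments with completely different field sets merge the same way as two identical ones. The cost is the repeated name bytes in every record, which is the "small increase in index size" mentioned above.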
If Lucene were to go this route, my suggestion would be to start a new subclass of FieldsWriter that uses different index extensions. (KS uses .ds and .dsx: "Document Storage".) Individual SegmentReaders can then decide which subclass to use based on which files are detected.
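The file-detection idea might look something like the following Python sketch. The class names and `pick_fields_reader` are hypothetical, and Lucene itself is Java, so this only illustrates the dispatch logic. The .fdt/.fdx pair is Lucene's existing stored-fields extensions; .ds/.dsx are the KS ones mentioned above.

```python
class NumberedFieldsReader:
    """Stand-in for a reader of the existing number-based format."""
    extensions = (".fdt", ".fdx")


class NamedFieldsReader:
    """Stand-in for a reader of a name-based format."""
    extensions = (".ds", ".dsx")


def pick_fields_reader(file_names, segment_name):
    """Choose a reader class based on which files exist for the segment.

    file_names is the collection of file names found in the index directory.
    """
    for cls in (NamedFieldsReader, NumberedFieldsReader):
        if all(segment_name + ext in file_names for ext in cls.extensions):
            return cls
    raise FileNotFoundError("no stored-fields files for segment " + segment_name)
```

Since the choice is made per segment, old and new segments could coexist in one index, and older segments would move to the new format as they get merged away.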
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
