[wild brainstorming...]

Another reason to consolidate the freqs, positions, and boosts/norms into one file: we can isolate and distill the code that encodes/decodes that file into a plugin, weakening the current tight coupling between Lucene and its file format. Changing that index format might then be a little less painful, as we'd just write a new plugin but leave the old one sitting there. We may not be able to write plugin code for an entire index, but we can write some for each file.

I'm imagining a PostingsWriter interface that each plugin would implement, then a complementary PostingsReader. PostingsReader would look a lot like TermPositions does now, but would add getBoost(). To this, a POSPostingsReader subclass might add getPartOfSpeech().
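To make the shape of that pair a bit more concrete, here is a rough sketch. Only the names PostingsWriter, PostingsReader, POSPostingsReader, getBoost() and getPartOfSpeech() come from the paragraph above; the method signatures, the addPosting() call, and the InMemoryPostings toy codec are assumptions made up for illustration, loosely modeled on the next()/doc()/freq()/nextPosition() style of TermPositions. It is not actual Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical write side of a postings plugin. */
interface PostingsWriter {
    void addPosting(int doc, int[] positions, float boost);
}

/** Hypothetical read side: TermPositions-style iteration plus getBoost(). */
interface PostingsReader {
    boolean next();       // advance to the next document, false when exhausted
    int doc();            // current document number
    int freq();           // number of positions in the current document
    int nextPosition();   // next position within the current document
    float getBoost();     // boost stored alongside the current posting
}

/** Hypothetical extension for a codec that also stores part-of-speech tags. */
interface POSPostingsReader extends PostingsReader {
    String getPartOfSpeech();
}

/** Toy codec that keeps postings in memory instead of encoding a file. */
final class InMemoryPostings implements PostingsWriter {
    private final List<Integer> docs = new ArrayList<>();
    private final List<int[]> positionLists = new ArrayList<>();
    private final List<Float> boosts = new ArrayList<>();

    @Override
    public void addPosting(int doc, int[] positions, float boost) {
        docs.add(doc);
        positionLists.add(positions.clone());
        boosts.add(boost);
    }

    /** Returns a reader over everything written so far. */
    PostingsReader reader() {
        return new PostingsReader() {
            private int cursor = -1;
            private int posCursor = 0;

            @Override public boolean next() {
                cursor++;
                posCursor = 0;
                return cursor < docs.size();
            }
            @Override public int doc() { return docs.get(cursor); }
            @Override public int freq() { return positionLists.get(cursor).length; }
            @Override public int nextPosition() { return positionLists.get(cursor)[posCursor++]; }
            @Override public float getBoost() { return boosts.get(cursor); }
        };
    }
}
```

A real plugin would serialize to and from the postings file rather than hold lists in memory, but the caller would only ever see the two interfaces, which is the decoupling being argued for.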

In addition to the postings file, we might want a stored fields file plugin. Maybe call those interfaces DBWriter and DBReader. This is trickier, because stored fields are not inverted, so if we used different codecs for each field, their output would have to be interleaved. Bleah. Seems more like we'd want to use a plugin for the entire file, with a limited selection of per-field options.

Each segment would have a file recording which codecs were in use. Each field name, once associated with a codec, could not be modified to use another. No more reconciliation of indexed/notIndexed, omitNorms/notOmitNorms.
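The "once associated, never changed" rule could be enforced with something as small as the following. This is only a sketch: FieldCodecRegistry, bind(), codecFor() and the codec name strings are all hypothetical, and a real version would also have to be persisted in the per-segment file mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical record of which codec each field name is bound to. */
final class FieldCodecRegistry {
    // Insertion-ordered so the mapping could be written out deterministically.
    private final Map<String, String> codecByField = new LinkedHashMap<>();

    /** Binds a field to a codec; rebinding to a different codec is an error. */
    void bind(String field, String codecName) {
        String existing = codecByField.putIfAbsent(field, codecName);
        if (existing != null && !existing.equals(codecName)) {
            throw new IllegalStateException(
                "field '" + field + "' already uses codec '" + existing
                + "', cannot switch to '" + codecName + "'");
        }
    }

    /** Returns the codec name for a field, or null if the field is unknown. */
    String codecFor(String field) {
        return codecByField.get(field);
    }
}
```

Binding the same field to the same codec twice is a no-op here; only a conflicting binding fails, which replaces the reconciliation step with a hard error at the point of conflict.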

Does it make sense then to have the Term Dictionary as a plugin? I think so. But maybe rather than ordering all terms first by field name then by term text, each indexed field should have its own dictionary file, ordered by term text. Then the dictionary file could have per-field customization as well.
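As a toy model of the per-field layout, each field below gets its own sorted map standing in for its own dictionary file. The class name, the methods, and the idea of mapping a term to a postings offset are assumptions for illustration only; ordering is plain Java String order (UTF-16 code units), which a real codec might well want to replace.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical term dictionary split per field, each ordered by term text. */
final class PerFieldTermDictionary {
    // One sorted map per field stands in for one dictionary file per field.
    private final Map<String, TreeMap<String, Long>> dictByField = new HashMap<>();

    /** Records where a term's postings start (the offset is illustrative). */
    void add(String field, String term, long postingsOffset) {
        dictByField.computeIfAbsent(field, f -> new TreeMap<>())
                   .put(term, postingsOffset);
    }

    /** Returns the postings offset for a term, or null if absent. */
    Long lookup(String field, String term) {
        TreeMap<String, Long> dict = dictByField.get(field);
        return dict == null ? null : dict.get(term);
    }

    /** Returns one field's terms in sorted order. */
    List<String> terms(String field) {
        TreeMap<String, Long> dict = dictByField.get(field);
        if (dict == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(dict.keySet());
    }
}
```

Because the field name is no longer part of the sort key, each field's dictionary could in principle pick its own term encoding or ordering without affecting the others.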

The point of this exercise is to generalize the high-level data structures required by an inverted indexing engine:

  * Term Dictionary
  * Postings
  * Stored Fields Database
  * Term Vectors (optional)

In my view, each of these should have its own pluggable codec.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

