Marvin Humphrey <[email protected]> wrote:
> In another thread, I described a proposed "Snapshot" class. The goal we're
> working towards is pluggable index reading/writing via Architecture and
> DataReader/DataWriter. The constructor for the prototype implementation of
> DataWriter I've worked up in KS takes three arguments: a Snapshot, a
> PolyReader (analogous to Lucene's MultiSegmentReader)... and a "Segment".
>
> The Segment class has three main responsibilities:
>
> * Keep track of how many documents are in the segment (not counting
> deletions).
> * Maintain per-segment field-name-to-field-number associations.
> * Write the "segmeta" file, which stores arbitrary metadata.

Ahh, there's the answer (to my "what about segment metadata"
questions). This is good!

Shouldn't segmeta itself have a format too?

Are you going to provide utility APIs that components can use to deal
with the format number? We don't in Lucene, but it's been broached:
e.g., a component could register the N formats it's able to deal with,
so that a consistent error is thrown if a format is too old or too new.

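To make the format-registration idea concrete, here is a rough sketch in C. Everything in it is hypothetical: FormatSupport, FormatStatus, FormatSupport_check and FormatStatus_message are names invented for illustration and exist in neither Lucene nor the KS prototype. It assumes each component registers an inclusive range of format numbers it can read, and that the format number has already been parsed to an integer.

```c
#include <stddef.h>

/* Hypothetical result codes; not part of Lucene, KS, or Lucy. */
typedef enum {
    FORMAT_OK = 0,
    FORMAT_TOO_OLD,
    FORMAT_TOO_NEW
} FormatStatus;

/* Hypothetical registration record: the inclusive range of format
 * numbers that one component knows how to read. */
typedef struct {
    const char *component;   /* e.g. "lexicon", "postings" */
    int         min_format;  /* oldest format still readable */
    int         max_format;  /* newest format understood */
} FormatSupport;

/* Compare the format number found in segmeta against the registered
 * range, so every component reports problems the same way. */
static FormatStatus
FormatSupport_check(const FormatSupport *support, int found)
{
    if (found < support->min_format) { return FORMAT_TOO_OLD; }
    if (found > support->max_format) { return FORMAT_TOO_NEW; }
    return FORMAT_OK;
}

/* Map a status to one shared message (NULL when the format is fine). */
static const char*
FormatStatus_message(FormatStatus status)
{
    switch (status) {
        case FORMAT_TOO_OLD: return "format is too old for this component";
        case FORMAT_TOO_NEW: return "format is too new for this component";
        default:             return NULL;
    }
}
```

With something like this, a component that registered formats 1 through 2 would accept the "lexicon" entry shown further down (format 2) and reject a format 3 segment with the same error text any other component would produce.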
> The Segment's doc count is used both at index time...
>
>     void
>     SegWriter_add_doc(SegWriter *self, Doc *doc)
>     {
>         i32_t doc_num = Seg_Increment_Doc_Count(self->segment, 1);
>         Inverter_Invert_Doc(self->inverter, doc);
>         SegWriter_Add_Inverted_Doc(self, self->inverter, doc_num);
>     }
>
> ... and at search-time:
>
>     i32_t
>     SegReader_doc_max(SegReader *self)
>     {
>         return Seg_Get_Doc_Count(self->segment);
>     }
>
> In Lucene, field-name-to-field-number mappings are the province of the
> FieldInfos class, which also tracks field characteristics such as "isStored".
> Lucy uses global field semantics, though, so there's no need for per-segment
> field specs.
>
> The "segmeta" file is used to store both metadata needed by Segment itself and
> metadata belonging to other index components:
>
>     {
>        "lexicon" : {
>           "counts" : {
>              "content" : "20576"
>           },
>           "format" : "2",
>           "index_counts" : {
>              "content" : "161"
>           }
>        },
>        "postings" : {
>           "format" : "1"
>        },
>        "records" : {
>           "format" : "1"
>        },
>        "segmeta" : {
>           "doc_count" : "11054",
>           "field_names" : [
>              "",
>              "title",
>              "category",
>              "content",
>              "url"
>           ],
>           "format" : "1",
>           "name" : "seg_3"
>        },
>        "term_vectors" : {
>           "format" : "1"
>        }
>     }
>
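As an aside on the "field_names" list above, here is a minimal sketch in C of how a per-segment name-to-number lookup could work. It assumes that a field's number is simply its position in that list and that the empty string at index 0 is a placeholder that never matches a real field. Both are guesses from the example, not something the prototype confirms, and field_num is a hypothetical helper.

```c
#include <stddef.h>
#include <string.h>

/* Field names copied from the "segmeta" example above.
 * Assumption: index 0 (the empty string) is a placeholder. */
static const char *const field_names[] = {
    "", "title", "category", "content", "url"
};
static const size_t num_fields
    = sizeof(field_names) / sizeof(field_names[0]);

/* Hypothetical helper: return the field number for `name`, or -1 if
 * this segment has no such field.  Assumes field number == list index. */
static int
field_num(const char *name)
{
    size_t i;
    /* Start at 1 to skip the placeholder entry. */
    for (i = 1; i < num_fields; i++) {
        if (strcmp(field_names[i], name) == 0) {
            return (int)i;
        }
    }
    return -1;
}
```

A linear scan is enough for a sketch; with many fields, a hash keyed on the field name would be the more obvious choice.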
> Providing a place for plugin indexing components to store arbitrary metadata
> relieves them from the responsibility for writing and parsing metadata
> themselves. In Lucene, metadata classes such as FieldInfos have their own
> binary file formats and maintain their own parsing routines, bloating the
> Lucene file format documentation and adding maintenance overhead. While
> binary formats are necessary for bulk data, for small amounts of metadata they
> hinder bare-eye browsing and provide no significant performance advantage.
>
> In some sense Segment is similar to the Lucene class SegmentInfo. For
> example, both of them store format version data; however, Segment is only
> aware of its own format, and it is up to individual plugins to track their own
> format versions and adjust behavior as needed. SegmentInfo is tightly bound
> to other Lucene classes because it knows too much about them, hindering
> extensibility; Segment, while capable of storing much more data than SegmentInfo
> since it uses generic scalar-list-mapping data structures, knows nothing about
> any of the plugin components that access that data.
>
> Prototype code:
>
> http://tinyurl.com/proto-seg-bp
> http://tinyurl.com/proto-seg-c
>
> HTML presentation of public API documentation for Perl binding:
>
> http://tinyurl.com/seg-dev-docs
>
> Marvin Humphrey