Re: Segment

Michael McCandless Tue, 24 Mar 2009 05:11:42 -0700

Marvin Humphrey <[email protected]> wrote:

> In another thread, I described a proposed "Snapshot" class.  The goal we're
> working towards is pluggable index reading/writing via Architecture and
> DataReader/DataWriter.  The constructor for the prototype implementation of
> DataWriter I've worked up in KS takes three arguments: a Snapshot, a
> PolyReader (analogous to Lucene's MultiSegmentReader)... and a "Segment".
>
> The Segment class has three main responsibilities:
>
>  * Keep track of how many documents are in the segment (not counting
>    deletions).
>  * Maintain per-segment field-name-to-field-number associations.
>  * Write the "segmeta" file, which stores arbitrary metadata.


Ahh, there's the answer (to my "what about segment metadata"
questions).  This is good!

Shouldn't segmeta itself have a format too?

Are you going to provide utility APIs that components can use to deal
with the format number?  We don't in Lucene, but it's been broached,
e.g. so that a component can register the N formats it's able to deal
with and a consistent error is thrown if a format is too old or too
new.
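
To make the idea concrete, here is a minimal sketch of such a utility.
Everything in it (FormatRange, FormatStatus, FormatRange_check) is a
hypothetical name invented for illustration, not an existing KS, Lucy or
Lucene API.  It assumes a component can describe what it supports as an
inclusive range of format numbers:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical: the inclusive range of format numbers a component reads. */
typedef struct FormatRange {
    const char *component;   /* e.g. "lexicon", as keyed in segmeta */
    int32_t     min_format;  /* oldest format still readable */
    int32_t     max_format;  /* newest format this code knows about */
} FormatRange;

typedef enum FormatStatus {
    FORMAT_OK = 0,
    FORMAT_TOO_OLD,
    FORMAT_TOO_NEW
} FormatStatus;

/* Check the format number found in segmeta against the supported range.
 * On failure, write a uniformly worded message into msg so that every
 * component reports format problems the same way. */
static FormatStatus
FormatRange_check(const FormatRange *range, int32_t found,
                  char *msg, size_t msg_size)
{
    if (found < range->min_format) {
        snprintf(msg, msg_size,
                 "%s: format %d is too old (supported: %d to %d)",
                 range->component, (int)found,
                 (int)range->min_format, (int)range->max_format);
        return FORMAT_TOO_OLD;
    }
    if (found > range->max_format) {
        snprintf(msg, msg_size,
                 "%s: format %d is too new (supported: %d to %d)",
                 range->component, (int)found,
                 (int)range->min_format, (int)range->max_format);
        return FORMAT_TOO_NEW;
    }
    if (msg_size > 0) {
        msg[0] = '\0';  /* no error: leave an empty message */
    }
    return FORMAT_OK;
}
```

A component would call the check once, right after pulling its "format"
entry out of segmeta, and refuse to open the segment on anything other
than FORMAT_OK.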

> The Segment's doc count is used both at index time...
>
>    void
>    SegWriter_add_doc(SegWriter *self, Doc *doc)
>    {
>        i32_t doc_num = Seg_Increment_Doc_Count(self->segment, 1);
>        Inverter_Invert_Doc(self->inverter, doc);
>        SegWriter_Add_Inverted_Doc(self, self->inverter, doc_num);
>    }
>
> ... and at search-time:
>
>    i32_t
>    SegReader_doc_max(SegReader *self)
>    {
>        return Seg_Get_Doc_Count(self->segment);
>    }
>
> In Lucene, field-name-to-field-number mappings are the province of the
> FieldInfos class, which also tracks field characteristics such as "isStored".
> Lucy uses global field semantics, though, so there's no need for per-segment
> field specs.
>
> The "segmeta" file is used to store both metadata needed by Segment itself and
> metadata belonging to other index components:
>
>    {
>       "lexicon" : {
>          "counts" : {
>             "content" : "20576"
>          },
>          "format" : "2",
>          "index_counts" : {
>             "content" : "161"
>          }
>       },
>       "postings" : {
>          "format" : "1"
>       },
>       "records" : {
>          "format" : "1"
>       },
>       "segmeta" : {
>          "doc_count" : "11054",
>          "field_names" : [
>             "",
>             "title",
>             "category",
>             "content",
>             "url"
>          ],
>          "format" : "1",
>          "name" : "seg_3"
>       },
>       "term_vectors" : {
>          "format" : "1"
>       }
>    }
>
> Providing a place for plugin indexing components to store arbitrary metadata
> relieves them from the responsibility for writing and parsing metadata
> themselves.  In Lucene, metadata classes such as FieldInfos have their own
> binary file formats and maintain their own parsing routines, bloating the
> Lucene file format documentation and adding maintenance overhead.  While
> binary formats are necessary for bulk data, for small amounts of metadata they
> hinder bare-eye browsing and provide no significant performance advantage.
>
> In some sense Segment is similar to the Lucene class SegmentInfo.  For
> example, both of them store format version data; however, Segment is only
> aware of its own format, and it is up to individual plugins to track their own
> format versions and adjust behavior as needed.  SegmentInfo is tightly bound
> to other Lucene classes because it knows too much about them, hindering
> extensibility; Segment, while capable of storing much more data than
> SegmentInfo since it uses generic scalar-list-mapping data structures, knows
> nothing about any of the plugin components that access that data.
>
> Prototype code:
>
>  http://tinyurl.com/proto-seg-bp
>  http://tinyurl.com/proto-seg-c
>
> HTML presentation of public API documentation for Perl binding:
>
>  http://tinyurl.com/seg-dev-docs
>
> Marvin Humphrey
>
>
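
For readers following along, the per-segment field-name-to-field-number
association quoted above can be sketched roughly as follows.  This is not
the prototype code: FieldMap and its functions are hypothetical names, the
fixed capacity is arbitrary, and treating slot 0 as reserved is an
assumption based on the empty first entry of "field_names" in the sample
segmeta:

```c
#include <stdint.h>
#include <string.h>

#define FIELDMAP_MAX_FIELDS 32  /* arbitrary cap for this sketch */

/* Hypothetical: field number == index into names. */
typedef struct FieldMap {
    const char *names[FIELDMAP_MAX_FIELDS];  /* borrowed, not owned */
    int32_t     count;
} FieldMap;

static void
FieldMap_init(FieldMap *self)
{
    self->names[0] = "";  /* assumed reserved slot, so real fields start at 1 */
    self->count    = 1;
}

/* Return the field number for name, or 0 if the name is unknown. */
static int32_t
FieldMap_field_num(const FieldMap *self, const char *name)
{
    for (int32_t i = 1; i < self->count; i++) {
        if (strcmp(self->names[i], name) == 0) {
            return i;
        }
    }
    return 0;
}

/* Return the existing number for name, or assign the next free one.
 * Returns -1 if the map is full. */
static int32_t
FieldMap_add_field(FieldMap *self, const char *name)
{
    int32_t existing = FieldMap_field_num(self, name);
    if (existing != 0) {
        return existing;
    }
    if (self->count >= FIELDMAP_MAX_FIELDS) {
        return -1;
    }
    self->names[self->count] = name;
    return self->count++;
}
```

Adding "title", "category", "content" and "url" in that order yields the
numbers 1 through 4, matching the positions in the sample "field_names"
array.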
