Segment

Marvin Humphrey Sun, 22 Mar 2009 19:25:23 -0700

Greets,

In another thread, I described a proposed "Snapshot" class.  The goal we're
working towards is pluggable index reading/writing via Architecture and
DataReader/DataWriter.  The constructor for the prototype implementation of
DataWriter I've worked up in KS takes three arguments: a Snapshot, a
PolyReader (analogous to Lucene's MultiSegmentReader)... and a "Segment".


The Segment class has three main responsibilities:

  * Keep track of how many documents are in the segment (deleted documents
    are still included in the count).
  * Maintain per-segment field-name-to-field-number associations.
  * Write the "segmeta" file, which stores arbitrary metadata.

The Segment's doc count is used both at index time...

    void
    SegWriter_add_doc(SegWriter *self, Doc *doc)
    {
        i32_t doc_num = Seg_Increment_Doc_Count(self->segment, 1);
        Inverter_Invert_Doc(self->inverter, doc);
        SegWriter_Add_Inverted_Doc(self, self->inverter, doc_num);
    }

... and at search-time:

    i32_t
    SegReader_doc_max(SegReader *self)
    {
        return Seg_Get_Doc_Count(self->segment);
    }
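
To make the relationship between those two functions concrete, here is a
minimal sketch of the doc count bookkeeping.  ToySegment and its functions are
hypothetical stand-ins, not the prototype code.  The sketch assumes doc numbers
start at 1, so that the count returned by an increment doubles as the new
doc's number and the total doubles as doc_max.  It uses int32_t from
<stdint.h> in place of i32_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for Segment; only the doc count is modeled. */
typedef struct ToySegment {
    int32_t doc_count;
} ToySegment;

/* Add `increment` docs and return the new total.  Assumption: doc numbers
 * are 1-based, so after adding a single doc the return value is that doc's
 * number. */
int32_t
ToySeg_increment_doc_count(ToySegment *self, int32_t increment)
{
    assert(increment >= 0);   /* the count only ever grows in this sketch */
    self->doc_count += increment;
    return self->doc_count;
}

/* Return the number of docs added so far.  Deletions are not tracked here,
 * so under the 1-based assumption this is also the highest doc number. */
int32_t
ToySeg_get_doc_count(const ToySegment *self)
{
    return self->doc_count;
}
```

Starting from a zeroed ToySegment, the first increment by 1 yields doc number
1, the second yields 2, and the reader side then reports a doc max of 2.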

In Lucene, field-name-to-field-number mappings are the province of the
FieldInfos class, which also tracks field characteristics such as "isStored".
Lucy uses global field semantics, though, so there's no need for per-segment
field specs.
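
As a rough illustration of what a per-segment name-to-number association
might look like, here is a sketch built on a plain array.  ToyFieldMap and
its functions are hypothetical names, not part of the prototype.  The sketch
assumes a field's number is simply its position in the "field_names" list
that appears in the segmeta file, with slot 0 occupied by an empty-string
placeholder so that real fields start at 1.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_FIELDS 32

/* Hypothetical per-segment field map: array index == field number. */
typedef struct ToyFieldMap {
    const char *names[TOY_MAX_FIELDS];
    int32_t     count;
} ToyFieldMap;

void
ToyFieldMap_init(ToyFieldMap *self)
{
    self->names[0] = "";   /* placeholder, so real fields start at 1 */
    self->count    = 1;
}

/* Return the field number for `name`, or 0 if it is not registered. */
int32_t
ToyFieldMap_field_num(const ToyFieldMap *self, const char *name)
{
    for (int32_t i = 1; i < self->count; i++) {
        if (strcmp(self->names[i], name) == 0) { return i; }
    }
    return 0;
}

/* Register `name` if it is new and return its field number.  The caller
 * must keep the string alive; this sketch does not copy it. */
int32_t
ToyFieldMap_add_field(ToyFieldMap *self, const char *name)
{
    int32_t num = ToyFieldMap_field_num(self, name);
    if (num != 0) { return num; }
    assert(self->count < TOY_MAX_FIELDS);
    self->names[self->count] = name;
    return self->count++;
}

/* Return the name for a field number, or NULL if it is out of range. */
const char*
ToyFieldMap_field_name(const ToyFieldMap *self, int32_t num)
{
    return (num > 0 && num < self->count) ? self->names[num] : NULL;
}
```

The linear scan is only reasonable because a segment typically has a handful
of fields; a real implementation would more likely use a hash.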

The "segmeta" file is used to store both metadata needed by Segment itself and
metadata belonging to other index components:

    {
       "lexicon" : { 
          "counts" : { 
             "content" : "20576"
          },  
          "format" : "2",
          "index_counts" : { 
             "content" : "161"
          }   
       },  
       "postings" : { 
          "format" : "1" 
       },  
       "records" : { 
          "format" : "1" 
       },  
       "segmeta" : { 
          "doc_count" : "11054",
          "field_names" : [ 
             "", 
             "title",
             "category",
             "content",
             "url"
          ],  
          "format" : "1",
          "name" : "seg_3"
       },  
       "term_vectors" : { 
          "format" : "1" 
       }   
    }

Providing a place for plugin indexing components to store arbitrary metadata
relieves them of the responsibility of writing and parsing that metadata
themselves.  In Lucene, metadata classes such as FieldInfos have their own
binary file formats and maintain their own parsing routines, bloating the
Lucene file format documentation and adding maintenance overhead.  While
binary formats are necessary for bulk data, for small amounts of metadata they
hinder bare-eye browsing and provide no significant performance advantage.

In some sense Segment is similar to the Lucene class SegmentInfo.  For
example, both of them store format version data; however, Segment is only
aware of its own format, and it is up to individual plugins to track their own
format versions and adjust behavior as needed.  SegmentInfo is tightly bound
to other Lucene classes because it knows too much about them, hindering
extensibility; Segment, while capable of storing much more data than
SegmentInfo since it uses generic scalar-list-mapping data structures, knows
nothing about any of the plugin components that access that data.
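
Here is a sketch of how that division of labor could play out for a single
plugin.  Everything in it is hypothetical: ToyMeta is a flat list of
(component, key, value) strings standing in for the nested segmeta mapping,
and ToyLexicon_format_supported is an invented check, not prototype code.  The
point is only that the store never interprets what it holds, while the plugin
decides for itself which format versions it can read.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_ENTRIES 64

/* One metadata value, filed under the component that owns it. */
typedef struct ToyMetaEntry {
    const char *component;
    const char *key;
    const char *value;
} ToyMetaEntry;

/* Hypothetical generic store; it knows nothing about its clients. */
typedef struct ToyMeta {
    ToyMetaEntry entries[TOY_MAX_ENTRIES];
    int          count;
} ToyMeta;

/* Append a value.  Strings are borrowed, not copied. */
void
ToyMeta_store(ToyMeta *self, const char *component, const char *key,
              const char *value)
{
    assert(self->count < TOY_MAX_ENTRIES);
    ToyMetaEntry entry = { component, key, value };
    self->entries[self->count++] = entry;
}

/* Return the most recently stored value for component/key, or NULL. */
const char*
ToyMeta_fetch(const ToyMeta *self, const char *component, const char *key)
{
    for (int i = self->count - 1; i >= 0; i--) {
        const ToyMetaEntry *entry = &self->entries[i];
        if (strcmp(entry->component, component) == 0
            && strcmp(entry->key, key) == 0) {
            return entry->value;
        }
    }
    return NULL;
}

/* Highest lexicon format this imaginary plugin knows how to read. */
#define TOY_LEXICON_MAX_FORMAT 2

/* Plugin-side check: return 1 if the stored lexicon format is one this
 * plugin can read, 0 if it is missing or unsupported. */
int
ToyLexicon_format_supported(const ToyMeta *meta)
{
    const char *format = ToyMeta_fetch(meta, "lexicon", "format");
    if (format == NULL) { return 0; }
    long version = strtol(format, NULL, 10);
    return version >= 1 && version <= TOY_LEXICON_MAX_FORMAT;
}
```

Values are kept as strings here because that is how they appear in the
segmeta example; the plugin does its own conversion.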

Prototype code:

  http://tinyurl.com/proto-seg-bp
  http://tinyurl.com/proto-seg-c

HTML presentation of public API documentation for Perl binding:

  http://tinyurl.com/seg-dev-docs

Marvin Humphrey
