On Apr 10, 2008, at 2:37 AM, Michael McCandless wrote:
IMO, the abstract base Posting class should not track text. It should include only one datum: a document number. This keeps it in line with the simplest IR definition for a "posting": one document matching one term.

But how do you then write out a segment with the terms packed, in
sorted order?  Your "generic" layer needs to know how to sort these
Posting lists by term text, right?

Yes. But the generic layer is the semi-serialized export form RawPosting, not the abstract base class Posting. (Would it be clearer if RawPosting was named "FlatPosting"?)

  struct kino_Posting {
      kino_VirtualTable* _;
      kino_ref_t ref;
      chy_u32_t doc_num;
  };

  struct kino_RawPosting {
      kino_VirtualTable* _;
      kino_ref_t ref;
      chy_u32_t doc_num;
      chy_u32_t freq;
      chy_u32_t content_len;
      chy_u32_t aux_len;
      char blob[1];    /* flexible array */
  };

The last member of RawPosting, "blob", uses the C "flexible array" hack which takes advantage of C's lack of bounds checking. A RawPosting consists of a single contiguous memory allocation, but the size of that allocation varies depending on how much information the object needs to carry.

The "blob" is divided into two parts: content, and "aux". Content is term text. "aux" is encoded material which will be written to the postings file. The total allocation for any given RawPosting can be calculated like so:

   offsetof(RawPosting, blob) + posting->content_len + posting->aux_len

Sorting is done by comparing first the "content" component of blob using memcmp(), then breaking ties with content_len and finally, doc_num.

The write method invoked by PostingsWriter is the same -- RawPost_Write_Record -- no matter what Posting subclass is assigned to the field. The first part of the implementing function RawPost_write_record may look familiar to you, as it uses a close variant of Lucene's algo for encoding doc_num and freq together. Note the last line, though:

    void
    RawPost_write_record(RawPosting *self, OutStream *outstream,
                         u32_t last_doc_num)
    {
        const u32_t delta_doc = self->doc_num - last_doc_num;
        char  *const aux_content = self->blob + self->content_len;

        /* Write doc num, freq. */
        if (self->freq == 1) {
            const u32_t doc_code = (delta_doc << 1) | 1;
            OutStream_Write_C32(outstream, doc_code);
        }
        else {
            const u32_t doc_code = delta_doc << 1;
            OutStream_Write_C32(outstream, doc_code);
            OutStream_Write_C32(outstream, self->freq);
        }

        /* Print prepared data. */
        OutStream_Write_Bytes(outstream, aux_content, self->aux_len);
    }

The idea is that information other than doc num and freq gets serialized during the export operation which creates the RawPosting -- then the last line of RawPost_write_record prints the pre-serialized data to the OutStream.

OK I now see that what we call Posting really should be called
PostingList: each instance of this class, in DW, tracks all documents
that contained that term.

I suppose that would be more accurate... though in KS, the "PostingList" class is the replacement for TermDocs, TermPositions, etc.

Whereas for KS, Posting is a single occurrence of term in a single doc, right?

It's one term matching one document.

These are the definitions I'm using, as set out in KinoSearch::Docs::IRTheory at <http://xrl.us/bi8w2>:

    * document - An atomic unit of retrieval.
    * term - An attribute which describes a document.
    * posting - One term indexing one document.
    * term list - The complete list of terms which describe a document.
* posting list - The complete list of documents which a term indexes. * inverted index - A data structure which maps from terms to documents.

Does a Posting contain all
occurrences of the term in the doc (multiple positions) or only one?

In KS, a Posting object can represent multiple occurrences of a term within a single document. That was somewhat arbitrary on my part.

How do you do buffering/flushing?

RawPostings are fed into an object which handles external sorting. The external sorter tracks RAM usage, and checks once per document whether it should flush a sorted run to temporary storage.

After the last document gets added, the external sorter flips to fetching mode. The pre-sorted runs are merged into a sorted stream of RawPostings, and the postings and lexicon files get written out.

Then, for search-time you have a PostingList class which takes the place of TermDocs/TermPositions, and uses an underlying Posting object to read the
file.  (PostingList and its subclasses don't know anything about file
formats.)

Wouldn't PostingList need to know something of the file format?  EG
maybe it's a sparse format (docID or gap encoded each time), or, it's
non-sparse (like norms, column-stride fields).

Each subclass of Posting implements its own Read_Record() method. PostingList just hands off the InStream to its internal Posting object -- it doesn't know or care how much data from the InStream the Posting consumes with each Read_Record().

As I have argued before, the key is to have each Posting subclass wholly define a file format. That makes them pluggable, breaking the tight binding
between the Lucene codebase and the Lucene file format spec.

It's not just Posting that defines the file format.

I could have been clearer -- I meant a postings file format.

I think TermVectorsWriter should be seen as a consumer of the
"inversion" plugin API.  It's just that, unlike the frq/prx writer,
which flushes when RAM is full, the TermVectorsWriter flushes after
each doc.  Ie, the generic code does the inversion, feeding "you"
Posting occurrences, and "you" write this to a file however you want.

Yes, that's how it works in KS, too.  :)

I haven't written a formal Inversion class yet, though. Upon reflection, perhaps it would be better to have "Inversion" be specific to a single field, and "InvertedDocument" collect everything together?

   public void addDocument(Document document) {
      InvertedDocument inverteDoc = invert(document)
      for (i = 0; i < writers.length; i++) {
         writers[i].addInvertedDocument(invertedDoc);
      }
   }

I think that's better in KS, because I can rename "TokenBatch" -- the class that actually does the low-level inverting and that subclasses of Analyzer need to deal with -- to "Inversion".

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to