Re: Pooling of posting objects in DocumentsWriter

Marvin Humphrey Thu, 10 Apr 2008 17:10:57 -0700


On Apr 10, 2008, at 2:37 AM, Michael McCandless wrote:

IMO, the abstract base Posting class should not track text. Itshouldinclude only one datum: a document number. This keeps it in linewith thesimplest IR definition for a "posting": one document matching oneterm.
But how do you then write out a segment with the terms packed, in
sorted order?  Your "generic" layer needs to know how to sort these
Posting lists by term text, right?

Yes. But the generic layer is the semi-serialized export formRawPosting, not the abstract base class Posting. (Would it be clearerif RawPosting was named "FlatPosting"?)


  struct kino_Posting {
      kino_VirtualTable* _;
      kino_ref_t ref;
      chy_u32_t doc_num;
  };

  struct kino_RawPosting {
      kino_VirtualTable* _;
      kino_ref_t ref;
      chy_u32_t doc_num;
      chy_u32_t freq;
      chy_u32_t content_len;
      chy_u32_t aux_len;
      char blob[1];    /* flexible array */
  };

The last member of RawPosting, "blob", uses the C "flexible array"hack which takes advantage of C's lack of bounds checking. ARawPosting consists of a single contiguous memory allocation, but thesize of that allocation varies depending on how much information theobject needs to carry.

The "blob" is divided into two parts: content, and "aux". Content isterm text. "aux" is encoded material which will be written to thepostings file. The total allocation for any given RawPosting can becalculated like so:


   offsetof(RawPosting, blob) + posting->content_len + posting->aux_len

Sorting is done by comparing first the "content" component of blobusing memcmp(), then breaking ties with content_len and finally,doc_num.

The write method invoked by PostingsWriter is the same --RawPost_Write_Record -- no matter what Posting subclass is assigned tothe field. The first part of the implementing functionRawPost_write_record may look familiar to you, as it uses a closevariant of Lucene's algo for encoding doc_num and freq together. Notethe last line, though:


    void
    RawPost_write_record(RawPosting *self, OutStream *outstream,
                         u32_t last_doc_num)
    {
        const u32_t delta_doc = self->doc_num - last_doc_num;
        char  *const aux_content = self->blob + self->content_len;

        /* Write doc num, freq. */
        if (self->freq == 1) {
            const u32_t doc_code = (delta_doc << 1) | 1;
            OutStream_Write_C32(outstream, doc_code);
        }
        else {
            const u32_t doc_code = delta_doc << 1;
            OutStream_Write_C32(outstream, doc_code);
            OutStream_Write_C32(outstream, self->freq);
        }

        /* Print prepared data. */
        OutStream_Write_Bytes(outstream, aux_content, self->aux_len);
    }

The idea is that information other than doc num and freq getsserialized during the export operation which creates the RawPosting --then the last line of RawPost_write_record prints the pre-serializeddata to the OutStream.

OK I now see that what we call Posting really should be called
PostingList: each instance of this class, in DW, tracks all documents
that contained that term.

I suppose that would be more accurate... though in KS, the"PostingList" class is the replacement for TermDocs, TermPositions, etc.

Whereas for KS, Posting is a single occurrence of term in a singledoc, right?


It's one term matching one document.

These are the definitions I'm using, as set out inKinoSearch::Docs::IRTheory at <http://xrl.us/bi8w2>:


    * document - An atomic unit of retrieval.
    * term - An attribute which describes a document.
    * posting - One term indexing one document.
    * term list - The complete list of terms which describe a document.

* posting list - The complete list of documents which a termindexes.* inverted index - A data structure which maps from terms todocuments.

Does a Posting contain all
occurrences of the term in the doc (multiple positions) or only one?

In KS, a Posting object can represent multiple occurrences of a termwithin a single document. That was somewhat arbitrary on my part.

How do you do buffering/flushing?

RawPostings are fed into an object which handles external sorting.The external sorter tracks RAM usage, and checks once per documentwhether it should flush a sorted run to temporary storage.

After the last document gets added, the external sorter flips tofetching mode. The pre-sorted runs are merged into a sorted stream ofRawPostings, and the postings and lexicon files get written out.

Then, for search-time you have a PostingList class which takes theplace ofTermDocs/TermPositions, and uses an underlying Posting object toread the
file.  (PostingList and its subclasses don't know anything about file
formats.)
Wouldn't PostingList need to know something of the file format?  EG
maybe it's a sparse format (docID or gap encoded each time), or, it's
non-sparse (like norms, column-stride fields).

Each subclass of Posting implements its own Read_Record() method.PostingList just hands off the InStream to its internal Posting object-- it doesn't know or care how much data from the InStream the Postingconsumes with each Read_Record().

As I have argued before, the key is to have each Posting subclasswhollydefine a file format. That makes them pluggable, breaking thetight binding
between the Lucene codebase and the Lucene file format spec.
It's not just Posting that defines the file format.


I could have been clearer -- I meant a postings file format.

I think TermVectorsWriter should be seen as a consumer of the
"inversion" plugin API.  It's just that, unlike the frq/prx writer,
which flushes when RAM is full, the TermVectorsWriter flushes after
each doc.  Ie, the generic code does the inversion, feeding "you"
Posting occurrences, and "you" write this to a file however you want.


Yes, that's how it works in KS, too.  :)

I haven't written a formal Inversion class yet, though. Uponreflection, perhaps it would be better to have "Inversion" be specificto a single field, and "InvertedDocument" collect everything together?


   public void addDocument(Document document) {
      InvertedDocument inverteDoc = invert(document)
      for (i = 0; i < writers.length; i++) {
         writers[i].addInvertedDocument(invertedDoc);
      }
   }

I think that's better in KS, because I can rename "TokenBatch" -- theclass that actually does the low-level inverting and that subclassesof Analyzer need to deal with -- to "Inversion".


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Pooling of posting objects in DocumentsWriter

Reply via email to