Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
>  On Apr 10, 2008, at 2:37 AM, Michael McCandless wrote:
>
> >
> > > IMO, the abstract base Posting class should not track text.  It should
> > > include only one datum: a document number.  This keeps it in line with the
> > > simplest IR definition for a "posting": one document matching one term.
> > >
> >
> > But how do you then write out a segment with the terms packed, in
> > sorted order?  Your "generic" layer needs to know how to sort these
> > Posting lists by term text, right?
> >
>
>  Yes.  But the generic layer is the semi-serialized export form RawPosting,
> not the abstract base class Posting.  (Would it be clearer if RawPosting was
> named "FlatPosting"?)
>
>   struct kino_Posting {
>       kino_VirtualTable* _;
>       kino_ref_t ref;
>       chy_u32_t doc_num;
>   };
>
>   struct kino_RawPosting {
>       kino_VirtualTable* _;
>       kino_ref_t ref;
>       chy_u32_t doc_num;
>       chy_u32_t freq;
>       chy_u32_t content_len;
>       chy_u32_t aux_len;
>       char blob[1];    /* flexible array */
>   };
>
> The last member of RawPosting, "blob", uses the C "flexible array" hack
> which takes advantage of C's lack of bounds checking.  A RawPosting consists
> of a single contiguous memory allocation, but the size of that allocation
> varies depending on how much information the object needs to carry.
>
> The "blob" is divided into two parts: content, and "aux".  Content is term
> text.  "aux" is encoded material which will be written to the postings file.
> The total allocation for any given RawPosting can be calculated like so:
>
>    offsetof(RawPosting, blob) + posting->content_len + posting->aux_len

Probably you also need to add in bytes lost to alignment requirements,
when content_len + aux_len doesn't "align" to a 4/8 byte boundary?
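
A minimal sketch of that calculation with padding folded in, assuming
a hypothetical helper name and a worst-case alignment of sizeof(void*)
(neither of which is from the KS source):

    #include <stddef.h>

    /* Hypothetical helper: total allocation for a RawPosting, rounded
     * up so consecutive allocations stay aligned.  Rounding to
     * sizeof(void*) is an assumption, not necessarily what KS does. */
    static size_t
    raw_posting_alloc_size(chy_u32_t content_len, chy_u32_t aux_len)
    {
        const size_t unpadded  = offsetof(struct kino_RawPosting, blob)
                               + content_len + aux_len;
        const size_t alignment = sizeof(void*);
        return (unpadded + alignment - 1) & ~(alignment - 1);
    }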

> Sorting is done by comparing first the "content" component of blob using
> memcmp(), then breaking ties with content_len and finally, doc_num.
>
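
A qsort-style comparator sketch of that ordering (my names; KS's
actual sort routine may differ):

    #include <string.h>

    /* Sketch only: order RawPostings by term text bytes, then by text
     * length, then by doc num. */
    static int
    compare_raw_postings(const void *va, const void *vb)
    {
        const struct kino_RawPosting *a
            = *(struct kino_RawPosting *const *)va;
        const struct kino_RawPosting *b
            = *(struct kino_RawPosting *const *)vb;
        const chy_u32_t min_len = a->content_len < b->content_len
                                ? a->content_len : b->content_len;
        const int comparison = memcmp(a->blob, b->blob, min_len);
        if (comparison != 0) { return comparison; }
        if (a->content_len != b->content_len) {
            return a->content_len < b->content_len ? -1 : 1;
        }
        if (a->doc_num == b->doc_num) { return 0; }
        return a->doc_num < b->doc_num ? -1 : 1;
    }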
> The write method invoked by PostingsWriter is the same --
> RawPost_Write_Record -- no matter what Posting subclass is assigned to the
> field.  The first part of the implementing function RawPost_write_record may
> look familiar to you, as it uses a close variant of Lucene's algo for
> encoding doc_num and freq together.  Note the last line, though:
>
>     void
>     RawPost_write_record(RawPosting *self, OutStream *outstream,
>                          u32_t last_doc_num)
>     {
>         const u32_t delta_doc = self->doc_num - last_doc_num;
>         char  *const aux_content = self->blob + self->content_len;
>
>         /* Write doc num, freq. */
>         if (self->freq == 1) {
>             const u32_t doc_code = (delta_doc << 1) | 1;
>             OutStream_Write_C32(outstream, doc_code);
>         }
>         else {
>             const u32_t doc_code = delta_doc << 1;
>             OutStream_Write_C32(outstream, doc_code);
>             OutStream_Write_C32(outstream, self->freq);
>         }
>
>         /* Print prepared data. */
>         OutStream_Write_Bytes(outstream, aux_content, self->aux_len);
>     }
>
> The idea is that information other than doc num and freq gets serialized
> during the export operation which creates the RawPosting -- then the last
> line of RawPost_write_record prints the pre-serialized data to the
> OutStream.
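
As an illustration, a position-bearing Posting subclass might
pre-encode its position deltas into "aux" at export time along these
lines -- purely a sketch, with an assumed C32-style wire format
(7 bits per byte, high bit meaning more bytes follow):

    /* Illustrative, not KS source: encode position deltas into the
     * "aux" region of blob so Write_Record only has to copy bytes. */
    static chy_u32_t
    encode_position_deltas(char *aux, const chy_u32_t *positions,
                           chy_u32_t freq)
    {
        chy_u32_t last_pos = 0;
        chy_u32_t len      = 0;
        for (chy_u32_t i = 0; i < freq; i++) {
            chy_u32_t delta = positions[i] - last_pos;
            while (delta > 0x7F) {      /* high bit: more bytes follow */
                aux[len++] = (char)((delta & 0x7F) | 0x80);
                delta >>= 7;
            }
            aux[len++] = (char)delta;
            last_pos = positions[i];
        }
        return len;    /* becomes the RawPosting's aux_len */
    }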

OK, I think I understand the distinction.  It sounds like the external
sorter deals with RawPostings and need not know any details of the aux
bytes being carried along.

But: why distinguish between Posting and RawPosting?  Who uses Posting
but not RawPosting?

Or maybe it's because you do two passes when indexing a document?
First "invert" it into an array of Posting instances, then make a
second pass across those Posting instances to coalesce multiple
occurrences of a term in one doc into RawPosting instances?

> > OK I now see that what we call Posting really should be called
> > PostingList: each instance of this class, in DW, tracks all documents
> > that contained that term.
> >
>
> I suppose that would be more accurate... though in KS, the "PostingList"
> class is the replacement for TermDocs, TermPositions, etc.

I see.  Though that is really a search-time vs indexing-time
distinction, referencing the same thing (a "posting list").  Maybe
RawPostingList, to make it clearer that these are the raw bytes for
many docs produced during indexing?

> > Whereas for KS, Posting is a single occurrence of term in a single doc, 
> > right?
> >
>
> It's one term matching one document.
>
> These are the definitions I'm using, as set out in
> KinoSearch::Docs::IRTheory at <http://xrl.us/bi8w2>:
>
>     * document - An atomic unit of retrieval.
>     * term - An attribute which describes a document.
>     * posting - One term indexing one document.
>     * term list - The complete list of terms which describe a document.
>     * posting list - The complete list of documents which a term indexes.
>     * inverted index - A data structure which maps from terms to documents.
>
>
>
> > Does a Posting contain all
> > occurrences of the term in the doc (multiple positions) or only one?
> >
>
> In KS, a Posting object can represent multiple occurrences of a term within
> a single document.  That was somewhat arbitrary on my part.

OK.  In Lucene, a Posting instance holds many documents.  And we write
that in a single pass through the terms in the document.

> > How do you do buffering/flushing?
> >
>
> RawPostings are fed into an object which handles external sorting.  The
> external sorter tracks RAM usage, and checks once per document whether it
> should flush a sorted run to temporary storage.
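
A hedged sketch of that per-document check -- the struct fields and
helper names here are invented for illustration, and
write_run_to_temp_storage() is hypothetical:

    #include <stdlib.h>

    /* Invented for illustration; KS's real sorter differs. */
    typedef struct SorterSketch {
        struct kino_RawPosting **buffer;    /* buffered postings */
        chy_u32_t                count;
        size_t                   ram_used;
        size_t                   ram_threshold;
    } SorterSketch;

    static void write_run_to_temp_storage(SorterSketch*); /* hypothetical */

    static void
    sorter_finish_doc(SorterSketch *sorter)
    {
        /* Once per document: flush a sorted run if over RAM budget. */
        if (sorter->ram_used > sorter->ram_threshold) {
            qsort(sorter->buffer, sorter->count,
                  sizeof(struct kino_RawPosting*),
                  compare_raw_postings);    /* comparator sketched above */
            write_run_to_temp_storage(sorter);
            sorter->count    = 0;
            sorter->ram_used = 0;
        }
    }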

Got it.  In Lucene we just flush a true segment when RAM is full.  For
early versions of LUCENE-843 I was flushing "intermediate" segments
(like KS's runs), but the cost of merging these runs wasn't that much
better than the cost of merging normal segments, and, it added
unexpected cost to close(), so I fell back to just flushing normal
segments.

> After the last document gets added, the external sorter flips to fetching
> mode.  The pre-sorted runs are merged into a sorted stream of RawPostings,
> and the postings and lexicon files get written out.

OK.
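
A sketch of what that fetch phase might look like -- sorter_fetch(),
LexWriter_add_term(), and OutStream_Tell() are assumed names here, not
confirmed KS API:

    /* Hypothetical: drain the merged, sorted stream of RawPostings,
     * starting a new lexicon entry whenever the term text changes. */
    static void
    write_postings_and_lexicon(SorterSketch *sorter,
                               OutStream *postings_out,
                               LexWriter *lex_writer)
    {
        struct kino_RawPosting *prev = NULL;
        struct kino_RawPosting *posting;
        chy_u32_t last_doc_num = 0;
        while ((posting = sorter_fetch(sorter)) != NULL) {
            if (prev == NULL
                || posting->content_len != prev->content_len
                || memcmp(posting->blob, prev->blob,
                          posting->content_len) != 0) {
                /* New term: record its text plus our postings-file
                 * position in the lexicon. */
                LexWriter_add_term(lex_writer, posting->blob,
                                   posting->content_len,
                                   OutStream_Tell(postings_out));
                last_doc_num = 0;
            }
            RawPost_write_record(posting, postings_out, last_doc_num);
            last_doc_num = posting->doc_num;
            prev = posting;
        }
    }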

How do multiple threads interact here?  Does each one work with its
own ExternalSorter, with all threads merge-sorting all of their runs
at the end?  Or do multiple threads feed a single external sorter?
How fine-grained does the thread synchronization need to be?

> > > Then, for search-time you have a PostingList class which takes the place 
> > > of
> > > TermDocs/TermPositions, and uses an underlying Posting object to read the
> > > file.  (PostingList and its subclasses don't know anything about file
> > > formats.)
> >
> > Wouldn't PostingList need to know something of the file format?  EG
> > maybe it's a sparse format (docID or gap encoded each time), or, it's
> > non-sparse (like norms, column-stride fields).
>
> Each subclass of Posting implements its own Read_Record() method.
> PostingList just hands off the InStream to its internal Posting object -- it
> doesn't know or care how much data from the InStream the Posting consumes
> with each Read_Record().
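
For instance, a minimal match-only subclass might decode the doc_code
format from above like this -- a sketch, assuming an InStream_Read_C32()
that mirrors OutStream_Write_C32():

    /* Illustrative: only the Posting subclass knows how many bytes
     * its record occupies; PostingList just hands the InStream over. */
    static void
    MatchPost_read_record(struct kino_Posting *self, InStream *instream)
    {
        const chy_u32_t doc_code  = InStream_Read_C32(instream);
        const chy_u32_t delta_doc = doc_code >> 1;
        if (!(doc_code & 1)) {
            (void)InStream_Read_C32(instream);  /* freq is present; a
                                                   match-only posting
                                                   discards it */
        }
        self->doc_num += delta_doc;
    }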

OK, it sounds like this means it's always one file per Posting
implementation?  EG, even if you wanted to interleave skip data with
frq data, you couldn't easily do that?  (I'm not sure this really
matters).

> > I think TermVectorsWriter should be seen as a consumer of the
> > "inversion" plugin API.  It's just that, unlike the frq/prx writer,
> > which flushes when RAM is full, the TermVectorsWriter flushes after
> > each doc.  Ie, the generic code does the inversion, feeding "you"
> > Posting occurrences, and "you" write this to a file however you want.
> >
>
> Yes, that's how it works in KS, too.  :)
>
> I haven't written a formal Inversion class yet, though.  Upon reflection,
> perhaps it would be better to have "Inversion" be specific to a single
> field, and "InvertedDocument" collect everything together?
>
>    public void addDocument(Document document) {
>       InvertedDocument invertedDoc = invert(document);
>       for (int i = 0; i < writers.length; i++) {
>          writers[i].addInvertedDocument(invertedDoc);
>       }
>    }
>
> I think that's better in KS, because I can rename "TokenBatch" -- the class
> that actually does the low-level inverting and that subclasses of Analyzer
> need to deal with -- to "Inversion".

I'd prefer in Lucene to do "inversion & writing" in one pass (so we
never explicitly materialize the full inverted document).  I'm
thinking that IndexWriter simply passes documents to one or more
DocConsumers.  StoredFieldsWriter is one such consumer.  Another one
is DocInverter, which goes field by field, inverting, and passing the
resulting terms to one or more InvertedDocConsumers, whose base class
has APIs for maintaining the Posting hash, managing writing bytes into
virtual streams (the byte slices) and reading them back (for
flushing), etc.
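
Rendered in the C idiom used elsewhere in this thread, the shape of
that consumer chain might look like the following -- illustrative
only, since the actual classes are Java:

    /* Stand-ins for the Java classes described above. */
    typedef struct Document Document;             /* opaque */
    typedef struct DocConsumer DocConsumer;
    struct DocConsumer {
        void (*add_document)(DocConsumer *self, Document *doc);
        DocConsumer *next;    /* e.g. StoredFieldsWriter, DocInverter */
    };

    static void
    process_document(DocConsumer *chain, Document *doc)
    {
        /* IndexWriter hands each document to every consumer; inversion
         * and writing happen inside DocInverter in a single pass. */
        for (DocConsumer *c = chain; c != NULL; c = c->next) {
            c->add_document(c, doc);
        }
    }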

TermVectorsWriter subclasses InvertedDocConsumer, and flushes after
every doc.  TermDictWriter (needs better name!) also subclasses
InvertedDocConsumer, and writes a term dict (tii/tis) with per-term
offsets into any number of "streams" (frq, prx, etc), optionally also
writing skip data, inlined or separately.

Mike
