Marvin Humphrey <[EMAIL PROTECTED]> wrote:

> On Apr 13, 2008, at 2:28 AM, Michael McCandless wrote:
>
> > > In fact, RawPosting "objects" aren't even malloc'd individually --
> > > they're assigned memory chunks by a custom MemoryPool object (which
> > > *does* have to maintain awareness of alignment issues). You can't do
> > > this:
> >
> > Right, it's that alignment waste that I'd think you need to "amortize"
> > onto the cost of each RawPosting.
>
> Worse than the alignment waste is that it's impossible to know just how
> much memory is required by each RawPosting before the export operation
> completes, but you have to allocate it up front. That means planning for
> a worst-case scenario. It's analogous to packing UTF-8 from UTF-16
> source.
>
> The way I hacked it was to always ask for the worst-case amount when
> allocating each RawPosting, but allow quick resizing via an immediate
> call to MemPool_Resize().

But that resizing requires copying bytes, right?

> I'm perfectly comfortable with such hacks when they are contained to a
> small scope, but this one's not: each Posting subclass has to deal with
> the MemPool_Resize API or waste memory -- theoretically including public
> subclasses. :(  But perhaps we can do better...

We have this even worse in Lucene, because we keep a Posting
(RawPostingList) per-term "open" for many docs (until we flush). This is
why we use shared byte[] slices: they let us quickly "grow" the byte blob
as needed without wasting too many bytes, and no extra copying is done
(until it's time to flush). Whereas KS "closes" the RawPosting for each
unique term in the document, after the document is done being indexed. Is
that how you get your "worst case"? (Because you know how many occurrences
a given term X has in the document.)
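To make the slice idea concrete, here's a stripped-down Java sketch
(invented names; this is not the actual DocumentsWriter code, just the
idea): a posting's bytes live in a chain of slices carved out of one big
shared buffer, and when a slice fills up its last four bytes become a
forward address pointing at the next, larger slice.

    // Rough sketch of the shared-slice idea (hypothetical names).  Nothing
    // already written is ever copied when a posting needs more room.
    class SlicePool {
        static final int[] SLICE_SIZES = {8, 16, 32, 64, 128, 256};

        byte[] buf = new byte[1 << 16];  // one shared buffer for many postings
        int used = 0;                    // next free byte in buf

        /** Carve a new slice out of the buffer; returns its start offset. */
        int newSlice(int level) {
            int start = used;
            used += SLICE_SIZES[level];   // (no overflow handling in this sketch)
            return start;
        }

        /**
         * Called when a writer reaches the last 4 bytes of its current slice:
         * allocate the next (larger) slice, record a forward pointer in those
         * 4 bytes, and return where the writer should keep appending.
         */
        int growSlice(int sliceEnd, int level) {
            int nextLevel = Math.min(level + 1, SLICE_SIZES.length - 1);
            int next = newSlice(nextLevel);
            writeInt(sliceEnd - 4, next);  // readers follow this chain at flush time
            return next;
        }

        private void writeInt(int pos, int v) {
            buf[pos] = (byte) (v >>> 24);
            buf[pos + 1] = (byte) (v >>> 16);
            buf[pos + 2] = (byte) (v >>> 8);
            buf[pos + 3] = (byte) v;
        }
    }

Because slice sizes grow but are capped, a rare term wastes at most a
handful of bytes, while a frequent term keeps appending without ever
recopying what it already wrote.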
> > > Flexible arrays, which are a lesser-known but reliable hack, have to
> > > be the last element in a C struct:
> > >
> > >     http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
> >
> > Alas we can't use such neat tricks in Javaland...
>
> Heh. It works, but it would be better if it weren't there. And I think
> there is a solution which will be elegant both in Java and in KS. Let's
> consider a new private class, FlatPosting, intended to take the place of
> RawPosting:
>
>     struct kino_FlatPosting {
>         kino_VirtualTable *_;
>         kino_ref_t ref;
>         chy_u32_t doc_num;
>         char *content;
>         size_t content_len;
>         char *blob;
>         size_t blob_len;
>     };
>
> FlatPosting's "content" and "blob" members will be pointers into shared
> buffers. In Java, they would be array offsets instead. Unlike RawPosting
> objects, each FlatPosting is the same size -- no flexible array members.
>
> FlatPostings will be pooled within PostingPool parent objects, which will
> own the shared buffers. Because they are pooled and reused, allocation
> overhead is no longer crucial and FlatPostings can be ordinary objects.
> When the shared buffer gets full, the PostingPool will flush a sorted run
> to disk.
>
> Only PostingPool, PostingPoolQueue (the external sorter), and
> PostingsWriter will have access to the FlatPostings -- it's best to limit
> FlatPosting's visibility because managing access to the shared buffer is
> tricky.
>
> PostingPool will be public. Subclasses of Posting will contribute to the
> PostingPool via a PostPool_Add() method:
>
>     void
>     PostPool_add(PostingPool *self, u32_t doc_num,
>                  char *content, size_t content_len,
>                  char *blob, size_t blob_len);

I like this approach, but I don't like that it requires an instance of
FlatPosting per document X term. In Lucene, we only need one instance of
our Posting (to be renamed RawPostingList) per term since the last flush.

> Hmm, we'll need some way to track field as well. There's a couple of
> ways of doing that...

In Lucene we hold each field's postings hash separately, i.e. a separate
class (DocumentsWriterFieldData) per thread X field.
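The per-field hash is roughly this shape (again a made-up sketch, one
instance per thread X field; the real class and its byte encoding differ):

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: each unique term gets a single small record that
    // lives until the segment is flushed; the per-occurrence data (doc
    // deltas, positions) would be appended to a slice chain like the one
    // sketched above.
    class FieldPostingsHash {
        static class PostingRecord {
            int lastDocNum = -1;   // last doc this term appeared in
            int docFreq;           // number of docs seen so far
            int sliceOffset;       // where this term's occurrence bytes begin
        }

        final Map<String, PostingRecord> postings =
            new HashMap<String, PostingRecord>();

        /** Record one occurrence of a term in the current document. */
        void addOccurrence(String termText, int docNum, int position) {
            PostingRecord p = postings.get(termText);
            if (p == null) {              // first time seen since the last flush
                p = new PostingRecord();
                postings.put(termText, p);
            }
            if (p.lastDocNum != docNum) { // first occurrence in this document
                p.lastDocNum = docNum;
                p.docFreq++;
            }
            // appending docNum/position deltas into the byte slices is
            // omitted to keep the sketch short
        }
    }

The hash entry is the only per-term object; everything per-occurrence goes
into the shared byte slices, which is why we don't need anything like a
FlatPosting per document X term.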
> > > I wish RawPosting's design was a little less cryptic. However,
> > > malloc() and free(), used by the normal object allocation
> > > infrastructure in KS, are comparatively expensive. Using a custom
> > > allocation strategy speeds things up, at the cost of some code
> > > clarity -- essentially the same tradeoff that started this thread.
> >
> > Makes sense.
>
> Well, here's the thing. RawPosting works in KS and it's fast, but it's a
> hack -- good enough for an implementation detail, but not good enough to
> expose as a public API. And like Lucene's DocumentsWriter, KinoSearch's
> PostingsWriter is large and difficult to penetrate.
>
> What I'm hoping is that this discussion will yield some innovations that
> allow me to refactor this code for clarity and interface simplicity.
> There's still going to be a lot of code, but hopefully it can be better
> broken up into more coherent modules.
>
> I particularly want to see Posting as small as possible, because that's
> the one I'd like to be publicly subclassable.

I'd like the same in Lucene, as we move to modular indexing in
DocumentsWriter (a first step towards flexible indexing, I think).

> FWIW, refactoring the external sorter has been pretty successful:
>
>     KS::Util::SortExternal
>     KS::Util::SortExRun
>     KS::Test::Util::BBSortEx       extends SortExternal
>     KS::Test::Util::BBSortExRun    extends SortExRun
>     KS::Index::PostingPoolQueue    extends SortExternal
>     KS::Index::PostingPool         extends SortExRun
>
> SortExternal and SortExRun are abstract base classes. BBSortEx and
> BBSortExRun, which deal with ByteBufs, are test-only subclasses that air
> out the merging code. PostingPoolQueue and PostingPool deal with
> RawPosting objects.
>
> PostingPoolQueue is nice and small now, and SortExternal and SortExRun
> are pretty coherent and well tested. The only one left in that hierarchy
> that's kind of bloated and fragile is PostingPool.

Nice.

> > I see. So Posting exposes the doc_num so the generic code
> > (SegPList_next) can check if the doc is deleted when iterating.
>
> Yes. It would be nice to make it an accessor, but the PList_Next()
> method is inner loop code.

OK.

> > What is self->doc_base? In Lucene, each segment doesn't "know" its
> > doc base; MultiSegmentReader handles that "above" this layer.
>
> Right, I had to move that down. The reason is that Posting objects get
> released into the wild, and once that happens they can't ask their parent
> container object about the offset any more. With a TermDocs object, you
> can only access the doc number through the TermDocs itself...
>
>     while (termDocs.next()) {
>         int docNum = termDocs.doc();
>         ...
>     }
>
> ... but with a PostingList, you can access the doc number through the
> Posting:
>
>     while (postingList.next()) {
>         Posting posting = postingList.getPosting();
>         int docNum = posting.getDocNum();
>         ...
>     }

This would actually cause us trouble in Lucene. With the new
IndexReader.reopen(), multiple readers can share a single underlying
SegmentReader. Often such sharing would be with the same base, but not
always: for example, IndexWriter's new expungeDeletes() method can shift
the base down for such a shared segment. A reopen would then need to use
the new base (while the old IndexReader must stick with the old base).

> > So, first a Document is translated into an array of Tokens (one per
> > term occurrence). Then this array gets sorted by term text, then term
> > position. Then (2nd pass) you coalesce all adjacent Tokens that share
> > the same term text and write each group into a RawPosting. Then this
> > array of RawPostings is passed to the external sorter.
>
> Yes.
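Just so we're talking about the same thing, here's how I picture that
per-document pass, as a Java sketch (invented names; the real KS code is C,
and fields, flags, etc. are left out):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Sketch of the per-document "second pass": sort this doc's tokens by
    // (text, position), then walk the sorted array coalescing runs of equal
    // text into one flat posting per unique term.
    class Inverter {
        static class Token {
            final String text;
            final int position;
            Token(String text, int position) {
                this.text = text;
                this.position = position;
            }
        }

        static class FlatPosting {
            final String text;
            final int docNum;
            final List<Integer> positions = new ArrayList<Integer>();
            FlatPosting(String text, int docNum) {
                this.text = text;
                this.docNum = docNum;
            }
        }

        static List<FlatPosting> invert(Token[] tokens, int docNum) {
            Arrays.sort(tokens, (a, b) -> {
                int cmp = a.text.compareTo(b.text);          // primary: term text
                return cmp != 0 ? cmp : Integer.compare(a.position, b.position);
            });
            List<FlatPosting> postings = new ArrayList<FlatPosting>();
            FlatPosting current = null;
            for (Token t : tokens) {
                if (current == null || !current.text.equals(t.text)) {
                    current = new FlatPosting(t.text, docNum);  // new unique term
                    postings.add(current);
                }
                current.positions.add(t.position);              // coalesce occurrence
            }
            return postings;   // handed off to the external sorter
        }
    }

The resulting per-document arrays are then what get enrolled in the
PostingPool and merge-sorted into runs.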
> You know what I'd really like to do though? Kill off Token and replace
> it with Posting.

Neat! Meaning analyzers would produce streams of Posting instances? Though
then your Posting would need term text.

> > In Lucene we take each token and do a hash lookup and then write the
> > necessary bytes immediately. We then only sort, once, during flush.
>
> I'm a little confused. Are you saying you have a trick where Lucene
> sorts less often? What you refer to as a "second pass" in KS doesn't
> involve sorting (but it looks like you grok that?). Also, KS doesn't use
> hashing to group Tokens.

Sorry, I was confused too. In fact maybe I still am? Here's my thinking:

You're right that KS's 2nd pass (which converts Posting -> RawPosting) does
not need sorting, since you just coalesce adjacent tokens. Still, you need
to compare adjacent token text for equality, which adds extra cost: your
qsort() had already done this comparing, and many of these comparisons are
equal (so they must run the full length of the token's text). One thing you
could do is intern your token text as ints, and then do int == comparison?
This would save RAM too, since (I think?) you are storing the same text
from different docs multiple times.

KS does do more sorting than Lucene. First, you sort each document's Tokens
by text then position. Then you coalesce into RawPosting instances (one per
unique token text, per doc). These are enrolled into the PostingPool, and
when it's full you do a massive merge sort of each doc's RawPosting
instances to disk. Finally, another merge sort of these runs makes the
final segment.

Whereas Lucene keeps one big hash, keyed by token text, holding all
documents & their positions where this token text occurred. We sort that
only when we flush a segment. Except when term vectors are on, in which
case we must also sort all unique token texts per document. But that sort
is not also sorting position occurrences, so it's somewhat less costly than
KS's per-document sort.

> We discussed this algo once before. You and Google have good memory ;)

Sheesh, that was a bit over a year ago now! Time flies. After my initial
reaction, and then further testing of that approach (i.e. a document turns
into a sorted "run" & runs are merge-sorted together over time) vs the
persistent-hash approach, I found the persistent hash was in fact faster,
so I went with it. Here's the post from that:

    https://issues.apache.org/jira/browse/LUCENE-843?focusedCommentId=12492650#action_12492650

I think those performance gains were because the persistent-hash approach
requires less sorting & less re-processing of already written bytes.

> > (Except, term vectors still must sort every document's Tokens.)
>
> In KS, TermVectorsWriter gets passed the already-inverted array of
> Tokens. This object, currently named TokenBatch but perhaps to be
> renamed Inversion, is what Analyzers operate on in KS.

Inverted meaning sorted, right? I'm trying to work out how best to share &
genericize Lucene's inversion code. TermVectors consumes the inverted
terms, puts them in a hash, and writes into byte slices, but flushes per
document instead of per segment. So it's the same generic code, just
different times for flushing.

> > Instead, segments are now allowed to share these "doc stores". Each
> > segment can reference another segment's doc stores, starting at its
> > own offset.
>
> Mmm, that's clever. :)  I wish it didn't break the segmentation model,
> but it's a sensible optimization -- especially because you write
> segments so much more often.

Right, it's a tradeoff. But it's nice because it gives us something
similar to KS's approach, except we don't have to do the final costly
merge step that KS does to merge runs into a single segment. In effect
each of our runs *is* a segment. It also means that close() is not
unexpectedly costly, at least for this reason (close still waits for
merges to finish, so it can still be costly). Of course, then we also have
more merging to do than KS...
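The bookkeeping for that sharing is tiny -- conceptually something like
this (a simplified sketch with invented names, not the actual Lucene code):
each segment records which doc store it reads from and at what offset, so
resolving a stored document is just an addition.

    // Simplified sketch (invented names) of a segment pointing into a
    // shared doc store: stored fields for this segment's doc N live at
    // slot docStoreOffset + N in the named doc store files.
    class SegmentDocStoreRef {
        final String docStoreName;   // which shared stored-fields/vectors files to read
        final int docStoreOffset;    // where this segment's docs begin inside them

        SegmentDocStoreRef(String docStoreName, int docStoreOffset) {
            this.docStoreName = docStoreName;
            this.docStoreOffset = docStoreOffset;
        }

        /** Map a segment-local doc number to its slot in the shared doc store. */
        int storedDocSlot(int localDocNum) {
            return docStoreOffset + localDocNum;
        }
    }

    // e.g. three small segments flushed back-to-back could all reference
    // the same doc store:  ("_0", 0), ("_0", 1000), ("_0", 2500)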
> > > KS doesn't presently work with threads -- Perl's threading model is
> > > kind of awkward.
> >
> > Oooh, ouch: the world is quickly going multi-core/CPU. I think thread
> > concurrency is very important. But this sounds like a Perl threading
> > limitation, imposed on you.
>
> Yeah. I'd say it's a higher priority to finish the C API and get POSIX
> threads working than it is to get Perl threads working.

Agreed. How is Lucy going?

> > OK, this sounds like it'd have good concurrency. Each thread would
> > accumulate many docs' worth of RawPostings arrays. Though, holding the
> > global lock while doing the merge sort into the PostingPoolQueue's
> > cache might become a bottleneck? How often does that happen and what
> > net percentage of the time does it normally take?
>
> It happens at the end of an indexing session, after the last doc has
> been processed, when $invindexer->finish gets called -- the analogue to
> IndexWriter.close().

Got it. Lucene does the analogous thing here: each thread builds up its own
postings hash, and only on flushing a segment do we acquire a global lock
and merge them.

> (; Probably I should put it into $invindexer->prepare_commit. ;)

You've been paying attention ;)  That one is fun...

> The amount of time it takes to finish a segment varies a lot. I haven't
> benchmarked it rigorously, but my guess is that it averages between 5
> and 25 percent of total time.

Yeah, for Lucene it's just folded into merging. But merging is very
costly. It's a good thing we now have ConcurrentMergeScheduler.

> > In Lucene, each thread writes into its own private RawPosting
> > equivalent (stored in a thread state class), until it's time to flush
> > a segment, at which point we do a merge sort of all thread states
> > while holding the global lock.
>
> Sounds like about the same thing.

Yes.

> When a PostingPool flushes a sorted run to temp storage, it actually
> uses the real file format. And when it's time to merge an existing
> segment, a PostingPool gets assigned to it and reads it back in. So,
> when the external sorting apparatus goes into fetch mode, the process of
> recovering RawPosting/FlatPosting objects is essentially the same
> whether they're being recovered from an existing segment or from
> temporary storage.

Very clean!

> > We actually lose some concurrency here: if a large doc is being
> > indexed by one thread, and a tiny doc by another thread, that tiny doc
> > can't write until the large doc finishes, and this ties up the tiny
> > doc's thread state such that if that thread tries to index another doc
> > it will have to wait until the previous doc gets written.
>
> Writing stored fields shouldn't be CPU intensive, though. My guess is
> that you're i/o bound at that point.

Actually, translating fields into a byte blob is done concurrently (same
with term vectors). It's the flushing of those bytes that can only be done
at the right time, holding the global lock, because we append them to the
forward doc store files. So when docs finish indexing "out of order" (due
to luck-of-scheduler or size-of-document or both), there is a queue holding
these byte blobs waiting their turn. That queue is not correctly done now
in Lucene, because while a doc is waiting in it, the ThreadState is tied
up, so the next document that needs indexing under that thread must wait
until the previous doc for that thread state gets written. It's just a
silly limitation that I'd like to fix at some point.
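In sketch form, the constraint looks like this (invented names, heavily
simplified): inversion can run in any order, but the byte blobs have to hit
the doc store files in docNum order, so early finishers park in a queue.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    // Sketch of the ordering constraint on the shared doc stores: documents
    // are inverted concurrently, but their stored-field byte blobs must be
    // appended in docNum order, so blobs that finish early wait their turn.
    class DocStoreAppender {
        private final TreeMap<Integer, byte[]> waiting = new TreeMap<Integer, byte[]>();
        private int nextDocToWrite = 0;
        private final List<byte[]> docStoreFile = new ArrayList<byte[]>();  // stands in for the real file

        /** Called (under the global lock) when a document finishes inverting. */
        synchronized void documentFinished(int docNum, byte[] storedFieldBlob) {
            waiting.put(docNum, storedFieldBlob);
            // Drain every blob that is now next in line; an out-of-order
            // finisher simply stays parked until the gap before it is filled.
            while (!waiting.isEmpty() && waiting.firstKey() == nextDocToWrite) {
                docStoreFile.add(waiting.pollFirstEntry().getValue());
                nextDocToWrite++;
            }
        }
    }

Decoupling that waiting queue from the ThreadState is presumably what it
would take to lift the limitation described above.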
Mike