On Apr 13, 2008, at 2:28 AM, Michael McCandless wrote:
In fact, RawPosting "objects" aren't even malloc'd individually --
they're assigned memory chunks by a custom MemoryPool object (which
*does* have to maintain awareness of alignment issues). You can't do
this:
Right, it's that alignment waste that I'd think you need to "amortize"
onto the cost of each RawPosting.
Worse than the alignment waste is that it's impossible to know just
how much memory each RawPosting requires until the export operation
completes, but you have to allocate it up front. That means planning
for a worst-case scenario. It's analogous to packing UTF-8 from UTF-16
source.
The way I hacked it was to always ask for the worst-case amount when
allocating each RawPosting, but allow quick resizing via an immediate
call to MemPool_Resize().
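A rough sketch of the shape of that hack (names and layout here are
illustrative -- this is a toy bump allocator, not the real KS
MemoryPool):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy bump allocator standing in for KS's MemoryPool -- the names and
 * layout are illustrative, not the real KinoSearch implementation. */
typedef struct {
    char  *buf;
    size_t cap;
    size_t used;
    char  *last;   /* most recent allocation, the only one resizable */
} MemPool;

char*
MemPool_Grab(MemPool *self, size_t amount)
{
    if (self->used + amount > self->cap) return NULL; /* real pool grows */
    self->last  = self->buf + self->used;
    self->used += amount;
    return self->last;
}

/* Trim (or grow, space permitting) the most recent allocation.  This
 * is what makes "ask for the worst case, then shrink" cheap: it's just
 * pointer arithmetic, no copying or freeing. */
void
MemPool_Resize(MemPool *self, char *ptr, size_t new_size)
{
    assert(ptr == self->last);  /* only the last grab may be resized */
    self->used = (size_t)(self->last - self->buf) + new_size;
}
```

The catch, as noted below, is that the resize has to happen immediately,
before the next grab -- which is exactly the API awkwardness that leaks
into Posting subclasses.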
I'm perfectly comfortable with such hacks when they are contained to a
small scope, but this one's not: each Posting subclass has to deal
with the MemPool_Resize API or waste memory -- theoretically including
public subclasses. :( But perhaps we can do better...
Flexible arrays, which are a lesser-known but reliable hack, have to
be the last element in a C struct:
http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
Alas we can't use such neat tricks in Javaland...
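For anyone following along, the trick looks roughly like this
(hypothetical struct and field names, not the actual RawPosting
definition):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical RawPosting-style struct using a flexible array member:
 * the variable-length payload shares a single allocation with the
 * fixed-size header fields.  Names are illustrative only. */
typedef struct {
    unsigned doc_num;
    size_t   content_len;
    char     content[];   /* flexible array member: must come last */
} VarPosting;

VarPosting*
VarPosting_new(unsigned doc_num, const char *content, size_t len)
{
    /* One malloc covers header plus payload -- no second allocation,
     * no pointer chasing. */
    VarPosting *self = malloc(sizeof(VarPosting) + len);
    self->doc_num     = doc_num;
    self->content_len = len;
    memcpy(self->content, content, len);
    return self;
}
```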
Heh. It works, but it would be better if it weren't there. And I
think there is a solution which will be elegant both in Java and in
KS. Let's consider a new private class, FlatPosting, intended to take
the place of RawPosting:
struct kino_FlatPosting {
    kino_VirtualTable *_;
    kino_ref_t         ref;
    chy_u32_t          doc_num;
    char              *content;
    size_t             content_len;
    char              *blob;
    size_t             blob_len;
};
FlatPosting's "content" and "blob" members will be pointers into
shared buffers. In Java, they would be array offsets instead. Unlike
RawPosting objects, each FlatPosting is the same size -- no flexible
array members.
FlatPostings will be pooled within PostingPool parent objects, which
will own the shared buffers. Because they are pooled and reused,
allocation overhead is no longer crucial and FlatPostings can be
ordinary objects. When the shared buffer gets full, the PostingPool
will flush a sorted run to disk.
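Something like this, in miniature (sizes, names, and the full/flush
signaling are all placeholders for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_BUF_SIZE     4096
#define POOL_MAX_POSTINGS 64

/* Illustrative sketch only: FlatPostings are fixed-size and borrow
 * views into a buffer owned by the pool.  In Java, content would be
 * an array offset instead of a pointer. */
typedef struct {
    unsigned    doc_num;
    const char *content;      /* points into the pool's shared buffer */
    size_t      content_len;
} FlatPosting;

typedef struct {
    char        buf[POOL_BUF_SIZE];
    size_t      used;
    FlatPosting postings[POOL_MAX_POSTINGS];  /* pooled and reused */
    size_t      count;
} PostingPool;

/* Returns 0 when the buffer or pool is full -- the real PostingPool
 * would flush a sorted run to disk at that point and reset. */
int
PostingPool_Add(PostingPool *self, unsigned doc_num,
                const char *content, size_t len)
{
    if (self->used + len > POOL_BUF_SIZE
        || self->count >= POOL_MAX_POSTINGS
    ) {
        return 0;
    }
    FlatPosting *posting = &self->postings[self->count++];
    memcpy(self->buf + self->used, content, len);
    posting->doc_num     = doc_num;
    posting->content     = self->buf + self->used;
    posting->content_len = len;
    self->used += len;
    return 1;
}
```

Because every FlatPosting is the same size and borrows rather than
owns its content, allocation cost drops out of the inner loop entirely.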
Only PostingPool, PostingPoolQueue (the external sorter), and
PostingsWriter will have access to the FlatPostings -- it's best to
limit FlatPosting's visibility because managing access to the shared
buffer is tricky.
PostingPool will be public. Subclasses of Posting will contribute to
the PostingPool via a PostPool_Add() method.
void
PostPool_Add(PostingPool *self, u32_t doc_num,
             char *content, size_t content_len,
             char *blob, size_t blob_len);
Hmm, we'll need some way to track field as well. There's a couple of
ways of doing that...
I wish RawPosting's design was a little less cryptic. However,
malloc() and free(), used by the normal object allocation
infrastructure in KS, are comparatively expensive. Using a custom
allocation strategy speeds things up, at the cost of some code
clarity -- essentially the same tradeoff that started this thread.
Makes sense.
Well, here's the thing. RawPosting works in KS and it's fast, but
it's a hack -- good enough for an implementation detail, but not good
enough to expose as a public API. And like Lucene's DocumentsWriter,
KinoSearch's PostingsWriter is large and difficult to penetrate.
What I'm hoping is that this discussion will yield some innovations
that allow me to refactor this code for clarity and interface
simplicity. There's still going to be a lot of code, but hopefully it
can be better broken up into more coherent modules.
I particularly want to see Posting as small as possible, because
that's the one I'd like to be publicly subclassable.
FWIW, refactoring the external sorter has been pretty successful:
KS::Util::SortExternal
KS::Util::SortExRun
KS::Test::Util::BBSortEx extends SortExternal
KS::Test::Util::BBSortExRun extends SortExRun
KS::Index::PostingPoolQueue extends SortExternal
KS::Index::PostingPool extends SortExRun
SortExternal and SortExRun are abstract base classes. BBSortEx and
BBSortExRun, which deal with ByteBufs, are test-only subclasses that
air out the merging code. PostingPoolQueue and PostingPool deal with
RawPosting objects.
PostingPoolQueue is nice and small now, and SortExternal and SortExRun
are pretty coherent and well tested. The only one left in that
hierarchy that's kind of bloated and fragile is PostingPool.
I see. So Posting exposes the doc_num so the generic code
(SegPList_next) can check if the doc is deleted when iterating.
Yes. It would be nice to make it an accessor, but the PList_Next()
method is inner loop code.
What is self->doc_base? In Lucene, each segment doesn't "know" its
doc base; MultiSegmentReader handles that "above" this layer.
Right, I had to move that down. The reason is that Posting objects
get released into the wild, and once that happens they can't ask their
parent container object about the offset any more. With a TermDocs
object, you can only access the doc number through the TermDocs
itself...
while (termDocs.next()) {
    int docNum = termDocs.doc();
    ...
}
... but with a PostingList, you can access the doc number through the
Posting:
while (postingList.next()) {
    Posting posting = postingList.getPosting();
    int docNum = posting.getDocNum();
    ...
}
So, first a Document is translated into an array of Tokens (one per
term occurrence). Then this array gets sorted by term text, then term
position. Then (2nd pass) you coalesce all adjacent Tokens that share
the same term text and write each into a RawPosting. Then this array
of RawPostings is passed to the external sorter.
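The sort-then-coalesce step can be sketched like so (Token here is a
toy stand-in for the real Token class, and I'm only counting the runs
that would each become one RawPosting):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the two-pass inversion described above: sort Tokens by
 * term text, then by position; runs of equal text then sit adjacent
 * and can be coalesced.  Toy types, not the real KS classes. */
typedef struct {
    const char *text;
    int         pos;
} Token;

static int
compare_tokens(const void *va, const void *vb)
{
    const Token *a = va;
    const Token *b = vb;
    int cmp = strcmp(a->text, b->text);
    return cmp ? cmp : a->pos - b->pos;
}

/* Sort, then count distinct terms -- each run of equal texts would be
 * written out as one RawPosting carrying all of its positions. */
size_t
coalesce_count(Token *tokens, size_t num)
{
    size_t runs = 0;
    size_t i;
    qsort(tokens, num, sizeof(Token), compare_tokens);
    for (i = 0; i < num; i++) {
        if (i == 0 || strcmp(tokens[i].text, tokens[i - 1].text) != 0) {
            runs++;
        }
    }
    return runs;
}
```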
Yes.
You know what I'd really like to do though? Kill off Token and
replace it with Posting.
In Lucene we take each token and do a hash lookup and then write the
necessary bytes immediately. We then only sort, once, during flush.
I'm a little confused. Are you saying you have a trick where Lucene
sorts less often? What you refer to as a "second pass" in KS doesn't
involve sorting (but it looks like you grok that?). Also, KS doesn't
use hashing to group Tokens.
We discussed this algo once before.
<http://www.archivum.info/java-dev@lucene.apache.org/2007-04/msg00091.html>
Your reaction:
OOOOOH! I like that approach!
So you basically do not "de-dup" by field+term on your first pass
through the tokens in the doc (which is "roughly" what that hash
does). Instead, append all tokens in an array, then sort first by
field+text and second by position? This is done for each document
right?
This seems like it could be a major win!
:)
(Except, term vectors still must sort every document's Tokens.)
In KS, TermVectorsWriter gets passed the already-inverted array of
Tokens. This object, currently named TokenBatch but perhaps to be
renamed Inversion, is what Analyzers operate on in KS.
Instead, segments are now allowed to share these "doc stores". Each
segment can reference another segment's doc stores, starting at its
own offset.
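My reading of the scheme, in miniature (toy struct, not Lucene's actual
SegmentInfo): each segment records which shared store it points at and
the offset of its first document within it.

```c
#include <assert.h>

/* Toy model of shared doc stores: several segments reference one
 * shared stored-fields file, each recording the offset of its first
 * document within that store.  Field names are illustrative. */
typedef struct {
    int doc_store_id;      /* which shared stored-fields file */
    int doc_store_offset;  /* this segment's first doc in that file */
    int num_docs;
} SegmentInfo;

/* Translate a segment-local doc number into an index within the
 * shared doc store. */
int
store_doc_index(const SegmentInfo *seg, int local_doc)
{
    return seg->doc_store_offset + local_doc;
}
```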
Mmm, that's clever. :) I wish it didn't break the segmentation model,
but it's a sensible optimization -- especially because you write
segments so much more often.
KS doesn't presently work with threads -- Perl's threading model is
kind of
awkward.
Oooh, ouch: the world is quickly going multi-core/CPU. I think thread
concurrency is very important. But this sounds like a Perl threading
limitation, imposed on you.
Yeah. I'd say it's a higher priority to finish the C API and get
POSIX threads working than it is to get Perl threads working.
OK this sounds like it'd have good concurrency. Each thread would
accumulate many docs worth of RawPostings arrays. Though, holding the
global lock while doing the merge sort into the PostingPoolQueue's
cache might become a bottleneck? How often does that happen and what
net percentage of the time does it normally take?
It happens at the end of an indexing session, after the last doc has
been processed, when $invindexer->finish gets called -- the analogue
to IndexWriter.close(). (Probably I should put it into
$invindexer->prepare_commit.)
The amount of time it takes to finish a segment varies a lot. I
haven't benchmarked it rigorously, but my guess is that it averages
between 5 and 25 percent of total time.
In Lucene, each thread writes into its own private RawPosting
equivalent (stored in a thread state class), until it's time to flush
a segment, at which point we do a merge sort of all thread states
while holding the global lock.
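The flush-time merge under the global lock is just a k-way merge of
sorted per-thread runs, roughly (the real code merges postings, not
bare doc numbers -- this only shows the shape of the operation):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Each thread state holds a sorted run; under the global lock we pick
 * the smallest head repeatedly.  Toy structs, illustrative names. */
typedef struct {
    const int *docs;   /* sorted doc numbers accumulated by one thread */
    size_t     len;
    size_t     pos;    /* merge cursor */
} ThreadState;

size_t
merge_thread_states(ThreadState *states, size_t num_states,
                    int *out, size_t out_cap)
{
    size_t written = 0;
    while (written < out_cap) {
        int    best_doc = INT_MAX;
        size_t best     = num_states;  /* sentinel: nothing found yet */
        size_t i;
        for (i = 0; i < num_states; i++) {
            ThreadState *ts = &states[i];
            if (ts->pos < ts->len && ts->docs[ts->pos] < best_doc) {
                best_doc = ts->docs[ts->pos];
                best     = i;
            }
        }
        if (best == num_states) break;  /* all runs exhausted */
        out[written++] = best_doc;
        states[best].pos++;
    }
    return written;
}
```

A heap would replace the linear scan once the thread count grows, but
for a handful of threads the scan is fine.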
Sounds like about the same thing.
When a PostingPool flushes a sorted run to temp storage, it actually
uses the real file format. And when it's time to merge an existing
segment, a PostingPool gets assigned to it and reads it back in. So,
when the external sorting apparatus goes into fetch mode, the process
of recovering RawPosting/FlatPosting objects is essentially the same
whether they're being recovered from an existing segment or from
temporary storage.
We actually lose some concurrency here: if a large doc is being
indexed by one thread, and a tiny doc by another thread, that tiny doc
can't write until the large doc finishes, and this ties up the tiny
doc's thread state such that if that thread tries to index
another doc it will have to wait until the previous doc gets written.
Writing stored fields shouldn't be CPU intensive, though. My guess is
that you're i/o bound at that point.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/