On Apr 13, 2008, at 2:28 AM, Michael McCandless wrote:

In fact, RawPosting "objects" aren't even malloc'd individually -- they're assigned memory chunks by a custom MemoryPool object (which *does* have to
maintain awareness of alignment issues).  You can't do this:

Right, it's that alignment waste that I'd think you need to "amortize"
onto the cost of each RawPosting.

Worse than the alignment waste is that it's impossible to know just how much memory each RawPosting requires before the export operation completes, but you have to allocate it up front. That means planning for a worst-case scenario. It's analogous to packing UTF-8 from UTF-16 source.

The way I hacked it was to always ask for the worst-case amount when allocating each RawPosting, but allow quick resizing via an immediate call to MemPool_Resize().

I'm perfectly comfortable with such hacks when they are contained to a small scope, but this one's not: each Posting subclass has to deal with the MemPool_Resize API or waste memory -- theoretically including public subclasses. :( But perhaps we can do better...
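
To make that concrete, here's a toy sketch of the worst-case-then-shrink pattern. Everything below (names, layout) is invented for illustration rather than lifted from the real KS MemoryPool:

   /* Toy memory pool: hands out chunks from one big preallocated block.
    * A real pool also has to round `used` up to keep allocations aligned,
    * which is where the alignment waste mentioned above comes from. */
   #include <stddef.h>

   typedef struct {
       char   *buf;     /* one big block */
       size_t  cap;
       size_t  used;    /* offset of the next free byte */
   } ToyMemPool;

   /* Grab `worst_case` bytes before we know how much we really need. */
   static void*
   ToyMemPool_Grab(ToyMemPool *pool, size_t worst_case)
   {
       void *slot = pool->buf + pool->used;
       pool->used += worst_case;        /* assume it fits, in this sketch */
       return slot;
   }

   /* Give back the unused tail once the real size is known.  Only valid
    * immediately after Grab, while `ptr` is still the newest allocation --
    * which is exactly what makes the API easy to misuse. */
   static void
   ToyMemPool_Resize(ToyMemPool *pool, void *ptr, size_t actual)
   {
       pool->used = (size_t)((char*)ptr - pool->buf) + actual;
   }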

Flexible arrays, which are a lesser-known but reliable hack, have to be the
last element in a C struct:

 http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
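
For anyone who hasn't bumped into the idiom, it looks roughly like this -- an illustrative struct, not the real RawPosting definition:

   #include <stdlib.h>
   #include <string.h>

   typedef struct {
       unsigned doc_num;
       unsigned freq;
       size_t   content_len;
       char     content[];   /* flexible array member (C99); gcc's
                                zero-length array is the older spelling */
   } ExamplePosting;

   /* One allocation holds the fixed fields plus the variable-length term
    * text tacked onto the end. */
   static ExamplePosting*
   ExamplePosting_new(unsigned doc_num, const char *text, size_t len)
   {
       ExamplePosting *post = malloc(sizeof(ExamplePosting) + len);
       if (!post) { return NULL; }
       post->doc_num     = doc_num;
       post->freq        = 1;
       post->content_len = len;
       memcpy(post->content, text, len);
       return post;
   }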

Alas we can't use such neat tricks in Javaland...

Heh. It works, but it would be better if it weren't there. And I think there is a solution which will be elegant both in Java and in KS. Let's consider a new private class, FlatPosting, intended to take the place of RawPosting:

  struct kino_FlatPosting {
     kino_VirtualTable    *_;
     kino_ref_t            ref;
     chy_u32_t             doc_num;
     char                 *content;
     size_t                content_len;
     char                 *blob;
     size_t                blob_len;
  };

FlatPosting's "content" and "blob" members will be pointers into shared buffers. In Java, they would be array offsets instead. Unlike RawPosting objects, each FlatPosting is the same size -- no flexible array members.

FlatPostings will be pooled within PostingPool parent objects, which will own the shared buffers. Because they are pooled and reused, allocation overhead stops being a concern and FlatPostings can be ordinary objects. When the shared buffer gets full, the PostingPool will flush a sorted run to disk.

Only PostingPool, PostingPoolQueue (the external sorter), and PostingsWriter will have access to the FlatPostings -- it's best to limit FlatPosting's visibility because managing access to the shared buffer is tricky.

PostingPool will be public. Subclasses of Posting will contribute to the PostingPool via a PostPool_Add() method.

  void
  PostPool_add(PostingPool *self, u32_t doc_num,
               char *content, size_t content_len,
               char *blob, size_t blob_len);
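
To pin down what I mean by "shared buffers", here's a very rough sketch of how the pool side might work. The struct layouts, capacity handling, and flush placeholder below are all invented for illustration (vtable and refcount members omitted), and it glosses over sorting and buffer growth entirely:

   #include <stdlib.h>
   #include <string.h>

   typedef unsigned int u32_t;

   typedef struct {
       u32_t   doc_num;
       char   *content;       /* points into the pool's shared buffer */
       size_t  content_len;
       char   *blob;          /* ditto */
       size_t  blob_len;
   } FlatPosting;

   typedef struct {
       char        *buf;          /* shared buffer owned by the pool */
       size_t       buf_cap;
       size_t       buf_used;
       FlatPosting *postings;     /* pooled, reused FlatPostings */
       size_t       num_postings;
       size_t       max_postings;
   } PostingPool;

   /* Placeholder: a real flush would sort the buffered postings and write
    * a run to temp storage before recycling the buffers. */
   static void
   PostPool_Flush(PostingPool *self)
   {
       self->buf_used     = 0;
       self->num_postings = 0;
   }

   /* Copy content and blob into the shared buffer and point a pooled
    * FlatPosting at them.  (Assumes a single add always fits in an empty
    * buffer -- a real version would have to handle oversized entries.) */
   void
   PostPool_add(PostingPool *self, u32_t doc_num,
                char *content, size_t content_len,
                char *blob, size_t blob_len)
   {
       if (   self->buf_used + content_len + blob_len > self->buf_cap
           || self->num_postings == self->max_postings
       ) {
           PostPool_Flush(self);
       }

       FlatPosting *post = &self->postings[self->num_postings++];
       post->doc_num     = doc_num;

       post->content     = self->buf + self->buf_used;
       post->content_len = content_len;
       memcpy(post->content, content, content_len);
       self->buf_used   += content_len;

       post->blob        = self->buf + self->buf_used;
       post->blob_len    = blob_len;
       memcpy(post->blob, blob, blob_len);
       self->buf_used   += blob_len;
   }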

Hmm, we'll need some way to track the field as well. There are a couple of ways of doing that...

I wish RawPosting's design were a little less cryptic. However, malloc() and free(), used by the normal object allocation infrastructure in KS, are comparatively expensive. Using a custom allocation strategy speeds things up, at the cost of some code clarity -- essentially the same tradeoff that
started this thread.

Makes sense.

Well, here's the thing. RawPosting works in KS and it's fast, but it's a hack -- good enough for an implementation detail, but not good enough to expose as a public API. And like Lucene's DocumentsWriter, KinoSearch's PostingsWriter is large and difficult to penetrate.

What I'm hoping is that this discussion will yield some innovations that allow me to refactor this code for clarity and interface simplicity. There's still going to be a lot of code, but hopefully it can be better broken up into more coherent modules.

I particularly want to see Posting as small as possible, because that's the one I'd like to be publicly subclassable.

FWIW, refactoring the external sorter has been pretty successful:

   KS::Util::SortExternal
   KS::Util::SortExRun
   KS::Test::Util::BBSortEx extends SortExternal
   KS::Test::Util::BBSortExRun extends SortExRun
   KS::Index::PostingPoolQueue extends SortExternal
   KS::Index::PostingPool extends SortExRun

SortExternal and SortExRun are abstract base classes. BBSortEx and BBSortExRun, which deal with ByteBufs, are test-only subclasses that air out the merging code. PostingPoolQueue and PostingPool deal with RawPosting objects.

PostingPoolQueue is nice and small now, and SortExternal and SortExRun are pretty coherent and well tested. The only one left in that hierarchy that's kind of bloated and fragile is PostingPool.
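
To give a feel for the division of labor (the names and signatures below are my shorthand, not the real KS API): SortExternal only ever talks to its runs through a tiny interface, which is why the ByteBuf-based test subclasses can exercise the merge logic without involving postings at all.

   #include <stddef.h>

   /* Hypothetical sketch of the run-side interface. */
   typedef struct SortExRunSketch SortExRunSketch;

   struct SortExRunSketch {
       /* Refill the run's in-memory cache from wherever its sorted data
        * lives (a temp file, an existing segment, a test fixture...).
        * Returns the number of elements now cached, 0 when exhausted. */
       size_t (*refill)(SortExRunSketch *self);

       /* Hand back the next cached element, in sorted order. */
       void*  (*fetch)(SortExRunSketch *self);
   };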

I see.  So Posting exposes the doc_num so the generic code
(SegPList_next) can check if the doc is deleted when iterating.

Yes. It would be nice to make it an accessor, but the PList_Next() method is inner loop code.

What is self->doc_base?  In Lucene, each segment doesn't "know" it's
doc base; MultiSegmentReader handles that "above" this layer.

Right, I had to move that down. The reason is that Posting objects get released into the wild, and once that happens they can't ask their parent container object about the offset any more. With a TermDocs object, you can only access the doc number through the TermDocs itself...

  while (termDocs.next()) {
    int docNum = termDocs.doc();
    ...
  }

... but with a PostingList, you can access the doc number through the Posting:

  while (postingList.next()) {
    Posting posting = postingList.getPosting();
    int docNum = posting.getDocNum();
    ...
  }
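
Under the hood, the segment-level next() can fold in the offset and the deletion check before the Posting escapes -- roughly like this (made-up names and helper stubs, not the actual SegPList_next):

   #include <stdbool.h>

   typedef struct {
       unsigned doc_num;        /* exposed directly: no accessor overhead */
       /* ... freq, positions, etc. ... */
   } PostingSketch;

   typedef struct SegPListSketch SegPListSketch;
   struct SegPListSketch {
       PostingSketch *posting;
       unsigned       doc_base;  /* this segment's offset into the index */
       /* deletions bit vector, stream readers, etc. */
   };

   /* Hypothetical helpers standing in for the real decode/deletions code. */
   bool SegPListSketch_read_raw(SegPListSketch *self);
   bool SegPListSketch_deleted(SegPListSketch *self, unsigned seg_doc_num);

   /* Advance to the next live doc, baking doc_base into doc_num so the
    * Posting stays self-describing once it's handed out. */
   static bool
   SegPListSketch_next(SegPListSketch *self)
   {
       while (SegPListSketch_read_raw(self)) {
           unsigned seg_doc_num = self->posting->doc_num;
           if (SegPListSketch_deleted(self, seg_doc_num)) { continue; }
           self->posting->doc_num = seg_doc_num + self->doc_base;
           return true;
       }
       return false;
   }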

So, first a Document is translated into an array of Tokens (one per
term occurrence).  Then this array gets sorted by term text, then term
position.  Then (2nd pass) you coalesce all adjacent Tokens that share
the same term text and write each into a RawPosting.  Then this array
of RawPostings is passed to the external sorter.

Yes.
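
Something like this, just to pin down the shape of those two passes -- everything here is a toy (field handling, position recording, and the real Token/RawPosting types are elided or invented):

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>

   typedef struct {
       const char *text;    /* term text */
       int         pos;     /* position of this occurrence in the doc */
   } Token;

   /* Sort by term text first, position second. */
   static int
   compare_tokens(const void *va, const void *vb)
   {
       const Token *a = (const Token*)va;
       const Token *b = (const Token*)vb;
       int diff = strcmp(a->text, b->text);
       return diff ? diff : a->pos - b->pos;
   }

   int main(void)
   {
       /* The analyzer has produced one Token per term occurrence. */
       Token tokens[] = {
           { "quick", 1 }, { "fox", 3 }, { "quick", 5 }, { "brown", 2 },
       };
       size_t num_tokens = sizeof(tokens) / sizeof(tokens[0]);

       /* First pass: sort the occurrences. */
       qsort(tokens, num_tokens, sizeof(Token), compare_tokens);

       /* Second pass: coalesce adjacent Tokens that share term text.  A
        * real implementation would serialize doc num, freq, and positions
        * into a RawPosting here instead of printing. */
       for (size_t i = 0; i < num_tokens; ) {
           size_t j = i + 1;
           while (j < num_tokens && strcmp(tokens[j].text, tokens[i].text) == 0) {
               j++;
           }
           printf("term '%s': freq %lu\n", tokens[i].text,
                  (unsigned long)(j - i));
           i = j;
       }
       return 0;
   }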

You know what I'd really like to do though? Kill off Token and replace it with Posting.

In Lucene we take each token and do a hash lookup and then write the
necessary bytes immediately.  We then only sort, once, during flush.

I'm a little confused. Are you saying you have a trick where Lucene sorts less often? What you refer to as a "second pass" in KS doesn't involve sorting (but it looks like you grok that?). Also, KS doesn't use hashing to group Tokens.

We discussed this algo once before.

<http://www.archivum.info/java-dev@lucene.apache.org/2007-04/msg00091.html>

Your reaction:

  OOOOOH!  I like that approach!

  So you basically do not "de-dup" by field+term on your first pass
  through the tokens in the doc (which is "roughly" what that hash
  does).  Instead, append all tokens in an array, then sort first by
  field+text and second by position?  This is done for each document
  right?

  This seems like it could be a major win!

:)

(Except, term vectors still must sort every document's Tokens.)

In KS, TermVectorsWriter gets passed the already-inverted array of Tokens. This object, currently named TokenBatch but perhaps to be renamed Inversion, is what Analyzers operate on in KS.

Instead, segments are now allowed to share these "doc stores".  Each
segment can reference another segment's doc stores, starting at its
own offset.

Mmm, that's clever. :) I wish it didn't break the segmentation model, but it's a sensible optimization -- especially because you write segments so much more often.

KS doesn't presently work with threads -- Perl's threading model is kind of
awkward.

Oooh, ouch: the world is quickly going multi-core/cpu.  I think thread
concurrency is very important.  But this sounds like a Perl threading
limitation, imposed on you.

Yeah. I'd say it's a higher priority to finish the C API and get POSIX threads working than it is to get Perl threads working.

OK this sounds like it'd have good concurrency.  Each thread would
accumulate many docs worth of RawPostings arrays.  Though, holding the
global lock while doing the merge sort into the PostingPoolQueue's
cache might become a bottleneck?  How often does that happen and what
net percentage of the time does it normally take?

It happens at the end of an indexing session, after the last doc has been processed, when $invindexer->finish gets called -- the analogue to IndexWriter.close(). (Probably I should put it into $invindexer->prepare_commit. ;)

The amount of time it takes to finish a segment varies a lot. I haven't benchmarked it rigorously, but my guess is that it averages between 5 and 25 percent of total time.

In Lucene, each thread writes into its own private RawPosting
equivalent (stored in a thread state class), until it's time to flush
a segment, at which point we do a merge sort of all thread states
while holding the global lock.

Sounds like about the same thing.

When a PostingPool flushes a sorted run to temp storage, it actually uses the real file format. And when it's time to merge an existing segment, a PostingPool gets assigned to it and reads it back in. So, when the external sorting apparatus goes into fetch mode, the process of recovering RawPosting/FlatPosting objects is essentially the same whether they're being recovered from an existing segment or from temporary storage.

We actually lose some concurrency here: if a large doc is being
indexed by one thread, and a tiny doc by another thread, that tiny doc
can't write until the large doc finishes, and this ties up the tiny
doc's thread state such that if that thread tries to index
another doc it will have to wait until the previous doc gets written.

Writing stored fields shouldn't be CPU intensive, though. My guess is that you're i/o bound at that point.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

