On Apr 13, 2008, at 2:28 AM, Michael McCandless wrote:
In fact, RawPosting "objects" aren't even malloc'd individually --
they're assigned memory chunks by a custom MemoryPool object (which
*does* have to maintain awareness of alignment issues). You can't do
this:
Right, it's that alignment waste that I'd think you need to "amortize"
onto the cost of each RawPosting.
Worse than the alignment waste is that it's impossible to know just
how much memory each RawPosting requires until the export operation
completes, but you have to allocate it up front. That means planning
for a worst-case scenario. It's analogous to packing UTF-8 from UTF-16
source.
The way I hacked it was to always ask for the worst-case amount when
allocating each RawPosting, but allow quick resizing via an immediate
call to MemPool_Resize().
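A rough sketch of the shape of that hack (names and layout here are
illustrative -- this is a toy bump allocator, not the real KS
MemoryPool):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy bump allocator standing in for KS's MemoryPool -- the names and
 * layout are illustrative, not the real KinoSearch implementation. */
typedef struct {
    char  *buf;
    size_t cap;
    size_t used;
    char  *last;   /* most recent allocation, the only one resizable */
} MemPool;

char*
MemPool_Grab(MemPool *self, size_t amount)
{
    if (self->used + amount > self->cap) return NULL; /* real pool grows */
    self->last  = self->buf + self->used;
    self->used += amount;
    return self->last;
}

/* Trim (or grow, space permitting) the most recent allocation.  This
 * is what makes "ask for the worst case, then shrink" cheap: it's just
 * pointer arithmetic, no copying or freeing. */
void
MemPool_Resize(MemPool *self, char *ptr, size_t new_size)
{
    assert(ptr == self->last);  /* only the last grab may be resized */
    self->used = (size_t)(self->last - self->buf) + new_size;
}
```

The catch, as noted below, is that the resize has to happen immediately,
before the next grab -- which is exactly the API awkwardness that leaks
into Posting subclasses.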
I'm perfectly comfortable with such hacks when they are contained to a
small scope, but this one's not: each Posting subclass has to deal
with the MemPool_Resize API or waste memory -- theoretically including
public subclasses. :( But perhaps we can do better...
Flexible arrays, which are a lesser-known but reliable hack, have to
be the last element in a C struct:
http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
Alas we can't use such neat tricks in Javaland...
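For anyone following along, the trick looks roughly like this
(hypothetical struct and field names, not the actual RawPosting
definition):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical RawPosting-style struct using a flexible array member:
 * the variable-length payload shares a single allocation with the
 * fixed-size header fields.  Names are illustrative only. */
typedef struct {
    unsigned doc_num;
    size_t   content_len;
    char     content[];   /* flexible array member: must come last */
} VarPosting;

VarPosting*
VarPosting_new(unsigned doc_num, const char *content, size_t len)
{
    /* One malloc covers header plus payload -- no second allocation,
     * no pointer chasing. */
    VarPosting *self = malloc(sizeof(VarPosting) + len);
    self->doc_num     = doc_num;
    self->content_len = len;
    memcpy(self->content, content, len);
    return self;
}
```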
Heh. It works, but it would be better if it weren't there. And I
think there is a solution which will be elegant both in Java and in
KS. Let's consider a new private class, FlatPosting, intended to take
the place of RawPosting:
struct kino_FlatPosting {
    kino_VirtualTable *_;
    kino_ref_t         ref;
    chy_u32_t          doc_num;
    char              *content;
    size_t             content_len;
    char              *blob;
    size_t             blob_len;
};
FlatPosting's "content" and "blob" members will be pointers into
shared buffers. In Java, they would be array offsets instead. Unlike
RawPosting objects, each FlatPosting is the same size -- no flexible
array members.
FlatPostings will be pooled within PostingPool parent objects, which
will own the shared buffers. Because they are pooled and reused,
allocation overhead is no longer crucial and FlatPostings can be
ordinary objects. When the shared buffer gets full, the PostingPool
will flush a sorted run to disk.
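Something like this, in miniature (sizes, names, and the full/flush
signaling are all placeholders for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_BUF_SIZE     4096
#define POOL_MAX_POSTINGS 64

/* Illustrative sketch only: FlatPostings are fixed-size and borrow
 * views into a buffer owned by the pool.  In Java, content would be
 * an array offset instead of a pointer. */
typedef struct {
    unsigned    doc_num;
    const char *content;      /* points into the pool's shared buffer */
    size_t      content_len;
} FlatPosting;

typedef struct {
    char        buf[POOL_BUF_SIZE];
    size_t      used;
    FlatPosting postings[POOL_MAX_POSTINGS];  /* pooled and reused */
    size_t      count;
} PostingPool;

/* Returns 0 when the buffer or pool is full -- the real PostingPool
 * would flush a sorted run to disk at that point and reset. */
int
PostingPool_Add(PostingPool *self, unsigned doc_num,
                const char *content, size_t len)
{
    if (self->used + len > POOL_BUF_SIZE
        || self->count >= POOL_MAX_POSTINGS
    ) {
        return 0;
    }
    FlatPosting *posting = &self->postings[self->count++];
    memcpy(self->buf + self->used, content, len);
    posting->doc_num     = doc_num;
    posting->content     = self->buf + self->used;
    posting->content_len = len;
    self->used += len;
    return 1;
}
```

Because every FlatPosting is the same size and borrows rather than
owns its content, allocation cost drops out of the inner loop entirely.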
Only PostingPool, PostingPoolQueue (the external sorter), and
PostingsWriter will have access to the FlatPostings -- it's best to
limit FlatPosting's visibility because managing access to the shared
buffer is tricky.
PostingPool will be public. Subclasses of Posting will contribute to
the PostingPool via a PostPool_Add() method.
void
PostPool_Add(PostingPool *self, u32_t doc_num,
             char *content, size_t content_len,
             char *blob, size_t blob_len);
Hmm, we'll need some way to track field as well. There's a couple of
ways of doing that...
I wish RawPosting's design was a little less cryptic. However,
malloc() and free(), used by the normal object allocation
infrastructure in KS, are comparatively expensive. Using a custom
allocation strategy speeds things up, at the cost of some code
clarity -- essentially the same tradeoff that started this thread.
Makes sense.
Well, here's the thing. RawPosting works in KS and it's fast, but
it's a hack -- good enough for an implementation detail, but not good
enough to expose as a public API. And like Lucene's DocumentsWriter,
KinoSearch's PostingsWriter is large and difficult to penetrate.
What I'm hoping is that this discussion will yield some innovations
that allow me to refactor this code for clarity and interface
simplicity. There's still going to be a lot of code, but hopefully it
can be better broken up into more coherent modules.
I particularly want to see Posting as small as possible, because
that's the one I'd like to be publicly subclassable.
FWIW, refactoring the external sorter has been pretty successful:
KS::Util::SortExternal
KS::Util::SortExRun
KS::Test::Util::BBSortEx extends SortExternal
KS::Test::Util::BBSortExRun extends SortExRun
KS::Index::PostingPoolQueue extends SortExternal
KS::Index::PostingPool extends SortExRun
SortExternal and SortExRun are abstract base classes. BBSortEx and
BBSortExRun, which deal with ByteBufs, are test-only subclasses that
air out the merging code. PostingPoolQueue and PostingPool deal with
RawPosting objects.
PostingPoolQueue is nice and small now, and SortExternal and SortExRun
are pretty coherent and well tested. The only one left in that
hierarchy that's kind of bloated and fragile is PostingPool.
I see. So Posting exposes the doc_num so the generic code
(SegPList_next) can check if the doc is deleted when iterating.
Yes. It would be nice to make it an accessor, but the PList_Next()
method is inner loop code.
What is self->doc_base? In Lucene, each segment doesn't "know" its
doc base; MultiSegmentReader handles that "above" this layer.
Right, I had to move that down. The reason is that Posting objects
get released into the wild, and once that happens they can't ask their
parent container object about the offset any more. With a TermDocs
object, you can only access the doc number through the TermDocs
itself...
while (termDocs.next()) {
    int docNum = termDocs.doc();
    ...
}
... but with a PostingList, you can access the doc number through the
Posting:
while (postingList.next()) {
    Posting posting = postingList.getPosting();
    int docNum = posting.getDocNum();
    ...
}
So, first a Document is translated into an array of Tokens (one per
term occurrence). Then this array gets sorted by term text, then term
position. Then (2nd pass) you coalesce all adjacent Tokens that share
the same term text and write each into a RawPosting. Then this array
of RawPostings is passed to the external sorter.
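The sort-then-coalesce step can be sketched like so (Token here is a
toy stand-in for the real Token class, and I'm only counting the runs
that would each become one RawPosting):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the two-pass inversion described above: sort Tokens by
 * term text, then by position; runs of equal text then sit adjacent
 * and can be coalesced.  Toy types, not the real KS classes. */
typedef struct {
    const char *text;
    int         pos;
} Token;

static int
compare_tokens(const void *va, const void *vb)
{
    const Token *a = va;
    const Token *b = vb;
    int cmp = strcmp(a->text, b->text);
    return cmp ? cmp : a->pos - b->pos;
}

/* Sort, then count distinct terms -- each run of equal texts would be
 * written out as one RawPosting carrying all of its positions. */
size_t
coalesce_count(Token *tokens, size_t num)
{
    size_t runs = 0;
    size_t i;
    qsort(tokens, num, sizeof(Token), compare_tokens);
    for (i = 0; i < num; i++) {
        if (i == 0 || strcmp(tokens[i].text, tokens[i - 1].text) != 0) {
            runs++;
        }
    }
    return runs;
}
```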
Yes.
You know what I'd really like to do though? Kill off Token and
replace it with Posting.
In Lucene we take each token and do a hash lookup and then write the
necessary bytes immediately. We then only sort, once, during flush.
I'm a little confused. Are you saying you have a trick where Lucene
sorts less often? What you refer to as a "second pass" in KS doesn't
involve sorting (but it looks like you grok that?). Also, KS doesn't
use hashing to group Tokens.
We discussed this algo once before.
<http://www.archivum.info/java-dev@lucene.apache.org/2007-04/msg00091.html>
Your reaction:
OOOOOH! I like that approach!
So you basically do not "de-dup" by field+term on your first pass
through the tokens in the doc (which is "roughly" what that hash
does). Instead, append all tokens in an array, then sort first by
field+text and second by position? This is done for each document
right?
This seems like it could be a major win!
:)
(Except, term vectors still must sort every document's Tokens.)
In KS, TermVectorsWriter gets passed the already-inverted array of
Tokens. This object, currently named TokenBatch but perhaps to be
renamed Inversion, is what Analyzers operate on in KS.
Instead, segments are now allowed to share these "doc stores". Each
segment can reference another segment's doc stores, starting at its
own offset.
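My reading of the scheme, in miniature (toy struct, not Lucene's actual
SegmentInfo): each segment records which shared store it points at and
the offset of its first document within it.

```c
#include <assert.h>

/* Toy model of shared doc stores: several segments reference one
 * shared stored-fields file, each recording the offset of its first
 * document within that store.  Field names are illustrative. */
typedef struct {
    int doc_store_id;      /* which shared stored-fields file */
    int doc_store_offset;  /* this segment's first doc in that file */
    int num_docs;
} SegmentInfo;

/* Translate a segment-local doc number into an index within the
 * shared doc store. */
int
store_doc_index(const SegmentInfo *seg, int local_doc)
{
    return seg->doc_store_offset + local_doc;
}
```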
Mmm, that's clever. :) I wish it didn't break the segmentation model,
but it's a sensible optimization -- especially because you write
segments so much more often.
KS doesn't presently work with threads -- Perl's threading model is
kind of
awkward.
Oooh, ouch: the world is quickly going multi-core/CPU. I think thread
concurrency is very important. But this sounds like a Perl threading
limitation, imposed on you.
Yeah. I'd say it's a higher priority to finish the C API and get
POSIX threads working than it is to get Perl threads working.
OK this sounds like it'd have good concurrency. Each thread would
accumulate many docs worth of RawPostings arrays. Though, holding the
global lock while doing the merge sort into the PostingPoolQueue's
cache might become a bottleneck? How often does that happen and what
net percentage of the time does it normally take?
It happens at the end of an indexing session, after the last doc has
been processed, when $invindexer->finish gets called -- the analogue
to IndexWriter.close(). (Probably I should put it into
$invindexer->prepare_commit.)
The amount of time it takes to finish a segment varies a lot. I
haven't benchmarked it rigorously, but my guess is that it averages
between 5 and 25 percent of total time.
In Lucene, each thread writes into its own private RawPosting
equivalent (stored in a thread state class), until it's time to flush
a segment, at which point we do a merge sort of all thread states
while holding the global lock.
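The flush-time merge under the global lock is just a k-way merge of
sorted per-thread runs, roughly (the real code merges postings, not
bare doc numbers -- this only shows the shape of the operation):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Each thread state holds a sorted run; under the global lock we pick
 * the smallest head repeatedly.  Toy structs, illustrative names. */
typedef struct {
    const int *docs;   /* sorted doc numbers accumulated by one thread */
    size_t     len;
    size_t     pos;    /* merge cursor */
} ThreadState;

size_t
merge_thread_states(ThreadState *states, size_t num_states,
                    int *out, size_t out_cap)
{
    size_t written = 0;
    while (written < out_cap) {
        int    best_doc = INT_MAX;
        size_t best     = num_states;  /* sentinel: nothing found yet */
        size_t i;
        for (i = 0; i < num_states; i++) {
            ThreadState *ts = &states[i];
            if (ts->pos < ts->len && ts->docs[ts->pos] < best_doc) {
                best_doc = ts->docs[ts->pos];
                best     = i;
            }
        }
        if (best == num_states) break;  /* all runs exhausted */
        out[written++] = best_doc;
        states[best].pos++;
    }
    return written;
}
```

A heap would replace the linear scan once the thread count grows, but
for a handful of threads the scan is fine.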
Sounds like about the same thing.
When a PostingPool flushes a sorted run to temp storage, it actually
uses the real file format. And when it's time to merge an existing
segment, a PostingPool gets assigned to it and reads it back in. So,
when the external sorting apparatus goes into fetch mode, the process
of recovering RawPosting/FlatPosting objects is essentially the same
whether they're being recovered from an existing segment or from
temporary storage.
We actually lose some concurrency here: if a large doc is being
indexed by one thread, and a tiny doc by another thread, that tiny doc
can't write until the large doc finishes, and this ties up the tiny
doc's thread state such that if that thread tries to index
another doc it will have to wait until the previous doc gets written.
Writing stored fields shouldn't be CPU intensive, though. My guess is
that you're i/o bound at that point.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/