On Apr 12, 2008, at 2:29 AM, Michael McCandless wrote:
The total allocation for any given RawPosting can be calculated
like so:
offsetof(RawPosting, blob) + posting->content_len + posting-
>aux_len
Probably you also need to add in bytes lost to alignment requirements,
when content_len + aux_len doesn't "align" to a 4/8 byte boundary?
Good thought, but "aux" isn't malloc'd and never gets dereferenced,
just copied. It's not even a real member -- it's just an artificially
bounded slice of bytes within the single RawPosting allocation.
In fact, RawPosting "objects" aren't even malloc'd individually --
they're assigned memory chunks by a custom MemoryPool object (which
*does* have to maintain awareness of alignment issues). You can't do
this:
free(raw_posting);
Instead, you do this, which frees all the memory previously doled out
by the MemoryPool, in a single action which is much faster than lots
of individual calls to free():
MemPool_Release_All(mem_pool); /* mass liberation of RawPostings */
It sounds like external sorter deals with RawPosting and need not know
any details of the aux bytes being carried along.
Yes.
But: why distinguish between Posting and RawPosting?
RawPosting can't be subclassed, because of that flexible array
member. Flexible arrays, which are a lesser-known but reliable hack,
have to be the last element in a C struct:
http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
I wish RawPosting's design was a little less cryptic. However,
malloc() and free(), used by the normal object allocation
infrastructure in KS, are comparatively expensive. Using a custom
allocation strategy speeds things up, at the cost of some code clarity
-- essentially the same tradeoff that started this thread.
Who uses Posting but not RawPosting?
Posting is an abstract base class. MatchPosting, ScorePosting, etc.
subclass it.
PostingList uses Posting directly. Here's the function that
implements SegPostingList's Next() method.
u32_t
SegPList_next(SegPostingList *self)
{
InStream *const post_stream = self->post_stream;
DelDocs *const deldocs = self->deldocs;
Posting *const posting = self->posting;
while (1) {
/* Bail if we're out of docs. */
if (self->count >= self->doc_freq) {
Post_Reset(posting, self->doc_base);
return 0;
}
self->count++;
Post_Read_Record(posting, post_stream);
/* If the doc isn't deleted... success! */
if (deldocs == NULL) {
break;
}
else {
u32_t doc_minus_base = posting->doc_num - self-
>doc_base;
if ( !DelDocs_Get(deldocs, doc_minus_base) ) {
break;
}
}
}
return posting->doc_num;
}
Or, maybe it's because you do two passes when indexing a document?
First "invert" it, into array of Posting instances,
KS deals with Tokens grouped within arrays, rather than streams of
Tokens a la Lucene. Inverting is simply a matter of sorting the
array, then grouping Tokens by common term text. One RawPosting gets
created for each unique term.
OK I now see that what we call Posting really should be called
PostingList: each instance of this class, in DW, tracks all
documents
that contained that term.
I suppose that would be more accurate... though in KS, the
"PostingList"
class is the replacement for TermDocs, TermPositions, etc.
I see. Though that is really a search-time vs indexing-time
distinction, referencing the same thing (a "posting list"). Maybe
RawPostingList, to make it clearer that these are the raw bytes for
many docs produced during indexing?
That usage would fit within the current KS nomenclature, for whatever
that's worth WRT Lucene. ;)
In Lucene, a Posting instance holds many documents.
I think renaming that class might improve accuracy a bit (I assume
it's not public). I've never seen "posting" defined as something
which could hold multiple documents -- it's either one term indexing
one document (the definition KS uses), or one instance of one term
indexing one document.
The latter meaning seems to be a little more common, but it didn't map
as well to the class hierarchy I wanted to define.
In Lucene we just flush a true segment when RAM is full. For
early versions of LUCENE-843 I was flushing "intermediate" segments
(like KS's runs), but the cost of merging these runs wasn't that much
better than the cost of merging normal segments, and, it added
unexpected cost to close(), so I fell back to just flushing normal
segments.
Really, you gave up on that? Then aren't you moving stored fields and
term vectors around more often?
ow do multiple threads interact here? Does each one work with its
own ExternalSorter, and then all threads and all of their runs do
merge sort at the end? Or do multiple threads feed a single external
sorter? How finely do you require thread synchronization?
KS doesn't presently work with threads -- Perl's threading model is
kind of awkward. But here's the way it would probably work best: One
external sorter object (a PostingPoolQueue). One sorted run object (a
PostingPool) per thread. Thread synchronization is required whenever
a PostingPool gets written to temp storage, since all pools share a
common OutStream. Synchronization is also required at finish-time
during periodic calls to PostPoolQ_Refill(pool_q), which is when
RawPosting objects are merge sorted from the PostingPools into a cache
in the PostingPoolQueue.
The PostingPool objects, each of which represents a sorted run,
actually do most of the work.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]