Re: Pooling of posting objects in DocumentsWriter

Marvin Humphrey Sat, 12 Apr 2008 15:05:11 -0700


On Apr 12, 2008, at 2:29 AM, Michael McCandless wrote:

The total allocation for any given RawPosting can be calculatedlike so:
offsetof(RawPosting, blob) + posting->content_len + posting->aux_len
Probably you also need to add in bytes lost to alignment requirements,
when content_len + aux_len doesn't "align" to a 4/8 byte boundary?

Good thought, but "aux" isn't malloc'd and never gets dereferenced,just copied. It's not even a real member -- it's just an artificiallybounded slice of bytes within the single RawPosting allocation.

In fact, RawPosting "objects" aren't even malloc'd individually --they're assigned memory chunks by a custom MemoryPool object (which*does* have to maintain awareness of alignment issues). You can't dothis:


   free(raw_posting);

Instead, you do this, which frees all the memory previously doled outby the MemoryPool, in a single action which is much faster than lotsof individual calls to free():


   MemPool_Release_All(mem_pool);  /* mass liberation of RawPostings */

It sounds like external sorter deals with RawPosting and need not know
any details of the aux bytes being carried along.


Yes.

But: why distinguish between Posting and RawPosting?

RawPosting can't be subclassed, because of that flexible arraymember. Flexible arrays, which are a lesser-known but reliable hack,have to be the last element in a C struct:


  http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html

I wish RawPosting's design was a little less cryptic. However,malloc() and free(), used by the normal object allocationinfrastructure in KS, are comparatively expensive. Using a customallocation strategy speeds things up, at the cost of some code clarity-- essentially the same tradeoff that started this thread.

Who uses Posting but not RawPosting?

Posting is an abstract base class. MatchPosting, ScorePosting, etc.subclass it.

PostingList uses Posting directly. Here's the function thatimplements SegPostingList's Next() method.


    u32_t
    SegPList_next(SegPostingList *self)
    {
        InStream *const post_stream = self->post_stream;
        DelDocs  *const deldocs     = self->deldocs;
        Posting  *const posting     = self->posting;

        while (1) {
            /* Bail if we're out of docs. */
            if (self->count >= self->doc_freq) {
                Post_Reset(posting, self->doc_base);
                return 0;
            }
            self->count++;

            Post_Read_Record(posting, post_stream);

            /* If the doc isn't deleted... success! */
            if (deldocs == NULL) {
                break;
            }
            else {

u32_t doc_minus_base = posting->doc_num - self->doc_base;

                if ( !DelDocs_Get(deldocs, doc_minus_base) ) {
                    break;
                }
            }
        }

        return posting->doc_num;
    }

Or, maybe it's because you do two passes when indexing a document?
First "invert" it, into array of Posting instances,

KS deals with Tokens grouped within arrays, rather than streams ofTokens a la Lucene. Inverting is simply a matter of sorting thearray, then grouping Tokens by common term text. One RawPosting getscreated for each unique term.

OK I now see that what we call Posting really should be called
PostingList: each instance of this class, in DW, tracks alldocuments
that contained that term.
I suppose that would be more accurate... though in KS, the"PostingList"
class is the replacement for TermDocs, TermPositions, etc.


I see.  Though that is really a search-time vs indexing-time
distinction, referencing the same thing (a "posting list").  Maybe
RawPostingList, to make it clearer that these are the raw bytes for
many docs produced during indexing?

That usage would fit within the current KS nomenclature, for whateverthat's worth WRT Lucene. ;)

In Lucene, a Posting instance holds many documents.

I think renaming that class might improve accuracy a bit (I assumeit's not public). I've never seen "posting" defined as somethingwhich could hold multiple documents -- it's either one term indexingone document (the definition KS uses), or one instance of one termindexing one document.

The latter meaning seems to be a little more common, but it didn't mapas well to the class hierarchy I wanted to define.

In Lucene we just flush a true segment when RAM is full.  For
early versions of LUCENE-843 I was flushing "intermediate" segments
(like KS's runs), but the cost of merging these runs wasn't that much
better than the cost of merging normal segments, and, it added
unexpected cost to close(), so I fell back to just flushing normal
segments.

Really, you gave up on that? Then aren't you moving stored fields andterm vectors around more often?

ow do multiple threads interact here?  Does each one work with its
own ExternalSorter, and then all threads and all of their runs do
merge sort at the end?  Or do multiple threads feed a single external
sorter?  How finely do you require thread synchronization?

KS doesn't presently work with threads -- Perl's threading model iskind of awkward. But here's the way it would probably work best: Oneexternal sorter object (a PostingPoolQueue). One sorted run object (aPostingPool) per thread. Thread synchronization is required whenevera PostingPool gets written to temp storage, since all pools share acommon OutStream. Synchronization is also required at finish-timeduring periodic calls to PostPoolQ_Refill(pool_q), which is whenRawPosting objects are merge sorted from the PostingPools into a cachein the PostingPoolQueue.

The PostingPool objects, each of which represents a sorted run,actually do most of the work.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Pooling of posting objects in DocumentsWriter

Reply via email to