Re: Pooling of posting objects in DocumentsWriter

2008-04-17 Thread Michael McCandless
Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
>  On Apr 13, 2008, at 2:28 AM, Michael McCandless wrote:
>
> > > In fact, RawPosting "objects" aren't even malloc'd individually -- they're
> > > assigned memory chunks by a custom MemoryPool object (which *does* have to
> > > maintain awareness of alignment issues).  You can't do this:
> >
> > Right, it's that alignment waste that I'd think you need to "amortize"
> > onto the cost of each RawPosting.
>
> Worse than the alignment waste is that it's impossible to know just how
> much memory is required by each RawPosting before the export operation
> completes, but you have to allocate it up front.  That means planning for a
> worst case scenario.  It's analogous to packing UTF-8 from UTF-16 source.
>
> The way I hacked it was to always ask for the worst-case amount when
> allocating each RawPosting, but allow quick resizing via an immediate call
> to MemPool_Resize().

But that resizing requires copying bytes right?

> I'm perfectly comfortable with such hacks when they are contained to a
> small scope, but this one's not: each Posting subclass has to deal with the
> MemPool_Resize API or waste memory -- theoretically including public
> subclasses.  :(  But perhaps we can do better...

We have this even worse in Lucene because we keep a Posting
(RawPostingList) per-term "open" for many docs (until we flush).  This
is why we use shared byte[] slices: it allows us to quickly "grow" the
size of the byte blob as needed, without wasting too many bytes.  And
no extra copying is done (until it's time to flush).
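
To make the slice idea concrete, here is a stripped-down sketch (not the
actual Lucene byte-slice/ByteBlockPool code; sizes and names are
illustrative): each term appends into a small slice inside one shared
buffer, and when the slice fills, a larger follow-on slice is allocated at
the end of the buffer and reached through a 4-byte forward address, so the
bytes already written are never copied.

import java.util.Arrays;

/** Toy shared-slice pool: many growing byte streams live in one byte[].
 *  A full slice is extended by allocating a larger slice at the end of
 *  the buffer and linking to it via a 4-byte forward address. */
class SlicePool {
    private static final int[] LEVEL_SIZE = {8, 16, 32, 64, 128, 256};
    byte[] buf = new byte[1 << 16];
    private int used = 0;

    /** Cursor for one stream of chained slices. */
    class Writer {
        private int pos, end, level;

        Writer() {
            pos = alloc(LEVEL_SIZE[0]);
            end = pos + LEVEL_SIZE[0] - 4;   // last 4 bytes reserved for the link
        }

        void writeByte(byte b) {
            if (pos == end) {                // slice full: chain a bigger one
                level = Math.min(level + 1, LEVEL_SIZE.length - 1);
                int next = alloc(LEVEL_SIZE[level]);
                buf[end]     = (byte) (next >>> 24);   // forward address
                buf[end + 1] = (byte) (next >>> 16);
                buf[end + 2] = (byte) (next >>> 8);
                buf[end + 3] = (byte) next;
                pos = next;
                end = next + LEVEL_SIZE[level] - 4;
            }
            buf[pos++] = b;
        }
    }

    private int alloc(int size) {
        if (used + size > buf.length) {
            buf = Arrays.copyOf(buf, Math.max(buf.length * 2, used + size));
        }
        int start = used;
        used += size;
        return start;
    }
}

A reader walks the same chain by jumping through the 4-byte links, so at
flush time the slices are traversed once and written out; the only waste
is the link bytes per slice.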

Whereas, KS "closes" the RawPosting for each unique term in the
document, after the document is done being indexed.  Is that how you
get your "worst case"?  (Because you know how many occurrences a given
term X has in the document).

> > > Flexible arrays, which are a lesser-known but reliable hack, have to be
> > > the last element in a C struct:
> > >
> > >  http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
> >
> > Alas we can't use such neat tricks in Javaland...
>
> Heh.  It works, but it would be better if it weren't there.  And I think
> there is a solution which will be elegant both in Java and in KS.  Let's
> consider a new private class, FlatPosting, intended to take the place of
> RawPosting:
>
>   struct kino_FlatPosting {
>       kino_VirtualTable *_;
>       kino_ref_t         ref;
>       chy_u32_t          doc_num;
>       char              *content;
>       size_t             content_len;
>       char              *blob;
>       size_t             blob_len;
>   };
>
> FlatPosting's "content" and "blob" members will be pointers into shared
> buffers.  In Java, they would be array offsets instead.  Unlike RawPosting
> objects, each FlatPosting is the same size -- no flexible array members.
>
> FlatPostings will be pooled within PostingPool parent objects, which will
> own the shared buffers.  Because they are pooled and reused, allocation
> overhead is no longer crucial and FlatPostings can be ordinary objects.
> When the shared buffer gets full, the PostingPool will flush a sorted run to
> disk.
>
> Only PostingPool, PostingPoolQueue (the external sorter), and
> PostingsWriter will have access to the FlatPostings --- it's best to limit
> FlatPosting's visibility because managing access to the shared buffer is
> tricky.
>
> PostingPool will be public.  Subclasses of Posting will contribute to the
> PostingPool via a PostPool_Add() method.
>
>   void
>   PostPool_add(PostingPool *self, u32_t doc_num,
>                char *content, size_t content_len,
>                char *blob, size_t blob_len);

I like this approach, but, I don't like that this requires an instance
of FlatPosting per document X term.  In Lucene, we only need one
instance of our Posting (to be renamed to RawPostingList) per term
since last flush.

> Hmm, we'll need some way to track field as well.  There's a couple of ways
> of doing that...

In Lucene we hold each Field's postings hash separately.  Ie a
separate class (DocumentsWriterFieldData) per thread X field.
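
Roughly, the shape is something like this sketch (invented names, not the
real DocumentsWriter internals): one object per thread X field, and that
object owns the term -> posting hash for its field.

import java.util.HashMap;
import java.util.Map;

/** Illustrative per-thread indexing state: one FieldData per field, each
 *  with its own term -> posting hash that lives across documents until
 *  the next flush. */
class ThreadState {
    private final Map<String, FieldData> fields = new HashMap<String, FieldData>();

    FieldData fieldData(String fieldName) {
        FieldData fd = fields.get(fieldName);
        if (fd == null) {
            fd = new FieldData(fieldName);
            fields.put(fieldName, fd);
        }
        return fd;
    }
}

class FieldData {
    final String field;
    private final Map<String, PostingData> postings = new HashMap<String, PostingData>();

    FieldData(String field) { this.field = field; }

    void addOccurrence(String term, int docID, int position) {
        PostingData p = postings.get(term);
        if (p == null) {
            p = new PostingData();
            postings.put(term, p);
        }
        p.append(docID, position);   // real code appends bytes to shared slices
    }
}

class PostingData {
    void append(int docID, int position) { /* delta-encode into byte slices */ }
}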

> > > I wish RawPosting's design was a little less cryptic.  However, malloc()
> > > and free(), used by the normal object allocation infrastructure in KS, are
> > > comparatively expensive.  Using a custom allocation strategy speeds things
> > > up, at the cost of some code clarity -- essentially the same tradeoff that
> > > started this thread.
> >
> > Makes sense.
>
> Well, here's the thing.  RawPosting works in KS and it's fast, but it's a
> hack -- good enough for an implementation detail, but not good enough to
> expose as a public API. And like Lucene's DocumentsWriter, KinoSearch's
> PostingsWriter is large and difficult to penetrate.
>
> What I'm hoping is that this discussion will yield some innovations that
> allow me to refactor this code for clarity and interface simplicity.
> There's still going to be a lot of code, but hopefully it can be better
> broken up into more coherent

Re: Pooling of posting objects in DocumentsWriter

2008-04-15 Thread Marvin Humphrey


On Apr 13, 2008, at 2:28 AM, Michael McCandless wrote:

In fact, RawPosting "objects" aren't even malloc'd individually -- they're
assigned memory chunks by a custom MemoryPool object (which *does* have to
maintain awareness of alignment issues).  You can't do this:


Right, it's that alignment waste that I'd think you need to "amortize"
onto the cost of each RawPosting.


Worse than the alignment waste is that it's impossible to know just how
much memory is required by each RawPosting before the export operation
completes, but you have to allocate it up front.  That means planning for
a worst case scenario.  It's analogous to packing UTF-8 from UTF-16 source.


The way I hacked it was to always ask for the worst-case amount when  
allocating each RawPosting, but allow quick resizing via an immediate  
call to MemPool_Resize().


I'm perfectly comfortable with such hacks when they are contained to a  
small scope, but this one's not: each Posting subclass has to deal  
with the MemPool_Resize API or waste memory -- theoretically including  
public subclasses.  :(  But perhaps we can do better...


Flexible arrays, which are a lesser-known but reliable hack, have to be the
last element in a C struct:

 http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html


Alas we can't use such neat tricks in Javaland...


Heh.  It works, but it would be better if it weren't there.  And I  
think there is a solution which will be elegant both in Java and in  
KS.  Let's consider a new private class, FlatPosting, intended to take  
the place of RawPosting:


  struct kino_FlatPosting {
      kino_VirtualTable *_;
      kino_ref_t         ref;
      chy_u32_t          doc_num;
      char              *content;
      size_t             content_len;
      char              *blob;
      size_t             blob_len;
  };

FlatPosting's "content" and "blob" members will be pointers into  
shared buffers.  In Java, they would be array offsets instead.  Unlike  
RawPosting objects, each FlatPosting is the same size -- no flexible  
array members.


FlatPostings will be pooled within PostingPool parent objects, which  
will own the shared buffers.  Because they are pooled and reused,  
allocation overhead is no longer crucial and FlatPostings can be  
ordinary objects.  When the shared buffer gets full, the PostingPool  
will flush a sorted run to disk.


Only PostingPool, PostingPoolQueue (the external sorter), and  
PostingsWriter will have access to the FlatPostings --- it's best to  
limit FlatPosting's visibility because managing access to the shared  
buffer is tricky.


PostingPool will be public.  Subclasses of Posting will contribute to  
the PostingPool via a PostPool_Add() method.


  void
  PostPool_add(PostingPool *self, u32_t doc_num,
               char *content, size_t content_len,
               char *blob, size_t blob_len);

Hmm, we'll need some way to track field as well.  There's a couple of  
ways of doing that...
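
One way to picture PostPool_Add in Java (a rough sketch with invented
names; the open question of how to track the field is left as a comment):
the pool copies the caller's bytes into its shared buffers and records the
offsets in a recycled FlatPosting.

/** Sketch of a pooled FlatPosting + PostingPool, a hypothetical Java
 *  analog of the C design above.  "content" and "blob" become offsets
 *  into the pool's shared buffer rather than pointers. */
class FlatPosting {
    int docNum;
    int contentStart, contentLen;   // offsets into PostingPool.buffer
    int blobStart, blobLen;
}

class PostingPool {
    private byte[] buffer = new byte[1 << 20];
    private int used = 0;
    private FlatPosting[] postings = new FlatPosting[1024];
    private int count = 0;

    /** Posting subclasses feed the pool through this one call. */
    void add(int docNum, byte[] content, int contentLen, byte[] blob, int blobLen) {
        if (used + contentLen + blobLen > buffer.length || count == postings.length) {
            flushSortedRun();   // spill a sorted run, then reuse the buffers
                                // (a real pool would also grow the buffer)
        }
        FlatPosting p = postings[count] != null ? postings[count] : new FlatPosting();
        postings[count++] = p;
        p.docNum = docNum;
        p.contentStart = used;
        p.contentLen = contentLen;
        System.arraycopy(content, 0, buffer, used, contentLen);
        used += contentLen;
        p.blobStart = used;
        p.blobLen = blobLen;
        System.arraycopy(blob, 0, buffer, used, blobLen);
        used += blobLen;
        // tracking which field this posting belongs to is the open question above
    }

    private void flushSortedRun() {
        /* sort postings[0..count) and write a run to temporary storage */
        count = 0;
        used = 0;
    }
}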


I wish RawPosting's design was a little less cryptic.  However, malloc()
and free(), used by the normal object allocation infrastructure in KS, are
comparatively expensive.  Using a custom allocation strategy speeds things
up, at the cost of some code clarity -- essentially the same tradeoff that
started this thread.


Makes sense.


Well, here's the thing.  RawPosting works in KS and it's fast, but  
it's a hack -- good enough for an implementation detail, but not good  
enough to expose as a public API. And like Lucene's DocumentsWriter,  
KinoSearch's PostingsWriter is large and difficult to penetrate.


What I'm hoping is that this discussion will yield some innovations  
that allow me to refactor this code for clarity and interface  
simplicity.  There's still going to be a lot of code, but hopefully it  
can be better broken up into more coherent modules.


I particularly want to see Posting as small as possible, because  
that's the one I'd like to be publicly subclassable.


FWIW Refactoring the external sorter has been pretty successful:

   KS::Util::SortExternal
   KS::Util::SortExRun
   KS::Test::Util::BBSortEx extends SortExternal
   KS::Test::Util::BBSortExRun extends SortExRun
   KS::Index::PostingPoolQueue extends SortExternal
   KS::Index::PostingPool extends SortExRun

SortExternal and SortExRun are abstract base classes.  BBSortEx and  
BBSortExRun, which deal with ByteBufs, are test-only subclasses that  
air out the merging code.  PostingPoolQueue and PostingPool deal with  
RawPosting objects.


PostingPoolQueue is nice and small now, and SortExternal and SortExRun  
are pretty coherent and well tested.  The only one left in that  
hierarchy that's kind of bloated and fragile is PostingPool.



I see.  So Posting exposes the doc_num so the generic code
(SegPList_next) can check if the doc is deleted when iterating.


Yes.  It would be nice to make it an accessor, but the PList_Next()  
method is inner loop code.



What is self->doc_base?

Re: Pooling of posting objects in DocumentsWriter

2008-04-13 Thread Michael McCandless
Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
>  On Apr 12, 2008, at 2:29 AM, Michael McCandless wrote:
>
>
> >
> > > The total allocation for any given RawPosting can be calculated like so:
> > >
> > >  offsetof(RawPosting, blob) + posting->content_len + posting->aux_len
> > >
> >
> > Probably you also need to add in bytes lost to alignment requirements,
> > when content_len + aux_len doesn't "align" to a 4/8 byte boundary?
> >
>
> Good thought, but "aux" isn't malloc'd and never gets dereferenced, just
> copied.  It's not even a real member -- it's just an artificially bounded
> slice of bytes within the single RawPosting allocation.
>
> In fact, RawPosting "objects" aren't even malloc'd individually -- they're
> assigned memory chunks by a custom MemoryPool object (which *does* have to
> maintain awareness of alignment issues).  You can't do this:

Right, it's that alignment waste that I'd think you need to "amortize"
onto the cost of each RawPosting.

>free(raw_posting);
>
> Instead, you do this, which frees all the memory previously doled out by
> the MemoryPool, in a single action which is much faster than lots of
> individual calls to free():
>
>MemPool_Release_All(mem_pool);  /* mass liberation of RawPostings */

OK got it.  Pools are nice when they "match" :)

> > But: why distinguish between Posting and RawPosting?
> >
>
> RawPosting can't be subclassed, because of that flexible array member.
> Flexible arrays, which are a lesser-known but reliable hack, have to be the
> last element in a C struct:
>
>   http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html

Alas we can't use such neat tricks in Javaland...

> I wish RawPosting's design was a little less cryptic.  However, malloc()
> and free(), used by the normal object allocation infrastructure in KS, are
> comparatively expensive.  Using a custom allocation strategy speeds things
> up, at the cost of some code clarity -- essentially the same tradeoff that
> started this thread.

Makes sense.

> > Who uses Posting but not RawPosting?
> >
>
> Posting is an abstract base class.  MatchPosting, ScorePosting, etc.
> subclass it.
>
> PostingList uses Posting directly.  Here's the function that implements
> SegPostingList's Next() method.
>
> u32_t
> SegPList_next(SegPostingList *self)
> {
>     InStream *const post_stream = self->post_stream;
>     DelDocs  *const deldocs     = self->deldocs;
>     Posting  *const posting     = self->posting;
>
>     while (1) {
>         /* Bail if we're out of docs. */
>         if (self->count >= self->doc_freq) {
>             Post_Reset(posting, self->doc_base);
>             return 0;
>         }
>         self->count++;
>
>         Post_Read_Record(posting, post_stream);
>
>         /* If the doc isn't deleted... success! */
>         if (deldocs == NULL) {
>             break;
>         }
>         else {
>             u32_t doc_minus_base = posting->doc_num - self->doc_base;
>             if ( !DelDocs_Get(deldocs, doc_minus_base) ) {
>                 break;
>             }
>         }
>     }
>
>     return posting->doc_num;
> }

I see.  So Posting exposes the doc_num so the generic code
(SegPList_next) can check if the doc is deleted when iterating.

What is self->doc_base?  In Lucene, each segment doesn't "know" its
doc base; MultiSegmentReader handles that "above" this layer.

> > Or, maybe it's because you do two passes when indexing a document?
> > First "invert" it, into array of Posting instances,
> >
>
> KS deals with Tokens grouped within arrays, rather than streams of Tokens a
> la Lucene.  Inverting is simply a matter of sorting the array, then grouping
> Tokens by common term text.  One RawPosting gets created for each unique
> term.

OK.  So, first a Document is translated into array of Tokens (one per
term occurrence).  Then this array gets sorted by term text, then term
position.  Then (2nd pass) you coalesce all adjacent Tokens that share
the same term text and write each into a RawPosting.  Then this array
of RawPostings is passed to the external sorter.
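
A compact sketch of that sort-then-coalesce pass (hypothetical names, and
Java for readability even though KS is C): sort the tokens, then walk each
run of equal term text and emit one per-term entry.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** One inverted token: term text plus its position in the document. */
class Token {
    final String text;
    final int position;
    Token(String text, int position) { this.text = text; this.position = position; }
}

class Inverter {
    /** Sort by text then position, then coalesce each run of equal text
     *  into a single per-term entry (freq + positions). */
    static List<TermOccurrences> invert(Token[] tokens, int docNum) {
        Arrays.sort(tokens, new Comparator<Token>() {
            public int compare(Token a, Token b) {
                int cmp = a.text.compareTo(b.text);
                return cmp != 0 ? cmp : (a.position - b.position);
            }
        });
        List<TermOccurrences> result = new ArrayList<TermOccurrences>();
        int i = 0;
        while (i < tokens.length) {
            int start = i;
            while (i < tokens.length && tokens[i].text.equals(tokens[start].text)) i++;
            TermOccurrences t = new TermOccurrences(tokens[start].text, docNum);
            for (int j = start; j < i; j++) t.positions.add(tokens[j].position);
            result.add(t);   // becomes one RawPosting in KS
        }
        return result;
    }
}

class TermOccurrences {
    final String text;
    final int docNum;
    final List<Integer> positions = new ArrayList<Integer>();
    TermOccurrences(String text, int docNum) { this.text = text; this.docNum = docNum; }
}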

In Lucene we take each token and do a hash lookup and then write the
necessary bytes immediately.  We then only sort, once, during flush.
(Except, term vectors still must sort every document's Tokens.)

> > > > OK I now see that what we call Posting really should be called
> > > > PostingList: each instance of this class, in DW, tracks all documents
> > > > that contained that term.
> > > >
> > > >
> > >
> > > I suppose that would be more accurate... though in KS, the "PostingList"
> > > class is the replacement for TermDocs, TermPositions, etc.
> > >
> >
> > I see.  Though that is really a search-time vs indexing-time
> > distinction, referencing the same thing (a "posting list").  Maybe
> > RawPostingList, to make it clearer that these are the raw bytes for
> > many docs produced during indexing?
> >
>
> That usage would fit within the current KS nomenclature, for whatever that's worth WRT Lucene. ;)

Re: Pooling of posting objects in DocumentsWriter

2008-04-12 Thread Marvin Humphrey


On Apr 12, 2008, at 2:29 AM, Michael McCandless wrote:

The total allocation for any given RawPosting can be calculated like so:

   offsetof(RawPosting, blob) + posting->content_len + posting->aux_len


Probably you also need to add in bytes lost to alignment requirements,
when content_len + aux_len doesn't "align" to a 4/8 byte boundary?


Good thought, but "aux" isn't malloc'd and never gets dereferenced,  
just copied.  It's not even a real member -- it's just an artificially  
bounded slice of bytes within the single RawPosting allocation.


In fact, RawPosting "objects" aren't even malloc'd individually --  
they're assigned memory chunks by a custom MemoryPool object (which  
*does* have to maintain awareness of alignment issues).  You can't do  
this:


   free(raw_posting);

Instead, you do this, which frees all the memory previously doled out  
by the MemoryPool, in a single action which is much faster than lots  
of individual calls to free():


   MemPool_Release_All(mem_pool);  /* mass liberation of RawPostings */


It sounds like external sorter deals with RawPosting and need not know
any details of the aux bytes being carried along.


Yes.


But: why distinguish between Posting and RawPosting?


RawPosting can't be subclassed, because of that flexible array  
member.  Flexible arrays, which are a lesser-known but reliable hack,  
have to be the last element in a C struct:


  http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html

I wish RawPosting's design was a little less cryptic.  However,  
malloc() and free(), used by the normal object allocation  
infrastructure in KS, are comparatively expensive.  Using a custom  
allocation strategy speeds things up, at the cost of some code clarity  
-- essentially the same tradeoff that started this thread.



Who uses Posting but not RawPosting?


Posting is an abstract base class.  MatchPosting, ScorePosting, etc.  
subclass it.


PostingList uses Posting directly.  Here's the function that  
implements SegPostingList's Next() method.


u32_t
SegPList_next(SegPostingList *self)
{
    InStream *const post_stream = self->post_stream;
    DelDocs  *const deldocs     = self->deldocs;
    Posting  *const posting     = self->posting;

    while (1) {
        /* Bail if we're out of docs. */
        if (self->count >= self->doc_freq) {
            Post_Reset(posting, self->doc_base);
            return 0;
        }
        self->count++;

        Post_Read_Record(posting, post_stream);

        /* If the doc isn't deleted... success! */
        if (deldocs == NULL) {
            break;
        }
        else {
            u32_t doc_minus_base = posting->doc_num - self->doc_base;
            if ( !DelDocs_Get(deldocs, doc_minus_base) ) {
                break;
            }
        }
    }

    return posting->doc_num;
}


Or, maybe it's because you do two passes when indexing a document?
First "invert" it, into array of Posting instances,


KS deals with Tokens grouped within arrays, rather than streams of  
Tokens a la Lucene.  Inverting is simply a matter of sorting the  
array, then grouping Tokens by common term text.  One RawPosting gets  
created for each unique term.



OK I now see that what we call Posting really should be called
PostingList: each instance of this class, in DW, tracks all documents
that contained that term.



I suppose that would be more accurate... though in KS, the "PostingList"
class is the replacement for TermDocs, TermPositions, etc.


I see.  Though that is really a search-time vs indexing-time
distinction, referencing the same thing (a "posting list").  Maybe
RawPostingList, to make it clearer that these are the raw bytes for
many docs produced during indexing?


That usage would fit within the current KS nomenclature, for whatever  
that's worth WRT Lucene. ;)



In Lucene, a Posting instance holds many documents.


I think renaming that class might improve accuracy a bit (I assume  
it's not public).  I've never seen "posting" defined as something  
which could hold multiple documents -- it's either one term indexing  
one document (the definition KS uses), or one instance of one term  
indexing one document.


The latter meaning seems to be a little more common, but it didn't map  
as well to the class hierarchy I wanted to define.



In Lucene we just flush a true segment when RAM is full.  For
early versions of LUCENE-843 I was flushing "intermediate" segments
(like KS's runs), but the cost of merging these runs wasn't that much
better than the cost of merging normal segments, and, it added
unexpected cost to close(), so I fell back to just flushing normal
segments.


Really, you gave up on that?  Then aren't you moving stored fields and  
term vectors around more often?



How do multiple threads interact here?  Does each one work with its
own ExternalSorter, and then all threads and all of their 

Re: Pooling of posting objects in DocumentsWriter

2008-04-12 Thread Michael McCandless
Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
>  On Apr 10, 2008, at 2:37 AM, Michael McCandless wrote:
>
> >
> > > IMO, the abstract base Posting class should not track text.  It should
> > > include only one datum: a document number.  This keeps it in line with the
> > > simplest IR definition for a "posting": one document matching one term.
> > >
> >
> > But how do you then write out a segment with the terms packed, in
> > sorted order?  Your "generic" layer needs to know how to sort these
> > Posting lists by term text, right?
> >
>
>  Yes.  But the generic layer is the semi-serialized export form RawPosting,
> not the abstract base class Posting.  (Would it be clearer if RawPosting was
> named "FlatPosting"?)
>
>   struct kino_Posting {
>       kino_VirtualTable *_;
>       kino_ref_t         ref;
>       chy_u32_t          doc_num;
>   };
>
>   struct kino_RawPosting {
>       kino_VirtualTable *_;
>       kino_ref_t         ref;
>       chy_u32_t          doc_num;
>       chy_u32_t          freq;
>       chy_u32_t          content_len;
>       chy_u32_t          aux_len;
>       char               blob[1];   /* flexible array */
>   };
>
> The last member of RawPosting, "blob", uses the C "flexible array" hack
> which takes advantage of C's lack of bounds checking.  A RawPosting consists
> of a single contiguous memory allocation, but the size of that allocation
> varies depending on how much information the object needs to carry.
>
> The "blob" is divided into two parts: content, and "aux".  Content is term
> text.  "aux" is encoded material which will be written to the postings file.
> The total allocation for any given RawPosting can be calculated like so:
>
>offsetof(RawPosting, blob) + posting->content_len + posting->aux_len

Probably you also need to add in bytes lost to alignment requirements,
when content_len + aux_len doesn't "align" to a 4/8 byte boundary?

> Sorting is done by comparing first the "content" component of blob using
> memcmp(), then breaking ties with content_len and finally, doc_num.
>
> The write method invoked by PostingsWriter is the same --
> RawPost_Write_Record -- no matter what Posting subclass is assigned to the
> field.  The first part of the implementing function RawPost_write_record may
> look familiar to you, as it uses a close variant of Lucene's algo for
> encoding doc_num and freq together.  Note the last line, though:
>
> void
> RawPost_write_record(RawPosting *self, OutStream *outstream,
>                      u32_t last_doc_num)
> {
>     const u32_t delta_doc = self->doc_num - last_doc_num;
>     char *const aux_content = self->blob + self->content_len;
>
>     /* Write doc num, freq. */
>     if (self->freq == 1) {
>         const u32_t doc_code = (delta_doc << 1) | 1;
>         OutStream_Write_C32(outstream, doc_code);
>     }
>     else {
>         const u32_t doc_code = delta_doc << 1;
>         OutStream_Write_C32(outstream, doc_code);
>         OutStream_Write_C32(outstream, self->freq);
>     }
>
>     /* Print prepared data. */
>     OutStream_Write_Bytes(outstream, aux_content, self->aux_len);
> }
>
> The idea is that information other than doc num and freq gets serialized
> during the export operation which creates the RawPosting -- then the last
> line of RawPost_write_record prints the pre-serialized data to the
> OutStream.

OK I think I understand the distinction.  It sounds like external
sorter deals with RawPosting and need not know any details of the aux
bytes being carried along.

But: why distinguish between Posting and RawPosting?  Who uses Posting
but not RawPosting?

Or, maybe it's because you do two passes when indexing a document?
First "invert" it, into array of Posting instances, then 2nd pass
across these Posting instances to coalesce multiple term occurrences
in one doc into RawPosting instances?

> > OK I now see that what we call Posting really should be called
> > PostingList: each instance of this class, in DW, tracks all documents
> > that contained that term.
> >
>
> I suppose that would be more accurate... though in KS, the "PostingList"
> class is the replacement for TermDocs, TermPositions, etc.

I see.  Though that is really a search-time vs indexing-time
distinction, referencing the same thing (a "posting list").  Maybe
RawPostingList, to make it clearer that these are the raw bytes for
many docs produced during indexing?

> > Whereas for KS, Posting is a single occurrence of term in a single doc, 
> > right?
> >
>
> It's one term matching one document.
>
> These are the definitions I'm using, as set out in
> KinoSearch::Docs::IRTheory at :
>
> * document - An atomic unit of retrieval.
> * term - An attribute which describes a document.
> * posting - One term indexing one document.
> * term list - The complete list of terms which describe a document.
> * posting list - The complete list of documents which a term indexes.
> * inverted index - A data structure which maps from terms to documents.

Re: Pooling of posting objects in DocumentsWriter

2008-04-10 Thread Marvin Humphrey


On Apr 10, 2008, at 2:37 AM, Michael McCandless wrote:
IMO, the abstract base Posting class should not track text.  It should
include only one datum: a document number.  This keeps it in line with the
simplest IR definition for a "posting": one document matching one term.


But how do you then write out a segment with the terms packed, in
sorted order?  Your "generic" layer needs to know how to sort these
Posting lists by term text, right?


Yes.  But the generic layer is the semi-serialized export form  
RawPosting, not the abstract base class Posting.  (Would it be clearer  
if RawPosting was named "FlatPosting"?)


  struct kino_Posting {
      kino_VirtualTable *_;
      kino_ref_t         ref;
      chy_u32_t          doc_num;
  };

  struct kino_RawPosting {
      kino_VirtualTable *_;
      kino_ref_t         ref;
      chy_u32_t          doc_num;
      chy_u32_t          freq;
      chy_u32_t          content_len;
      chy_u32_t          aux_len;
      char               blob[1];   /* flexible array */
  };

The last member of RawPosting, "blob", uses the C "flexible array"  
hack which takes advantage of C's lack of bounds checking.  A  
RawPosting consists of a single contiguous memory allocation, but the  
size of that allocation varies depending on how much information the  
object needs to carry.


The "blob" is divided into two parts: content, and "aux".  Content is  
term text.  "aux" is encoded material which will be written to the  
postings file.  The total allocation for any given RawPosting can be  
calculated like so:


   offsetof(RawPosting, blob) + posting->content_len + posting->aux_len

Sorting is done by comparing first the "content" component of blob  
using memcmp(), then breaking ties with content_len and finally,  
doc_num.
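
In Java that sort order would look roughly like this (assumed field names;
KS does the first comparison with memcmp over the blob):

import java.util.Comparator;

/** Assumed flattened-posting shape: term bytes at the front of blob. */
class RawPosting {
    byte[] blob;
    int contentLen;   // how many leading bytes of blob are term text
    int docNum;
}

class RawPostingOrder implements Comparator<RawPosting> {
    public int compare(RawPosting a, RawPosting b) {
        int len = Math.min(a.contentLen, b.contentLen);
        for (int i = 0; i < len; i++) {
            int diff = (a.blob[i] & 0xff) - (b.blob[i] & 0xff);  // unsigned, memcmp-style
            if (diff != 0) return diff;
        }
        if (a.contentLen != b.contentLen) return a.contentLen - b.contentLen;
        return a.docNum - b.docNum;   // final tie-break
    }
}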


The write method invoked by PostingsWriter is the same --  
RawPost_Write_Record -- no matter what Posting subclass is assigned to  
the field.  The first part of the implementing function  
RawPost_write_record may look familiar to you, as it uses a close  
variant of Lucene's algo for encoding doc_num and freq together.  Note  
the last line, though:


void
RawPost_write_record(RawPosting *self, OutStream *outstream,
                     u32_t last_doc_num)
{
    const u32_t delta_doc = self->doc_num - last_doc_num;
    char *const aux_content = self->blob + self->content_len;

    /* Write doc num, freq. */
    if (self->freq == 1) {
        const u32_t doc_code = (delta_doc << 1) | 1;
        OutStream_Write_C32(outstream, doc_code);
    }
    else {
        const u32_t doc_code = delta_doc << 1;
        OutStream_Write_C32(outstream, doc_code);
        OutStream_Write_C32(outstream, self->freq);
    }

    /* Print prepared data. */
    OutStream_Write_Bytes(outstream, aux_content, self->aux_len);
}

The idea is that information other than doc num and freq gets  
serialized during the export operation which creates the RawPosting --  
then the last line of RawPost_write_record prints the pre-serialized  
data to the OutStream.



OK I now see that what we call Posting really should be called
PostingList: each instance of this class, in DW, tracks all documents
that contained that term.


I suppose that would be more accurate... though in KS, the  
"PostingList" class is the replacement for TermDocs, TermPositions, etc.


Whereas for KS, Posting is a single occurrence of term in a single  
doc, right?


It's one term matching one document.

These are the definitions I'm using, as set out in  
KinoSearch::Docs::IRTheory at :


* document - An atomic unit of retrieval.
* term - An attribute which describes a document.
* posting - One term indexing one document.
* term list - The complete list of terms which describe a document.
* posting list - The complete list of documents which a term indexes.
* inverted index - A data structure which maps from terms to documents.



Does a Posting contain all
occurrences of the term in the doc (multiple positions) or only one?


In KS, a Posting object can represent multiple occurrences of a term  
within a single document.  That was somewhat arbitrary on my part.



How do you do buffering/flushing?


RawPostings are fed into an object which handles external sorting.   
The external sorter tracks RAM usage, and checks once per document  
whether it should flush a sorted run to temporary storage.


After the last document gets added, the external sorter flips to  
fetching mode.  The pre-sorted runs are merged into a sorted stream of  
RawPostings, and the postings and lexicon files get written out.
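
Stripped of every KS detail, the flow is roughly the following skeleton
(illustrative only, not the actual SortExternal/SortExRun code): buffer
items, spill sorted runs when memory is tight, then merge the runs back
into one sorted stream.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Minimal external-sort skeleton; real runs would live on disk. */
class ExternalSorter<T> {
    private final Comparator<T> order;
    private final int maxBuffered;
    private final List<T> buffer = new ArrayList<T>();
    private final List<List<T>> runs = new ArrayList<List<T>>();

    ExternalSorter(Comparator<T> order, int maxBuffered) {
        this.order = order;
        this.maxBuffered = maxBuffered;
    }

    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= maxBuffered) flushRun();   // KS checks once per document
    }

    private void flushRun() {
        List<T> run = new ArrayList<T>(buffer);
        run.sort(order);
        runs.add(run);   // stand-in for writing a sorted run to temp storage
        buffer.clear();
    }

    /** "Flip" to fetching mode: k-way merge of the sorted runs. */
    Iterator<T> fetch() {
        if (!buffer.isEmpty()) flushRun();
        PriorityQueue<Cursor<T>> pq = new PriorityQueue<Cursor<T>>(
            Math.max(1, runs.size()), (x, y) -> order.compare(x.head, y.head));
        for (List<T> run : runs) {
            Iterator<T> it = run.iterator();
            if (it.hasNext()) pq.add(new Cursor<T>(it.next(), it));
        }
        List<T> merged = new ArrayList<T>();
        while (!pq.isEmpty()) {
            Cursor<T> top = pq.poll();
            merged.add(top.head);
            if (top.rest.hasNext()) { top.head = top.rest.next(); pq.add(top); }
        }
        return merged.iterator();
    }

    private static class Cursor<T> {
        T head;
        final Iterator<T> rest;
        Cursor(T head, Iterator<T> rest) { this.head = head; this.rest = rest; }
    }
}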


Then, for search-time you have a PostingList class which takes the place of
TermDocs/TermPositions, and uses an underlying Posting object to read the
file.  (PostingList and its subclasses don't know anything about file
formats.)


Wouldn't PostingList need to know something of the file format?  EG
maybe it's a sparse format (docID or gap encoded each time), or, it's
non-sparse (like norms, column-stride fields).

Re: Flexible indexing design (was Re: Pooling of posting objects in DocumentsWriter)

2008-04-10 Thread Michael McCandless
Michael Busch <[EMAIL PROTECTED]> wrote:

> > I agree we would have an abstract base Posting class that just tracks
> > the term text.
> >
> > Then, DocumentsWriter manages inverting each field, maintaining the
> > per-field hash of term Text -> abstract Posting instances, exposing
> > the methods to write bytes into multiple streams for a Posting in the
> > RAM "byte slices", and then read them back when flushing, etc.
> >
> > And then the code that writes the current index format would plug into
> > this and should be fairly small and easy to understand.  For example,
> > frq/prx postings and term vectors writing would be two plugins to the
> > "inverted terms" API; it's just that term vectors flush after every
> > document and frq/prx flush when RAM is full.
> >
> >
>
>  I think this makes sense. We also need to come up with a good solution for
> the dictionary, because a term with frq/prx postings needs to store two (or
> three for skiplist) file pointers in the dictionary, whereas e. g. a
> "binary" posting list only needs one pointer.

Right.  I had been thinking at a minimum we allow "flexibility" by
storing N offsets instead of hardwiring frq and prx offsets alone.  N
is 2 now (frq and prx), but could change eg if we put skip into a
separate file like KS does then N = 3.  If you don't store positions
then N drops back to 2, etc.  This would at least be a start.
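
Concretely, the term-dictionary entry would carry an array of file
pointers rather than hardwired freq/prox fields; a rough sketch
(illustrative only, not a proposed on-disk format):

/** Term-dictionary entry with N file pointers instead of fixed frq/prx slots. */
class TermDictEntry {
    final int docFreq;
    final long[] filePointers;   // e.g. {frqFP, prxFP} today, {frqFP, prxFP, skipFP}
                                 // with a separate skip file, or fewer entries when
                                 // a field stores no positions

    TermDictEntry(int docFreq, long[] filePointers) {
        this.docFreq = docFreq;
        this.filePointers = filePointers;
    }
}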

> > Then there would also be plugins that just tap into the entire
> > document (don't need inversion), like FieldsWriter.
> >
> > There are still alot of details to work out...
> >
>
>  Definitely. For example, we should think about the Field APIs. Since we
> don't have global field semantics in Lucene I wonder how to handle conflict
> cases, e. g. when a document specifies a different posting list format than
> a previous one for the same field. The easiest way would be to not allow it
> and throw an exception. But this is kind of against Lucene's way of dealing
> with fields currently. But I'm scared of the complicated code to handle
> conflicts of all the possible combinations of posting list formats.
> KinoSearch doesn't have to worry about this, because it has a static schema
> (I think?), but isn't as flexible as Lucene.

Yes, assuming we keep this flexibility, then it's up to each plugin to
"deal" with this 1) when writing docs and 2) when merging segments.

We are going to have to make the FieldInfo API generic, somehow, so
that plugins can record interesting details into the FieldInfo.  EG
the addition of payloads required adding a "storePayloads" boolean
into FieldInfo.  Likewise, in LUCENE-1231 you need to record into
FieldInfo whether the fixed or variable length encoding is in use.

So we need extensibility of FieldInfo too: multiple plugins need to
store stuff.
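
One possible shape for that (assumed API, not Lucene's current FieldInfo):
each plugin records its own attributes under its own keys, so new posting
formats don't keep adding hardwired booleans.

import java.util.HashMap;
import java.util.Map;

/** Sketch of a plugin-extensible FieldInfo. */
class FieldInfo {
    final String name;
    final int number;
    private final Map<String, String> attributes = new HashMap<String, String>();

    FieldInfo(String name, int number) { this.name = name; this.number = number; }

    /** e.g. a payloads plugin sets ("storePayloads", "true"); a LUCENE-1231-style
     *  plugin sets ("encoding", "fixed") or ("encoding", "variable"). */
    void putAttribute(String key, String value) { attributes.put(key, value); }
    String getAttribute(String key) { return attributes.get(key); }
}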

> > > The DocumentsWriter does pooling of the Posting instances and I'm
> wondering how much this improves performance.
> > >
> >
> > We should retest this.  I think it was a decent difference in
> > performance but I don't remember how much.  I think the pooling can
> > also be made generic (handled by DocumentsWriter).  EG the plugin
> > could expose a "newPosting()"  method.
> >
> >
>
>  Yeah, but for code simplicity let's really figure out first how much
> pooling helps at all.

OK I will test this at some point.

Mike




Re: Pooling of posting objects in DocumentsWriter

2008-04-10 Thread Michael McCandless
Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
>  On Apr 8, 2008, at 10:25 AM, Michael McCandless wrote:
>
> > I've actually been working on factoring DocumentsWriter, as a first
> > step towards flexible indexing.
> >
>
>  The way I handled this in KS was to turn Posting into a class akin to
> TermBuffer: the individual Posting object persists, but its values change.
>
>  Meanwhile, each Posting subclass has a Read_Raw method which generates a
> "RawPosting".  RawPosting objects are a serialized, sortable, lowest common
> denominator form of Posting which every subclass must be able to export.
> They're allocated from a specialized MemoryPool, making them cheap to
> manufacture and to release.
>
>  RawPosting is the only form PostingsWriter is actually required to know
> about:
>
>    // PostingsWriter loop:
>    while ((RawPosting rawPosting = rawPostingQueue.pop()) != null) {
>       writeRawPosting(rawPosting);
>    }
>
> > I agree we would have an abstract base Posting class that just tracks
> > the term text.
> >
>
>  IMO, the abstract base Posting class should not track text.  It should
> include only one datum: a document number.  This keeps it in line with the
> simplest IR definition for a "posting": one document matching one term.

But how do you then write out a segment with the terms packed, in
sorted order?  Your "generic" layer needs to know how to sort these
Posting lists by term text, right?

>    Posting:        doc num (abstract)
>    MatchPosting:   doc num
>    ScorePosting:   doc num, freq, per-doc boost, positions
>    RichPosting:    doc num, freq, positions with per-position boost
>    PayloadPosting: doc num, payload

OK I now see that what we call Posting really should be called
PostingList: each instance of this class, in DW, tracks all documents
that contained that term.  Whereas for KS, Posting is a single
occurrence of term in a single doc, right?  Does a Posting contain all
occurrences of the term in the doc (multiple positions) or only one?

How do you do buffering/flushing?  After each document do you re-sweep
your Posting instances and write them into a single segment?  Or do
accumulate many of these Posting instances (so many docs are held in
this form) and when RAM is full you flush to disk?

>  Then, for search-time you have a PostingList class which takes the place of
> TermDocs/TermPositions, and uses an underlying Posting object to read the
> file.  (PostingList and its subclasses don't know anything about file
> formats.)

Wouldn't PostingList need to know something of the file format?  EG
maybe it's a sparse format (docID or gap encoded each time), or, it's
non-sparse (like norms, column-stride fields).

>  Each Posting subclass is associated with a subclass of TermScorer which
> implements its own Posting-subclass-specific scoring algorithm.
>
>    // MatchPostingScorer scoring algo ...
>    while (postingList.next()) {
>       MatchPosting posting = postingList.getPosting();
>       collector.collect(posting.getDocNum(), 1.0);
>    }
>
>    // ScorePostingScorer scoring algo...
>    while (postingList.next()) {
>       ScorePosting posting = (ScorePosting)postingList.getPosting();
>       int freq = posting.getFreq();
>       float score = freq < TERMSCORER_SCORE_CACHE_SIZE
>           ? scoreCache[freq]              // cache hit
>           : sim.tf(freq) * weightValue;
>       collector.collect(posting.getDocNum(), score);
>    }
>
>
> > And then the code that writes the current index format would plug into
> > this and should be fairly small and easy to understand.
> >
>
>  I'm pessimistic that anything that writes the current index format
> could be "easy to understand", because the spec is so dreadfully convoluted.

I'm quite a bit more optimistic here.

>  As I have argued before, the key is to have each Posting subclass wholly
> define a file format.  That makes them pluggable, breaking the tight binding
> between the Lucene codebase and the Lucene file format spec.

It's not just Posting that defines the file format.  Things like
stored fields, norms, column-stride fields, have nothing to do with
inversion.  So these writers/readers should "plug in" at a layer above
the inversion?  OK, I see these below:

> > Then there would also be plugins that just tap into the entire
> > document (don't need inversion), like FieldsWriter.
> >
>
>
>  Yes.  Here's how things are set up in KS:
>
>    InvIndexer
>       SegWriter
>          DocWriter
>          PostingsWriter
>             LexWriter
>          TermVectorsWriter
>          // plug in more writers here?
>
>  Ideally, all of the writers under SegWriter would be subclasses of an
> abstract SegDataWriter class, and would implement addInversion() and
> addSegment(). SegWriter.addDoc() would look something like this:
>
>    addDoc(Document doc) {
>       Inversion inversion = invert(doc);
>       for (int i = 0; i < writers.size; i++) {
>          writers[i].addInversion(inversion);
>       }
>    }

Flexible indexing design (was Re: Pooling of posting objects in DocumentsWriter)

2008-04-09 Thread Michael Busch

Thanks for your quick answers.

Michael McCandless wrote:

Hi Michael,

I've actually been working on factoring DocumentsWriter, as a first
step towards flexible indexing.



Cool, yeah separating the DocumentsWriter into multiple classes 
certainly helped understanding the complex code better.



I agree we would have an abstract base Posting class that just tracks
the term text.

Then, DocumentsWriter manages inverting each field, maintaining the
per-field hash of term Text -> abstract Posting instances, exposing
the methods to write bytes into multiple streams for a Posting in the
RAM "byte slices", and then read them back when flushing, etc.

And then the code that writes the current index format would plug into
this and should be fairly small and easy to understand.  For example,
frq/prx postings and term vectors writing would be two plugins to the
"inverted terms" API; it's just that term vectors flush after every
document and frq/prx flush when RAM is full.



I think this makes sense. We also need to come up with a good solution 
for the dictionary, because a term with frq/prx postings needs to store 
two (or three for skiplist) file pointers in the dictionary, whereas
e. g. a "binary" posting list only needs one pointer.



Then there would also be plugins that just tap into the entire
document (don't need inversion), like FieldsWriter.

There are still alot of details to work out...


Definitely. For example, we should think about the Field APIs. Since we 
don't have global field semantics in Lucene I wonder how to handle 
conflict cases, e. g. when a document specifies a different posting list 
format than a previous one for the same field. The easiest way would be 
to not allow it and throw an exception. But this is kind of against 
Lucene's way of dealing with fields currently. But I'm scared of the 
complicated code to handle conflicts of all the possible combinations of 
posting list formats. KinoSearch doesn't have to worry about this, 
because it has a static schema (I think?), but isn't as flexible as Lucene.




The DocumentsWriter does pooling of the Posting instances and I'm 
wondering how much this improves performance.


We should retest this.  I think it was a decent difference in
performance but I don't remember how much.  I think the pooling can
also be made generic (handled by DocumentsWriter).  EG the plugin
could expose a "newPosting()"  method.



Yeah, but for code simplicity let's really figure out first how much 
pooling helps at all.



Mike




Re: Pooling of posting objects in DocumentsWriter

2008-04-08 Thread Marvin Humphrey


On Apr 8, 2008, at 10:25 AM, Michael McCandless wrote:

I've actually been working on factoring DocumentsWriter, as a first
step towards flexible indexing.


The way I handled this in KS was to turn Posting into a class akin to  
TermBuffer: the individual Posting object persists, but its values  
change.


Meanwhile, each Posting subclass has a Read_Raw method which generates  
a "RawPosting".  RawPosting objects are a serialized, sortable, lowest  
common denominator form of Posting which every subclass must be able  
to export. They're allocated from a specialized MemoryPool, making  
them cheap to manufacture and to release.


RawPosting is the only form PostingsWriter is actually required to  
know about:


   // PostingsWriter loop:
   while ((RawPosting rawPosting = rawPostingQueue.pop()) != null) {
      writeRawPosting(rawPosting);
   }


I agree we would have an abstract base Posting class that just tracks
the term text.


IMO, the abstract base Posting class should not track text.  It should  
include only one datum: a document number.  This keeps it in line with  
the simplest IR definition for a "posting": one document matching one  
term.


   Posting:        doc num (abstract)
   MatchPosting:   doc num
   ScorePosting:   doc num, freq, per-doc boost, positions
   RichPosting:    doc num, freq, positions with per-position boost
   PayloadPosting: doc num, payload
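
The same hierarchy rendered as bare Java classes (fields only, purely
illustrative):

abstract class Posting {
    int docNum;                    // the one datum every posting carries
}

class MatchPosting extends Posting {
    // nothing extra: doc num only
}

class ScorePosting extends Posting {
    int freq;
    float docBoost;
    int[] positions;
}

class RichPosting extends Posting {
    int freq;
    int[] positions;
    float[] positionBoosts;        // per-position boost
}

class PayloadPosting extends Posting {
    byte[] payload;
}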

Then, for search-time you have a PostingList class which takes the  
place of TermDocs/TermPositions, and uses an underlying Posting object  
to read the file.  (PostingList and its subclasses don't know anything  
about file formats.)


Each Posting subclass is associated with a subclass of TermScorer  
which implements its own Posting-subclass-specific scoring algorithm.


   // MatchPostingScorer scoring algo ...
   while (postingList.next()) {
      MatchPosting posting = postingList.getPosting();
      collector.collect(posting.getDocNum(), 1.0);
   }

   // ScorePostingScorer scoring algo...
   while (postingList.next()) {
      ScorePosting posting = (ScorePosting)postingList.getPosting();
      int freq = posting.getFreq();
      float score = freq < TERMSCORER_SCORE_CACHE_SIZE
          ? scoreCache[freq]              // cache hit
          : sim.tf(freq) * weightValue;
      collector.collect(posting.getDocNum(), score);
   }


And then the code that writes the current index format would plug into
this and should be fairly small and easy to understand.


I'm pessimistic that anything that writes the current index format could
be "easy to understand", because the spec is so dreadfully convoluted.


As I have argued before, the key is to have each Posting subclass  
wholly define a file format.  That makes them pluggable, breaking the  
tight binding between the Lucene codebase and the Lucene file format  
spec.



Then there would also be plugins that just tap into the entire
document (don't need inversion), like FieldsWriter.



Yes.  Here's how things are set up in KS:

   InvIndexer
      SegWriter
         DocWriter
         PostingsWriter
            LexWriter
         TermVectorsWriter
         // plug in more writers here?

Ideally, all of the writers under SegWriter would be subclasses of an  
abstract SegDataWriter class, and would implement addInversion() and  
addSegment(). SegWriter.addDoc() would look something like this:


   addDoc(Document doc) {
      Inversion inversion = invert(doc);
      for (int i = 0; i < writers.size; i++) {
         writers[i].addInversion(inversion);
      }
   }

In practice, three of the writers are required (one for term  
dictionary/lexicon, one for postings, and one for some form of  
document storage), but the design allows for plugging in additional  
SegDataWriter subclasses.
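
A minimal Java rendering of that plugin arrangement (hypothetical names;
KS itself is C/Perl): SegWriter fans each inverted document out to every
registered writer, and a plugin only has to implement the two hooks.

import java.util.ArrayList;
import java.util.List;

abstract class SegDataWriter {
    abstract void addInversion(Inversion inversion);   // per-document hook
    abstract void addSegment(SegmentInfo info);        // merge-an-existing-segment hook
}

class SegWriter {
    private final List<SegDataWriter> writers = new ArrayList<SegDataWriter>();

    void register(SegDataWriter writer) { writers.add(writer); }

    void addDoc(Document doc) {
        Inversion inversion = invert(doc);
        for (SegDataWriter w : writers) {
            w.addInversion(inversion);
        }
    }

    private Inversion invert(Document doc) {
        /* tokenize, sort, coalesce */
        return new Inversion();
    }
}

class Inversion { }
class SegmentInfo { }
class Document { }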


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Pooling of posting objects in DocumentsWriter

2008-04-08 Thread Michael McCandless

Hi Michael,

I've actually been working on factoring DocumentsWriter, as a first
step towards flexible indexing.

I agree we would have an abstract base Posting class that just tracks
the term text.

Then, DocumentsWriter manages inverting each field, maintaining the
per-field hash of term Text -> abstract Posting instances, exposing
the methods to write bytes into multiple streams for a Posting in the
RAM "byte slices", and then read them back when flushing, etc.

And then the code that writes the current index format would plug into
this and should be fairly small and easy to understand.  For example,
frq/prx postings and term vectors writing would be two plugins to the
"inverted terms" API; it's just that term vectors flush after every
document and frq/prx flush when RAM is full.

Then there would also be plugins that just tap into the entire
document (don't need inversion), like FieldsWriter.

There are still alot of details to work out...

The DocumentsWriter does pooling of the Posting instances and I'm  
wondering how much this improves performance.


We should retest this.  I think it was a decent difference in
performance but I don't remember how much.  I think the pooling can
also be made generic (handled by DocumentsWriter).  EG the plugin
could expose a "newPosting()"  method.
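
One way the generic pooling + newPosting() idea could be shaped (assumed
API, just to make the suggestion concrete): the plugin only supplies the
factory, and DocumentsWriter owns the free list and recycling.

import java.util.ArrayDeque;
import java.util.Deque;

abstract class Posting {
    abstract void clear();                 // reset state before reuse
}

/** The indexing-chain plugin only says how to create a posting. */
interface PostingPlugin {
    Posting newPosting();
}

/** DocumentsWriter-side pooling, generic over the plugin's Posting type. */
class PostingAllocator {
    private final PostingPlugin plugin;
    private final Deque<Posting> free = new ArrayDeque<Posting>();

    PostingAllocator(PostingPlugin plugin) { this.plugin = plugin; }

    Posting obtain() {
        Posting p = free.poll();
        return p != null ? p : plugin.newPosting();
    }

    void recycle(Posting p) {
        p.clear();
        free.push(p);
    }
}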

Mike




Pooling of posting objects in DocumentsWriter

2008-04-08 Thread Michael Busch

Hi,

this is most likely a question for Mike. I'm trying to figure out what 
changes we need to make in order to support flexible indexing and 
LUCENE-1231. Currently I'm looking into the DocumentsWriter.


If we want to support different posting lists, then we probably want to 
change the Posting class to be an abstract base class and have different 
subclasses that implement the different posting formats.
The DocumentsWriter does pooling of the Posting instances and I'm 
wondering how much this improves performance. It will be a bit harder to 
do pooling with different Posting implementations. Probably we would 
need a Map with one entry for each Posting subclass 
used?
I wonder if that's worth it, because I was thinking that pooling of 
small objects in modern JVMs is not really more efficient anymore?
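
To make the "Map with one entry for each Posting subclass" idea concrete,
a minimal sketch (invented names, and only worth the complexity if
benchmarking shows pooling still pays off):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

abstract class Posting {
    void reset() {}                        // clear state before reuse
}

/** Free lists keyed by concrete Posting subclass. */
class PostingPoolByClass {
    private final Map<Class<? extends Posting>, Deque<Posting>> free =
        new HashMap<Class<? extends Posting>, Deque<Posting>>();

    @SuppressWarnings("unchecked")
    <P extends Posting> P obtain(Class<P> clazz) throws Exception {
        Deque<Posting> stack = free.get(clazz);
        if (stack != null && !stack.isEmpty()) {
            return (P) stack.pop();
        }
        return clazz.getDeclaredConstructor().newInstance();   // fall back to allocation
    }

    void recycle(Posting p) {
        p.reset();
        Deque<Posting> stack = free.get(p.getClass());
        if (stack == null) {
            stack = new ArrayDeque<Posting>();
            free.put(p.getClass(), stack);
        }
        stack.push(p);
    }
}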


-Michael
