Re: Flexible indexing design (was Re: Pooling of posting objects in DocumentsWriter)

2008-04-10 Thread Michael McCandless
Michael Busch [EMAIL PROTECTED] wrote:

  I agree we would have an abstract base Posting class that just tracks
  the term text.
 
  Then, DocumentsWriter manages inverting each field, maintaining the
  per-field hash of term text -> abstract Posting instances, exposing
  the methods to write bytes into multiple streams for a Posting in the
  RAM byte slices, and then read them back when flushing, etc.
 
  And then the code that writes the current index format would plug into
  this and should be fairly small and easy to understand.  For example,
  frq/prx postings and term vectors writing would be two plugins to the
  inverted terms API; it's just that term vectors flush after every
  document and frq/prx flush when RAM is full.
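
  A minimal sketch of how such a plugin API might look, assuming
  hypothetical names (InvertedTermsPlugin, FieldInverter, CountingPlugin)
  that are not part of Lucene. It only counts occurrences instead of
  writing bytes into RAM slices, and it uses a sorted map rather than a
  hash so the flush order is predictable:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// All names below are hypothetical illustrations, not Lucene's real API.
abstract class Posting {
    final String termText; // the base class tracks only the term text
    Posting(String termText) { this.termText = termText; }
}

// A plugin to the inverted terms API.
interface InvertedTermsPlugin {
    Posting newPosting(String termText);
    void addOccurrence(Posting posting, int docId, int position);
    boolean flushesPerDocument(); // true for term-vector-like plugins
    void flush(Map<String, Posting> postingsByTerm);
}

class CountingPosting extends Posting {
    int occurrences;
    CountingPosting(String termText) { super(termText); }
}

// Toy plugin: records "term:count" strings when flushed.
class CountingPlugin implements InvertedTermsPlugin {
    private final boolean perDocument;
    final List<String> flushed = new ArrayList<>();

    CountingPlugin(boolean perDocument) { this.perDocument = perDocument; }

    public Posting newPosting(String termText) { return new CountingPosting(termText); }

    public void addOccurrence(Posting posting, int docId, int position) {
        ((CountingPosting) posting).occurrences++;
    }

    public boolean flushesPerDocument() { return perDocument; }

    public void flush(Map<String, Posting> postingsByTerm) {
        for (Posting p : postingsByTerm.values()) {
            flushed.add(p.termText + ":" + ((CountingPosting) p).occurrences);
        }
    }
}

// Stands in for the part of DocumentsWriter that inverts one field.
class FieldInverter {
    private final InvertedTermsPlugin plugin;
    // Sorted map (instead of a hash) only to keep flush order deterministic.
    private final Map<String, Posting> postingsByTerm = new TreeMap<>();

    FieldInverter(InvertedTermsPlugin plugin) { this.plugin = plugin; }

    void addDocument(int docId, String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            Posting p = postingsByTerm.computeIfAbsent(tokens[pos], plugin::newPosting);
            plugin.addOccurrence(p, docId, pos);
        }
        if (plugin.flushesPerDocument()) {
            flush();
        }
    }

    // Called per document, or by the owner when RAM is full.
    void flush() {
        plugin.flush(postingsByTerm);
        postingsByTerm.clear();
    }
}
```

  The only difference between the two kinds of plugin in this sketch is
  the flushesPerDocument() answer; everything else is shared.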
 
 

  I think this makes sense. We also need to come up with a good solution for
 the dictionary, because a term with frq/prx postings needs to store two (or
 three for skiplist) file pointers in the dictionary, whereas e.g. a
 binary posting list only needs one pointer.

Right.  I had been thinking that, at a minimum, we allow flexibility by
storing N offsets instead of hardwiring the frq and prx offsets alone.  N
is 2 now (frq and prx), but could change: e.g. if we put skip data into a
separate file like KS (KinoSearch) does, then N = 3.  If you then don't
store positions, N drops back to 2, etc.  This would at least be a start.
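
As a rough sketch of the N-offsets idea, assuming a hypothetical
TermDictEntry class (nothing like it exists in Lucene under that name),
a dictionary entry could carry an array of file pointers, one per
postings file, instead of fixed frq/prx fields:

```java
// Hypothetical illustration only; not an existing Lucene class.
final class TermDictEntry {
    final String termText;
    // One offset per postings file the format uses for this term,
    // e.g. {frq, prx}, {frq, prx, skip}, or a single pointer.
    private final long[] filePointers;

    TermDictEntry(String termText, long... filePointers) {
        if (filePointers.length == 0) {
            throw new IllegalArgumentException("at least one file pointer is required");
        }
        this.termText = termText;
        this.filePointers = filePointers.clone(); // defensive copy
    }

    // N: how many offsets this entry stores.
    int pointerCount() { return filePointers.length; }

    // Offset into the stream-th postings file.
    long pointer(int stream) { return filePointers[stream]; }
}
```

A frq/prx term would then be built with two offsets, a format with a
separate skip file with three, and a binary posting list with one.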

  Then there would also be plugins that just tap into the entire
  document (don't need inversion), like FieldsWriter.
 
  There are still a lot of details to work out...
 

  Definitely. For example, we should think about the Field APIs. Since we
 don't have global field semantics in Lucene I wonder how to handle conflict
 cases, e.g. when a document specifies a different posting list format than
 a previous one for the same field. The easiest way would be to not allow it
 and throw an exception. But this is kind of against Lucene's way of dealing
 with fields currently. But I'm scared of the complicated code to handle
 conflicts of all the possible combinations of posting list formats.
 KinoSearch doesn't have to worry about this, because it has a static schema
 (I think?), but isn't as flexible as Lucene.

Yes, assuming we keep this flexibility, then it's up to each plugin to
deal with this 1) when writing docs and 2) when merging segments.

We are going to have to make the FieldInfo API generic, somehow, so
that plugins can record interesting details into the FieldInfo.  EG
the addition of payloads required adding a storePayloads boolean
into FieldInfo.  Likewise, in LUCENE-1231 you need to record into
FieldInfo whether the fixed or variable length encoding is in use.

So we need extensibility of FieldInfo too: multiple plugins need to
store stuff.
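
One possible shape for this, purely as a sketch: a per-field attribute
map keyed by plugin name.  ExtensibleFieldInfo, putAttribute and
getAttribute are hypothetical names, not the existing FieldInfo API.
putAttribute returns the previous value so that a plugin can notice a
conflict and decide for itself what to do:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Lucene's FieldInfo.
final class ExtensibleFieldInfo {
    final String name;
    // Keys are namespaced as "<plugin>.<key>" so plugins cannot collide.
    private final Map<String, String> attributes = new HashMap<>();

    ExtensibleFieldInfo(String name) { this.name = name; }

    // Stores a value and returns the previous one (null if none), leaving
    // the conflict policy (ignore, upgrade, throw) to the calling plugin.
    String putAttribute(String plugin, String key, String value) {
        return attributes.put(plugin + "." + key, value);
    }

    // Returns the stored value, or null if the plugin never set this key.
    String getAttribute(String plugin, String key) {
        return attributes.get(plugin + "." + key);
    }
}
```

With something like this, storePayloads would become one attribute of a
payloads plugin rather than a hardwired boolean.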

   The DocumentsWriter does pooling of the Posting instances and I'm
 wondering how much this improves performance.
  
 
  We should retest this.  I think it was a decent difference in
  performance but I don't remember how much.  I think the pooling can
  also be made generic (handled by DocumentsWriter).  EG the plugin
  could expose a newPosting()  method.
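
  A sketch of what generic pooling could look like, assuming a
  hypothetical PostingPool class that the writer owns and that is handed
  the plugin's newPosting() as a factory.  It says nothing about whether
  pooling is actually worth it; that still needs measuring:

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Hypothetical illustration only; not an existing Lucene class.
final class PostingPool<P> {
    private final ArrayDeque<P> free = new ArrayDeque<>();
    private final Supplier<P> factory; // e.g. the plugin's newPosting()
    int allocations;                   // how many instances were really created

    PostingPool(Supplier<P> factory) { this.factory = factory; }

    // Reuses a recycled instance if one is available, else allocates.
    P acquire() {
        P posting = free.poll();
        if (posting == null) {
            allocations++;
            posting = factory.get();
        }
        return posting;
    }

    // Returns an instance to the pool; the caller must reset its state
    // before reusing it.
    void release(P posting) { free.push(posting); }
}
```

  Comparing the allocations counter with and without release() calls
  would be one cheap way to see how much reuse a real indexing run gets.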
 
 

  Yeah, but for code simplicity let's really figure out first how much
 pooling helps at all.

OK I will test this at some point.

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Flexible indexing design (was Re: Pooling of posting objects in DocumentsWriter)

2008-04-09 Thread Michael Busch

Thanks for your quick answers.

Michael McCandless wrote:

Hi Michael,

I've actually been working on factoring DocumentsWriter, as a first
step towards flexible indexing.



Cool, yeah, separating the DocumentsWriter into multiple classes 
certainly made the complex code easier to understand.



I agree we would have an abstract base Posting class that just tracks
the term text.

Then, DocumentsWriter manages inverting each field, maintaining the
per-field hash of term text -> abstract Posting instances, exposing
the methods to write bytes into multiple streams for a Posting in the
RAM byte slices, and then read them back when flushing, etc.

And then the code that writes the current index format would plug into
this and should be fairly small and easy to understand.  For example,
frq/prx postings and term vectors writing would be two plugins to the
inverted terms API; it's just that term vectors flush after every
document and frq/prx flush when RAM is full.



I think this makes sense. We also need to come up with a good solution 
for the dictionary, because a term with frq/prx postings needs to store 
two (or three for skiplist) file pointers in the dictionary, whereas 
e.g. a binary posting list only needs one pointer.



Then there would also be plugins that just tap into the entire
document (don't need inversion), like FieldsWriter.

There are still a lot of details to work out...


Definitely. For example, we should think about the Field APIs. Since we 
don't have global field semantics in Lucene I wonder how to handle 
conflict cases, e.g. when a document specifies a different posting list 
format than a previous one for the same field. The easiest way would be 
to not allow it and throw an exception. But this is kind of against 
Lucene's way of dealing with fields currently. But I'm scared of the 
complicated code to handle conflicts of all the possible combinations of 
posting list formats. KinoSearch doesn't have to worry about this, 
because it has a static schema (I think?), but isn't as flexible as Lucene.




The DocumentsWriter does pooling of the Posting instances and I'm 
wondering how much this improves performance.


We should retest this.  I think it was a decent difference in
performance but I don't remember how much.  I think the pooling can
also be made generic (handled by DocumentsWriter).  EG the plugin
could expose a newPosting()  method.



Yeah, but for code simplicity let's really figure out first how much 
pooling helps at all.



Mike
