Flexible indexing design (was Re: Pooling of posting objects in DocumentsWriter)

Michael Busch Wed, 09 Apr 2008 06:37:58 -0700

Thanks for your quick answers.

Michael McCandless wrote:

Hi Michael,


I've actually been working on factoring DocumentsWriter, as a first
step towards flexible indexing.

Cool. Yeah, separating the DocumentsWriter into multiple classes
certainly made the complex code easier to understand.

I agree we would have an abstract base Posting class that just tracks
the term text.

Then, DocumentsWriter manages inverting each field, maintaining the
per-field hash of term text -> abstract Posting instances, exposing
the methods to write bytes into multiple streams for a Posting in the
RAM "byte slices", and then read them back when flushing, etc.

And then the code that writes the current index format would plug into
this and should be fairly small and easy to understand.  For example,
frq/prx postings and term vectors writing would be two plugins to the
"inverted terms" API; it's just that term vectors flush after every
document and frq/prx flush when RAM is full.
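
To make the proposal above concrete, here is a minimal sketch of how an
abstract Posting and an "inverted terms" plugin could fit together. All
names (InvertedTermsConsumer, FieldInverter, FreqPosting,
flushPerDocument) are hypothetical and are not existing Lucene classes,
and the RAM byte slices are replaced by plain Java collections to keep
the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical base class: tracks only the term text.
abstract class Posting {
    final String termText;
    Posting(String termText) { this.termText = termText; }
}

// Hypothetical plugin interface for the "inverted terms" API.
interface InvertedTermsConsumer<P extends Posting> {
    P newPosting(String termText);
    void addOccurrence(P posting, int docId, int position);
    // true: flush after every document (term vectors);
    // false: flush only when the framework decides RAM is full (frq/prx).
    boolean flushPerDocument();
    void flush(Map<String, P> postings);
}

// Example posting for a frq/prx-style plugin (illustrative only).
final class FreqPosting extends Posting {
    int docFreq = 0;
    int lastDocId = -1;
    final List<Integer> positions = new ArrayList<>();
    FreqPosting(String termText) { super(termText); }
}

final class FreqProxConsumer implements InvertedTermsConsumer<FreqPosting> {
    int flushCount = 0;
    int flushedTerms = 0;

    public FreqPosting newPosting(String termText) { return new FreqPosting(termText); }

    public void addOccurrence(FreqPosting p, int docId, int position) {
        if (p.lastDocId != docId) { // first occurrence of the term in this document
            p.docFreq++;
            p.lastDocId = docId;
        }
        p.positions.add(position);
    }

    public boolean flushPerDocument() { return false; }

    public void flush(Map<String, FreqPosting> postings) {
        flushCount++;
        flushedTerms += postings.size(); // a real plugin would write files here
    }
}

// Stand-in for the part of DocumentsWriter that inverts one field.
final class FieldInverter<P extends Posting> {
    private final InvertedTermsConsumer<P> consumer;
    private final Map<String, P> postings = new HashMap<>();

    FieldInverter(InvertedTermsConsumer<P> consumer) { this.consumer = consumer; }

    void invert(int docId, String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            P p = postings.get(tokens[pos]);
            if (p == null) {
                p = consumer.newPosting(tokens[pos]);
                postings.put(tokens[pos], p);
            }
            consumer.addOccurrence(p, docId, pos);
        }
        if (consumer.flushPerDocument()) {
            flush();
        }
    }

    void flush() {
        consumer.flush(postings);
        postings.clear();
    }

    int bufferedTerms() { return postings.size(); }
}
```

A term vectors plugin would implement the same interface and return
true from flushPerDocument(), so the only difference between the two
plugins in this sketch is when flush() gets called.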

I think this makes sense. We also need to come up with a good solution
for the dictionary, because a term with frq/prx postings needs to store
two (or three, with a skip list) file pointers in the dictionary,
whereas e.g. a "binary" posting list needs only one pointer.
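
One possible shape for such a dictionary entry, purely as a sketch: each
posting format declares how many file pointers it needs, and the entry
stores exactly that many. PostingFormat and TermDictEntry are
hypothetical names for illustration, not part of the current index
format or API.

```java
// Hypothetical: the number of file pointers depends on the posting format.
enum PostingFormat {
    FREQ_PROX(2),       // frq + prx pointers
    FREQ_PROX_SKIP(3),  // frq + prx + skip list pointers
    BINARY(1);          // a single pointer

    final int pointerCount;

    PostingFormat(int pointerCount) { this.pointerCount = pointerCount; }
}

// Hypothetical dictionary entry holding a format-dependent pointer list.
final class TermDictEntry {
    final String term;
    final PostingFormat format;
    private final long[] pointers;

    TermDictEntry(String term, PostingFormat format, long... pointers) {
        if (pointers.length != format.pointerCount) {
            throw new IllegalArgumentException(
                format + " needs " + format.pointerCount
                + " file pointer(s), got " + pointers.length);
        }
        this.term = term;
        this.format = format;
        this.pointers = pointers.clone(); // defensive copy
    }

    int pointerCount() { return pointers.length; }

    long pointer(int index) { return pointers[index]; }
}
```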

Then there would also be plugins that just tap into the entire
document (don't need inversion), like FieldsWriter.

There are still a lot of details to work out...

Definitely. For example, we should think about the Field APIs. Since we
don't have global field semantics in Lucene, I wonder how to handle
conflict cases, e.g. when a document specifies a different posting list
format than a previous one for the same field. The easiest way would be
to not allow it and throw an exception, but this is kind of against
Lucene's current way of dealing with fields. On the other hand, I'm
scared of the complicated code needed to handle conflicts between all
the possible combinations of posting list formats. KinoSearch doesn't
have to worry about this because it has a static schema (I think?), but
it isn't as flexible as Lucene.
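
The "don't allow it and throw an exception" option could look roughly
like the sketch below. FieldFormatRegistry is a hypothetical helper, and
representing a posting list format as a plain string is an assumption
made only to keep the example small.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical: remembers the first posting list format seen per field
// and rejects any later document that asks for a different one.
final class FieldFormatRegistry {
    private final Map<String, String> formatByField = new HashMap<>();

    void register(String field, String format) {
        // putIfAbsent returns the previously stored format, or null if new.
        String previous = formatByField.putIfAbsent(field, format);
        if (previous != null && !previous.equals(format)) {
            throw new IllegalArgumentException(
                "field '" + field + "' already uses posting format '"
                + previous + "', cannot switch to '" + format + "'");
        }
    }

    String formatOf(String field) {
        return formatByField.get(field);
    }
}
```

Re-registering a field with the same format is a no-op here, so only a
genuine change of format for an existing field is treated as a conflict.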

The DocumentsWriter does pooling of the Posting instances and I'm
wondering how much this improves performance.

We should retest this.  I think it was a decent difference in
performance but I don't remember how much.  I think the pooling can
also be made generic (handled by DocumentsWriter).  EG the plugin
could expose a "newPosting()" method.

Yeah, but for code simplicity let's first figure out how much pooling
helps at all.
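
For reference, generic pooling driven by a plugin-supplied newPosting()
factory might be sketched as follows. PostingPool and its allocations
counter are hypothetical; the counter exists only so a retest could
compare how many objects get created with and without reuse.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Hypothetical generic pool: the framework owns the free list, the
// plugin only supplies the factory (its "newPosting()" method).
final class PostingPool<P> {
    private final ArrayDeque<P> free = new ArrayDeque<>();
    private final Supplier<P> factory;
    int allocations = 0; // number of objects actually created

    PostingPool(Supplier<P> factory) { this.factory = factory; }

    // Reuse a released instance if one is available, else allocate.
    P acquire() {
        P p = free.pollFirst();
        if (p == null) {
            allocations++;
            p = factory.get();
        }
        return p;
    }

    // Return an instance for later reuse; the caller must reset its state.
    void release(P p) {
        free.addFirst(p);
    }
}
```

Measuring the benefit would then amount to running the same indexing
workload once with release() recycling instances and once with
release() as a no-op, and comparing throughput.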

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


