Marvin Humphrey <[EMAIL PROTECTED]> wrote:

>  On Apr 13, 2008, at 2:35 AM, Michael McCandless wrote:
>
>
> > I think the major difference is locality?  In a compound file, you
> > have to seek "far away" to reach the prx & skip data (if they are
> > separate).
>
> There's another item worth mentioning, something that Doug, Grant and I
> discussed when this flexible indexing talk started way back when.  When you
> unify frq/prx data into a single file, phrase queries and the like benefit
> from improved locality, but simple term queries are impeded because needless
> positional data must be plowed through.

Good point.

> We dismissed that cost with the assertion that you could specify a
> match-only field for simple queries if that was important to you, but IME
> that doesn't seem to be very practical.  It's hard for the internals to know
> that they should prefer one field over another based on the type of query,
> and hard to manually override everywhere.

The thing is, we have to build this out anyway, right?  Ie, *Query
must know something about the index.  If I have a pluggable indexer,
then on the querying side I need something (I'm not sure what/how)
that knows how to create the right demuxer (container) and codec
(decoder) to interact with whatever my indexing plugins wrote.

So I don't think it's "out of bounds" to have *Query classes that know
to use the non-prx variant of a field when positions aren't needed.
Though I agree it makes things more complex.
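
Roughly, I'm picturing something like this (purely illustrative; the
IndexSchema interface and its methods below are made up, not an actual
API):

  // Hypothetical sketch: pick a match-only variant of a field when a
  // query doesn't need positions.  IndexSchema is invented here just
  // to show the dispatch; it's not a real Lucene class.
  public class FieldVariantResolver {

    public interface IndexSchema {
      boolean hasMatchOnlyVariant(String field);
      String matchOnlyVariant(String field);
    }

    /** Returns the field whose postings the query should actually read. */
    public static String resolveField(String field, boolean needsPositions,
                                      IndexSchema schema) {
      if (!needsPositions && schema.hasMatchOnlyVariant(field)) {
        // A simple TermQuery can skip the prx data entirely.
        return schema.matchOnlyVariant(field);
      }
      // Phrase/span queries, or fields without a variant, use full postings.
      return field;
    }
  }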

> > This is like "column stride" vs "row stride" serialization
> > of a matrix.
> >
> > Relatively soon, though, we will all be on SSDs, so maybe this
> > locality argument becomes far less important ;)
> >
>
> Yes, I've thought about that.  It defeats the phrase-query locality
> argument for unified postings files and recommends breaking things up
> logically by type of data into frq/prx/payload/whatever.
>
> Would it be possible to design a Posting plugin class that reads from
> multiple files?  I'm pretty sure the answer is yes.  It messes up the
> single-stream readRecord() method I've been detailing and might force
> Posting to maintain state.  But if Postings are scarce TermBuffer-style
> objects where values mutate, rather than populous Term-style objects where
> you need a new instance for each set of values, then it doesn't matter if
> they're large.
>
> If that could be done, I think it would be possible to retrofit the
> Posting/PostingList concept into Lucene without a file format change.  FWIW.

I think this is possible.  When reading an index, the
Posting/PostingList should be more like TermBuffer than Term.
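
Very roughly, I imagine something like this (all names invented, and
the real delta/VInt decoding is elided): a single reusable Posting
whose values mutate in place as it reads each record from separate frq
& prx streams.

  import java.io.DataInput;
  import java.io.IOException;

  // Sketch of a reusable, TermBuffer-style Posting: one instance per
  // enum, values overwritten per record, reading from two streams.
  public class ScorePosting {
    private int docId;
    private int freq;
    private int[] positions = new int[0];

    /** Advance to the next record, pulling doc/freq from one stream
     *  and positions from another. */
    public void readRecord(DataInput frqIn, DataInput prxIn) throws IOException {
      docId = frqIn.readInt();
      freq = frqIn.readInt();
      if (positions.length < freq) {
        positions = new int[freq];   // grow the reusable buffer
      }
      for (int i = 0; i < freq; i++) {
        positions[i] = prxIn.readInt();
      }
    }

    public int docId() { return docId; }
    public int freq() { return freq; }
    public int[] positions() { return positions; }
  }

Because the instance is scarce and reused, the extra per-stream state
costs almost nothing per posting.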

Thinking about a codec reading multiple files... I would also love
this: a codec that can write & read layered updates to the inverted
files.  EG say I want to update field X of a bunch of documents.  Why
not allow exactly that, by writing an "incremental" update (with a
generation number so we remain write-once) to the index files?

This way I have a massive original _x.frq file, and then a smallish
_x_1.frq update.

Then when reading, I dynamically choose _x_1.frq when it has a given
docID, else fall back to _x.frq.  We could write many such updates to
a segment over time.  A partial optimize could coalesce all of them
and write a new single .frq (to the next generation).
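
In code the read-time dispatch could be as simple as this (a toy
sketch; PostingsSource is just a made-up stand-in for whatever would
wrap each generation's frq data):

  import java.util.List;

  // Toy sketch of layered postings reads: newest generation wins,
  // otherwise fall back to the original segment file.
  public class LayeredPostingsReader {

    public interface PostingsSource {
      boolean hasDoc(int docId);     // does this layer cover the doc?
      long freqPointer(int docId);   // where its frq data starts
    }

    private final List<PostingsSource> layers;  // newest generation first

    public LayeredPostingsReader(List<PostingsSource> newestFirst) {
      this.layers = newestFirst;
    }

    /** Resolve a docID against the newest layer that has it. */
    public long freqPointer(int docId) {
      for (PostingsSource layer : layers) {
        if (layer.hasDoc(docId)) {
          return layer.freqPointer(docId);
        }
      }
      throw new IllegalArgumentException("doc " + docId + " not found");
    }
  }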

We could do something similar with per-document stores.  EG why
rewrite norms if only a few changed?  We could instead store a
sparsely encoded _x_1.nrm "update".  Likewise for stored fields, term
vectors.
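
Conceptually the sparse norms layer is tiny, something like this
(illustrative only; the on-disk encoding would of course be something
compact rather than a Java map):

  import java.util.Map;
  import java.util.TreeMap;

  // Illustrative sketch of a sparse norms "update" layer: only docs
  // whose norms changed are written to the _x_1.nrm generation.
  public class SparseNormsUpdate {
    private final Map<Integer, Byte> changed = new TreeMap<Integer, Byte>();

    public void setNorm(int docId, byte norm) {
      changed.put(docId, norm);
    }

    /** Read-time resolution: prefer the update, else the original norms. */
    public byte norm(int docId, byte[] originalNorms) {
      Byte updated = changed.get(docId);
      return updated != null ? updated : originalNorms[docId];
    }
  }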

This would finally give us incremental updates: the ability to reindex
just one field of a document.

> > I would think running out of file descriptors is a common problem otherwise.
> >
>
> The default per-process limit for file descriptors on OS X is 256.  Under
> the Lucene non-compound file format, you're guaranteed to run out of file
> descriptors eventually under normal usage.  If KS allowed a non-compound
> format, you'd also be guaranteed to run out of file descriptors, just
> sooner.  Since not failing at all is the only acceptable outcome, there's
> not much practical difference.

It's a performance/risk tradeoff.  Though I wonder what the practical
performance difference is at search time between CFS and non-CFS for
Lucene.

> I think there's more to be gained from tweaking out the VFS than in
> accommodating a non-compound format.  Saddling users with file descriptor
> constraint worries and having to invoke ulimit all the time sucks.

I think making it an option is appropriate, if the performance
difference is there, so long as the default is still CFS.  Simple
things should be simple; complex things should be possible ;)

> > > My conclusion was that it was better to exploit the benefits of bounded,
> > > single-purpose streams and simple file formats whenever possible.
> > >
> > > There's also a middle way, where each *format* gets its own file.  Then
> > > you wind up with fewer files, but you have to track field number state.
> > >
> > > The nice thing is that packet-scoped plugins can be compatible with ALL of
> > > these configurations:
> > >
> >
> > Right.  This way users can pick & choose how to put things in the
> > index (with "healthy" defaults, of course).
> >
>
>
> Well, IMO, we don't want the users to have to care about the container
> classes.
>
> Under the TermDocs/TermPositions model, every time you add new data, you
> need to subclass the containers.  Under the PostingList model, you don't --
> Posting plugs in.
>
> For KS at least, the primary goal is to make Posting public and as easy to
> subclass as possible -- because a public Posting plugin class seems to me to
> be the easiest way to add custom flexible indexing features like text
> payloads, or arbitrary integer values used by custom function queries, or
> other schemes not yet considered.

I agree, Lucene needs a stronger separation of "container" from
"codec".  If someone just wants to plugin a new codec they should be
able to cleanly plug into an existing container and "only" provide the
codec.

EG, say I want to store an arbitrary array of ints per document.
Not all documents have an array, and when they do, the array length
varies.

To do this, I'd like to re-use a container that knows how to store a
byte blob per document, sparsely, and say indexed with a multi-level
skip list.  That container is exactly the container we now use for
storing frq/prx data, under each term.  So, ideally, I can easily
re-use that container and just provide a codec (encoder & decoder)
that maps from an int[] to a byte blob, and back.

We need to factor things so that this container can be easily shared
and is entirely decoupled from the codecs it's using.
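
Concretely, the codec half could be as small as this (a sketch, not a
proposed API; it just shows that mapping between an int[] and a byte
blob is all the plugin should have to provide):

  import java.nio.ByteBuffer;

  // Sketch of a codec that maps an int[] to an opaque byte blob and
  // back, so it can plug into any container that stores one blob per
  // document, sparsely, with skipping.
  public class IntArrayCodec {

    /** Encode the per-document int[] as a length-prefixed run of ints. */
    public byte[] encode(int[] values) {
      ByteBuffer buf = ByteBuffer.allocate(4 + 4 * values.length);
      buf.putInt(values.length);
      for (int v : values) {
        buf.putInt(v);
      }
      return buf.array();
    }

    /** Decode the blob back into an int[]. */
    public int[] decode(byte[] blob) {
      ByteBuffer buf = ByteBuffer.wrap(blob);
      int[] values = new int[buf.getInt()];
      for (int i = 0; i < values.length; i++) {
        values[i] = buf.getInt();
      }
      return values;
    }
  }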

Mike
