Re: real time updates

Nathan Kurz Mon, 16 Mar 2009 00:10:16 -0700

On Sat, Mar 14, 2009 at 5:41 PM, Marvin Humphrey <[email protected]> wrote:
> That sounds like a fun exercise.  Let's start from a clean slate, and try to
> build up interfaces for Indexer and Searchable.  Just so that it's clear that
> nothing final is being decided, let's call our experimental project "Luser".


Thanks for playing along!  I think this could be a very useful thought
experiment, at least for me.

> Since Lucy's internal
> doc numbers will be ephemeral, they wouldn't work.  We'd need to add a primary
> key field.

While I understand this is mostly true, I had assumed that it was
possible to create some synthetic primary key, say, 'segment - doc'.
If not, how does one know which record to mark for deletion with a
delete/add style of update?  Does one save some unique identifier as
a field value, and then search for it?
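
To make the question concrete, here is a minimal sketch of a delete/add
update keyed on a stored identifier.  Everything in it is an assumption
for illustration: ToyIndex, the "id" field name, and the in-memory dicts
are hypothetical and are not Lucy's API.

```python
class ToyIndex:
    """Hypothetical in-memory index; illustrative only, not Lucy's API."""

    def __init__(self):
        self.docs = []    # internal doc number -> fields, or None if deleted
        self.by_key = {}  # primary key value -> current internal doc number

    def add(self, fields):
        # Internal doc numbers are just positions, so they change on re-add.
        doc_num = len(self.docs)
        self.docs.append(fields)
        self.by_key[fields["id"]] = doc_num
        return doc_num

    def delete_by_key(self, key):
        # "Search" for the primary key, then mark that record as deleted.
        doc_num = self.by_key.pop(key, None)
        if doc_num is not None:
            self.docs[doc_num] = None
        return doc_num

    def update(self, fields):
        # Delete/add style update: the old record is only marked deleted.
        self.delete_by_key(fields["id"])
        return self.add(fields)
```

The stable handle here is the "id" field, while the internal doc number
changes on every update, which is the ephemeral behavior under discussion.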

> The very nature of an inverted index is that a single document gets broken up
> and listed in many posting lists.  Updating all those posting lists is
> basically impossible without rebuilding the index.

While I routinely mangle everything related to the Lucy object
hierarchy, the physical layout of the inverted index is one area where
I feel comfortable.  Yes, I'm suggesting that updating each and every
one of those posting lists is both feasible and possibly the best
approach.  It only becomes impossible if the update rate exceeds the
available memory bandwidth, which for a modern processor is mighty
high.  Given a reasonable read/write ratio, I think it's a win, at
least for the cases I'm interested in.  I'm not suggesting you move to
such an approach now, but it's a route I strongly want left open until
conclusively proven to be unworkable.
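
As a rough illustration of what I mean by updating the posting lists
directly, here is a toy sketch.  It assumes the whole index fits in
memory and that we remember which terms each document contributed; the
names postings, doc_terms, and update_doc are hypothetical, and a real
on-disk layout would be far more involved.

```python
from collections import defaultdict

postings = defaultdict(set)  # term -> set of doc ids containing it
doc_terms = {}               # doc id -> set of terms it currently has


def update_doc(doc_id, text):
    """Rewrite only the posting lists touched by this document's change."""
    new_terms = set(text.lower().split())
    old_terms = doc_terms.get(doc_id, set())

    # Remove the doc from lists for terms it no longer contains.
    for term in old_terms - new_terms:
        postings[term].discard(doc_id)
        if not postings[term]:
            del postings[term]

    # Add the doc to lists for terms it has gained.
    for term in new_terms - old_terms:
        postings[term].add(doc_id)

    doc_terms[doc_id] = new_terms
```

The cost of an update is proportional to the number of terms that
changed, not to the size of the index.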

> Getting back to the point... I think IndexReader has to have Doc_Max() and
> Doc_Freq().  How do we avoid those and still support TF/IDF?

You are right that we need these, but it's just a question of where
the support should go.  As a non-OO thinker, I view these as being
data members within some standardized Posting struct.  Thus I'd guess
that these should be part of the Posting object, accessible only
during the Scoring phase, rather than on the IndexReader.   But likely
I'm not understanding you here.
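
Here is a sketch of the kind of struct I have in mind, with the
collection statistics carried alongside the per-document data.  The
field names and the particular TF/IDF variant (square root of term
frequency times a smoothed log IDF) are my assumptions for illustration,
not a description of what Lucy does.

```python
import math
from dataclasses import dataclass


@dataclass
class Posting:
    """Hypothetical posting record handed to the scorer."""
    doc_id: int
    term_freq: int  # occurrences of the term in this document
    doc_freq: int   # number of documents containing the term
    doc_max: int    # total number of documents in the collection


def tf_idf(p):
    # One common TF/IDF variant; other weightings would work the same way.
    idf = 1.0 + math.log(p.doc_max / (p.doc_freq + 1))
    return math.sqrt(p.term_freq) * idf
```

The scorer never touches an IndexReader; whatever fills in the Posting
is responsible for supplying doc_freq and doc_max.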

> It's also hard to imagine life without Lexicon(), or support for sort caches.

Again, it's just a question of where these should go.  While you view
the Lexicon as being part of the Index (and commented appropriately on
the overloading of this term), I view them as independent.  I can imagine
cases where different inverted indexes might share a common lexicon in
the same way that segments currently share one, perhaps even with the
Lexicon being generated and stored by an unrelated app (say, for
user-applied tags).
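
A toy sketch of that separation, assuming the lexicon is nothing more
than a term-to-id mapping that several inverted indexes consult.  The
class and method names are hypothetical.

```python
class Lexicon:
    """Hypothetical standalone lexicon: assigns a stable id to each term."""

    def __init__(self):
        self.ids = {}

    def term_id(self, term):
        # Reuse the existing id, or assign the next one for a new term.
        return self.ids.setdefault(term, len(self.ids))


class InvertedIndex:
    """Posting lists keyed by term id; the lexicon lives outside the index."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self.postings = {}  # term id -> list of doc ids

    def add(self, doc_id, terms):
        for term in terms:
            tid = self.lexicon.term_id(term)
            self.postings.setdefault(tid, []).append(doc_id)

    def docs_for(self, term):
        # Look up without assigning a new id for unknown terms.
        tid = self.lexicon.ids.get(term)
        return self.postings.get(tid, [])
```

Two InvertedIndex instances built on the same Lexicon, say one for
document bodies and one for tags, agree on term ids while keeping
entirely separate posting lists.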

> If you were to write an Architecture class to support pluggability, what would
> it look like?

I guess that's what I'm hoping to answer.  I don't currently know.  My
instinct is that it might be best to treat the problem as one of data
exchange between levels rather than pluggability, but that might just
be my poor OO intuition.

Nathan Kurz
[email protected]
