On Sun, Mar 15, 2009 at 07:21:13PM -0400, Michael McCandless wrote:

> Right. I guess it's because Lucene buffers up deletes that it can continue
> to accept adds & deletes even during the blip. But it cannot write a new
> segment (materialize the adds & deletes) during the blip.
OK, I think that makes sense. Lucene isn't so much performing deletions as
promising to perform deletions at some point in the future. There's still a
window where no new deletions are being performed (the "blip"), and the
process of reconciling deletions finishes during this window.
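In toy form -- this is just the shape of the behavior you describe, not
Lucene's actual IndexWriter internals:

    import java.util.ArrayList;
    import java.util.List;

    // Toy model only -- not Lucene's real IndexWriter, just the shape of it.
    class BufferingWriter {
        private final List<String> bufferedDeleteTerms = new ArrayList<>();

        void deleteDocuments(String term) {
            bufferedDeleteTerms.add(term);  // cheap; keeps working during the blip
        }

        void flushSegment() {
            // The "blip": the promised deletes only become real (bits in a
            // deletions vector) when the next segment gets written.
            applyDeletes(bufferedDeleteTerms);
            bufferedDeleteTerms.clear();
            // ... write the new segment's files (elided) ...
        }

        private void applyDeletes(List<String> terms) {
            // Resolve each term to doc nums and flip bits in the affected
            // segments' deletions vectors (elided).
        }
    }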
> Does this mean you can run multiple writers against the same index, to gain
> concurrency? That would be fab.

I hadn't thought it would be possible, but maybe we can get there...

> (Though... that's tricky, with deletes; oh maybe because you store new
> deletes for an old segment along with the new segment that's OK? Hmm, it
> still seems like you'd have a staleness problem).

What if we have the deletions reader OR together all bit vectors against a
given segment? Search-time performance would dive, of course, but I believe
we'd get logically correct results.
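Something like this, in rough Java (made-up names; the real thing would
decode deletions files from disk rather than hold BitSets in memory):

    import java.util.BitSet;
    import java.util.List;

    // Sketch only: merge every deletions file written against one segment.
    class DeletionsMerger {
        static BitSet mergeDeletions(List<BitSet> deletionFiles, int maxDoc) {
            BitSet merged = new BitSet(maxDoc);
            for (BitSet dels : deletionFiles) {
                merged.or(dels);  // a delete recorded by any writer wins
            }
            return merged;
        }
    }

Since OR is commutative and idempotent, the reader doesn't care which writer
wrote which file, or in what order.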
Under the Lucene bit-vector naming scheme, you'd need to keep every deletions
file around for the life of a given segment -- at least until you had a
consolidator process lock everything down and write an authoritative bit
vector. With the current KS bit-vector naming scheme, out-of-date bit-vector
files would be zapped by the merging process (which in this case means the
consolidator). I don't think it's any more efficient, though it's arguably
cleaner.

The tombstone approach would work for the same reason. It doesn't matter if
multiple tombstone rows contain a tombstone for the same document, because
the priority queue ORs together the results. Therefore, you don't need to
coordinate the addition of new tombstones.
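A sketch, modeling a tombstone row as a sorted array of doc nums -- the real
on-disk format would differ:

    import java.util.BitSet;
    import java.util.List;
    import java.util.PriorityQueue;

    // Sketch of the tombstone merge. Queue entries are {docNum, row, pos}.
    class TombstoneMerger {
        static BitSet mergeTombstones(List<int[]> rows, int maxDoc) {
            PriorityQueue<int[]> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
            for (int i = 0; i < rows.size(); i++) {
                if (rows.get(i).length > 0) {
                    queue.add(new int[] { rows.get(i)[0], i, 0 });
                }
            }
            BitSet deletions = new BitSet(maxDoc);
            while (!queue.isEmpty()) {
                int[] top = queue.poll();
                // Duplicate tombstones for one doc just set the bit again.
                deletions.set(top[0]);
                int[] row = rows.get(top[1]);
                int pos = top[2] + 1;
                if (pos < row.length) {
                    queue.add(new int[] { row[pos], top[1], pos });
                }
            }
            return deletions;
        }
    }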
Claiming a new segment directory and committing a new master file
(segments_XXX in Lucene, snapshot_XXX.json in KS) wouldn't require
synchronization: if those ops fail because your process lost out in the race
condition, you just retry. The only time we have a true synchronization
requirement is during merging.

So... if we were to somehow make tombstones perform adequately at
search-time, I think we could make a many-writers-single-merger model work.
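Sketched with an illustrative naming scheme, using atomic create-if-absent as
a stand-in for whatever the real claim op would be:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.FileAlreadyExistsException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Sketch of the no-lock commit: claim the next snapshot_XXX.json name
    // by atomic create, and simply retry if another writer got there first.
    class MasterFileCommitter {
        static Path commit(Path indexDir, byte[] masterFile) throws IOException {
            while (true) {
                long gen = nextGeneration(indexDir);
                Path target = indexDir.resolve("snapshot_" + gen + ".json");
                try {
                    // CREATE_NEW fails atomically if the name is already
                    // taken -- i.e. we lost the race -- so no lock is needed.
                    Files.write(target, masterFile, StandardOpenOption.CREATE_NEW);
                    return target;
                } catch (FileAlreadyExistsException lostRace) {
                    // Another writer claimed this generation; retry.
                }
            }
        }

        static long nextGeneration(Path indexDir) throws IOException {
            long max = 0;
            try (DirectoryStream<Path> dir = Files.newDirectoryStream(indexDir)) {
                for (Path p : dir) {
                    String name = p.getFileName().toString();
                    if (name.startsWith("snapshot_") && name.endsWith(".json")) {
                        String gen = name.substring(9, name.length() - 5);
                        max = Math.max(max, Long.parseLong(gen));
                    }
                }
            }
            return max + 1;
        }
    }

The loop only spins when another writer actually wins the race, so there's
nothing to starve on.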
> Ugh, lock starvation. Really the OS should provide a FIFO lock queue of
> some sort.

Well, I think this would be less of a headache if we didn't need portability.
It's just that the locking and IPC mechanisms provided by various operating
systems out there are wildly incompatible.

Unfortunately, I don't think there's any other way to implement background
merging for all Lucy target hosts besides the multiple-process approach.
Lucy will never work with Perl ithreads.

PS: FYI, your messages today have premature line-wrapping issues -- your
original text, not just the quotes.

Marvin Humphrey
