Marvin Humphrey wrote:

On Sat, Mar 14, 2009 at 05:51:43AM -0400, Michael McCandless wrote:
Even w/ background merging, which allows new segments to be written &
reopened in a reader even while the big merge is running in the BG,
Lucene still has the challenge of warming a reader on the [large]
newly merged segment before using the reader "for real".

Lucy doesn't have to worry about the warming aspect; given sufficient RAM, all the files in the recently written segment will still be "hot" in the OS file
cache.

The trick we need to master is the coordination of two concurrent write
processes.  I think it goes something like this:

* The background consolidator writer grabs "consolidate.lock".  It starts
  writing its own segment based on the state of the index at that moment.

* Meanwhile, an indeterminate number of consolidator-aware write processes
   launch and complete.

So eg you could merge 2 sets of segments at once (like Lucene)?

These processes are forbidden from merging any files that pre-date the
establishment of "consolidate.lock".

Why? It seems like it needs to merge segments created before it acquired
that lock (that's why it was launched).

* Once the consolidator finishes most of what it's doing, it waits to obtain
  a write lock.  The only task left is to carry forward new deletions which
  have been made since the establishment of "consolidate.lock" against the
  segments which the consolidator has just merged away.  It finishes that
  task, commits, releases "write.lock", releases "consolidate.lock", then
  exits.

That, and update the master "segments" file to actually record the merge, and
incRef/decRef to delete files.
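
Roughly, the tail end of that sequence could look something like the
following minimal Java sketch, where Lock, DeletionsWriter and SegmentsFile
are made-up stand-ins rather than actual Lucene or Lucy classes:

    import java.util.List;

    // All of these types are hypothetical stand-ins, not real Lucene/Lucy APIs.
    interface Lock { void acquire() throws Exception; void release(); }
    interface DeletionsWriter {
        // Remap deletions made against the source segments since
        // consolidate.lock was established onto the new merged segment.
        void carryForwardDeletions(List<String> mergedAway) throws Exception;
    }
    interface SegmentsFile {
        void replaceSegments(List<String> oldSegs, String newSeg) throws Exception;
        void commit() throws Exception;
        void decRef(String segment);  // files are deleted when refcount hits zero
    }

    class Consolidator {
        private final Lock consolidateLock;  // held for the whole background merge
        private final Lock writeLock;        // held only for the brief final commit
        private final DeletionsWriter deletionsWriter;
        private final SegmentsFile segmentsFile;

        Consolidator(Lock consolidateLock, Lock writeLock,
                     DeletionsWriter deletionsWriter, SegmentsFile segmentsFile) {
            this.consolidateLock = consolidateLock;
            this.writeLock = writeLock;
            this.deletionsWriter = deletionsWriter;
            this.segmentsFile = segmentsFile;
        }

        /** Final steps, after the big merged segment has been fully written. */
        void finishMerge(List<String> mergedAway, String newSegment) throws Exception {
            writeLock.acquire();  // wait for any small writer to finish committing
            try {
                deletionsWriter.carryForwardDeletions(mergedAway);

                // Record the merge in the master "segments" file and commit.
                segmentsFile.replaceSegments(mergedAway, newSegment);
                segmentsFile.commit();

                // Let reference counting reclaim the merged-away segments' files.
                for (String seg : mergedAway) {
                    segmentsFile.decRef(seg);
                }
            } finally {
                writeLock.release();
                consolidateLock.release();
            }
        }
    }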

Does that sound similar to the Lucene implementation?

Yes.

But what if, while a large merge is happening, enough segments have been
written to warrant a small merge kicking off & finishing?

We need an incremental copy-on-write solution (eg only the "page" that's
changed gets copied when a new deletion arrives).  We need this for changes
to norms too.
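
(As an illustration of the page-level idea, a toy copy-on-write deletions
structure might look like the Java sketch below; this is purely hypothetical,
not Lucene's or Lucy's actual representation.  Applying a deletion produces a
new snapshot that shares every unchanged page with the old one and copies
only the page it touched.)

    // Toy page-level copy-on-write deletions bit vector (hypothetical).
    final class CowDeletions {
        private static final int PAGE_BYTES = 1024;  // 8192 docs per page
        private final byte[][] pages;

        CowDeletions(int maxDoc) {
            int numBytes = (maxDoc + 7) / 8;
            this.pages = new byte[Math.max(1, (numBytes + PAGE_BYTES - 1) / PAGE_BYTES)][];
        }

        private CowDeletions(byte[][] pages) { this.pages = pages; }

        /** Return a new snapshot with docId deleted; only one page is copied. */
        CowDeletions delete(int docId) {
            int byteIndex = docId >>> 3;
            int pageIndex = byteIndex / PAGE_BYTES;
            byte[][] newPages = pages.clone();     // shallow copy of the page table
            byte[] page = (pages[pageIndex] == null)
                ? new byte[PAGE_BYTES]
                : pages[pageIndex].clone();        // deep copy of just this page
            page[byteIndex % PAGE_BYTES] |= (byte) (1 << (docId & 7));
            newPages[pageIndex] = page;
            return new CowDeletions(newPages);
        }

        boolean isDeleted(int docId) {
            int byteIndex = docId >>> 3;
            byte[] page = pages[byteIndex / PAGE_BYTES];
            return page != null
                && (page[byteIndex % PAGE_BYTES] & (1 << (docId & 7))) != 0;
        }
    }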

Norms, huh? That's weird. Do you have to do that because a field definition
has been modified?

No, it's to handle someone calling IndexReader.setNorm, eg if they are
doing "realtime boosting".


But then does deletions-seg_2.bv contain all deletes for seg_2? In which case this is just like the "generation" Lucene increments & tacks on when
it saves a del; just a different naming scheme.

That's right, it's just a different naming scheme. In fact, it's marginally less efficient because the bit vector must be copied a little more often.

However, with that change, segment directories are truly never modified once written. For somewhat esoteric reasons, that made it easier to factor a sensible DeletionsWriter out of the existing KinoSearch indexing code so that
we could plug in alternative implementations.
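
(Concretely, the two naming schemes end up producing file names along these
lines; the strings below are just illustrative, not the exact on-disk
formats.)

    // Toy illustration of "never modify a written file": each time a
    // segment's deletions change, a complete new bit vector is written under
    // a brand-new name instead of rewriting the old file in place.
    class DeletionsNaming {
        // Lucene-style: tack an incrementing generation onto the file name.
        static String luceneStyle(String segment, long generation) {
            return segment + "_" + generation + ".del";
        }

        // Lucy/KinoSearch-style: write the bit vector into the newest
        // segment's directory, named for the segment whose deletions it holds.
        static String lucyStyle(String newestSegment, String targetSegment) {
            return newestSegment + "/deletions-" + targetSegment + ".bv";
        }

        public static void main(String[] args) {
            System.out.println(luceneStyle("_2", 3));        // _2_3.del
            System.out.println(lucyStyle("seg_5", "seg_2")); // seg_5/deletions-seg_2.bv
        }
    }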

OK.

Mike
