Re: adding "explicit commits" to Lucene?

Michael McCandless Tue, 16 Jan 2007 03:10:40 -0800

OK, catching up here and trying to merge the threads together; otherwise
I'm going to lose my mind!

Chuck Williams wrote:
>
> Ning Li wrote:
>>
>> If a reader can only open snapshots both for search and for
>> modification, I think another change is needed besides the ones
>> listed: assume the latest snapshot is segments_5 and the latest
>> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
>> snapshot segments_5, performs a few deletes and writes a new
>> checkpoint segmentsx_8. The summary file segmentsx_8 should include
>> the 2 new segments which are in segmentsx_7 but not in segments_5.
>> Such segments to include are easily identifiable only if they are not
>> merged with segments in the latest snapshot... All these won't be
>> necessary if a reader always opens the latest checkpoint for
>> modification, which will also support deletion of non-committed
>> documents.
>>
> This problem seems worse.  I don't see how a reader and a writer can
> independently compute and write checkpoints.  The adds in the writer
> don't just create new segments, they replace existing ones through
> merging.  And the merging changes doc-ids by expunging deletes.  It
> seems that all deletes must be based on the most recent checkpoint, or
> merging of checkpoints to create the next snapshot will be considerably
> more complex.

Good catch Ning!  And, I agree, when a reader plans to make
modifications to the index, I think the best solution is to require
that the reader has opened the most recent "segments*_N" (be that a
snapshot or a checkpoint).  A reader is really a "writer" in
this context.  This means we need a way to open a reader against the
most recent checkpoint as well (I will add that).

This is very much consistent with how a reader now checks if it is
still current when someone first tries to change a del/norm: if it's
not still current (ie, another writer has written a new segments_N
file) then an IOException is raised with "IndexReader out of date and
no longer valid for delete, undelete, or setNorm operations".  I think
with explicit commits that same requirement & check would apply.
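To make that check concrete, here is a minimal sketch of the idea in Java. It is a toy model, not Lucene's IndexReader: ToyDirectory, ToyReader, and the single generation counter are hypothetical names invented for illustration. Only the exception message is taken from the text above.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of the "is this reader still current?" check. Not Lucene code. */
class StaleCheckSketch {

    /** Hypothetical stand-in for a directory; tracks N of the newest segments*_N file. */
    static final class ToyDirectory {
        private final AtomicLong latestGeneration = new AtomicLong(5);

        long latestGeneration() { return latestGeneration.get(); }

        /** Simulates some writer publishing a new segments*_N file. */
        long writeNewGeneration() { return latestGeneration.incrementAndGet(); }
    }

    /** Hypothetical reader that remembers which generation it opened. */
    static final class ToyReader {
        private final ToyDirectory dir;
        private long openedGeneration;
        private boolean hasChanges;

        ToyReader(ToyDirectory dir) {
            this.dir = dir;
            this.openedGeneration = dir.latestGeneration();
        }

        /** On the first change only, verify that nobody has written a newer generation. */
        void deleteDocument(int docId) throws IOException {
            if (!hasChanges && openedGeneration != dir.latestGeneration()) {
                throw new IOException("IndexReader out of date and no longer valid "
                        + "for delete, undelete, or setNorm operations");
            }
            hasChanges = true; // a real reader would also record the deletion itself
        }

        /** Publishing the pending changes makes this reader the current one. */
        void commit() {
            if (hasChanges) {
                openedGeneration = dir.writeNewGeneration();
                hasChanges = false;
            }
        }
    }
}
```

In this sketch, two readers opened against the same generation cannot both modify the index: once the first commits, the second fails on its first delete and has to be reopened.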

Chuck Williams wrote:

> My interest is transactions, not making doc-id's permanent.
> Specifically, the ability to ensure that a group of adds either all go
> into the index or none go into the index, and to ensure that if none go
> into the index that the index is not changed in any way.

Right, I see "explicit commits" as a very simple implementation to
provide a powerful base functionality to Lucene.  This base
functionality can indeed enable or make easier/more performant many
neat things above it (the permanent docids discussion, Chuck's highly
performant ParallelWriter, delayed flushing of pending deletes, etc)
but I'd like to keep a clean separation and focus first only on making
the most minimal yet self-contained "explicit commits" work and then
separately build out on top of it.  Progress not perfection!

Doron Cohen wrote:

> As a database application, to my understanding the (newly suggested)
> transaction support in Lucene is single tx. I can't see how multiple
> tx can be done within Lucene (and I don't think it should be
> done). Even if it was possible, I think indexing would become very
> inefficient. I think the motivation for adding (some) tx support is
> different, and tx support would be minimal, definitely not multiple
> tx.

Ning Li wrote:

> Lastly, hopefully the term "transaction" won't cause any confusion
> since this "explicit commit" is much simpler than database
> transaction where a database can guarantee the ACID properties for
> each of multiple concurrent transactions.

I agree "explicit commits" is in fact a reduced version of the more
general ACID transactions that relational DBs provide.  I really don't
want to call it "transactions" for this reason: that label would
automatically oversell the capability, only to disappoint our users
later.  It's always best to "under promise and over deliver", and the
label "transactions" would do just the reverse.  But yes, explicit
commits is basically a "single transaction".

> If I had a vote it would be +1 on the direction Michael has proposed,
> assuming it can be done robustly and without performance penalty.

I don't anticipate any performance issues.  The implementation is so
amazingly trivial!  The only index format change is a new name for
those segments_N files that are just the automatic checkpoints that
Lucene writes.  Otherwise the index format is unchanged.  Beyond that,
there is additional logic for a reader/writer to decide which one of
these files to read/write.
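As a rough illustration of that selection logic, the sketch below picks the newest file from a directory listing. It rests on several assumptions: the "segmentsx_N" checkpoint name is the one used in this thread, snapshots and checkpoints are assumed to share one generation counter, and N is parsed in base 36 as Lucene does for segments_N. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: choose which segments file a reader or writer should open. */
class SegmentsFileChooserSketch {
    static final String SNAPSHOT_PREFIX = "segments_";
    static final String CHECKPOINT_PREFIX = "segmentsx_"; // name taken from this thread

    /** Returns the generation encoded in fileName, or -1 if the name does not match prefix. */
    static long generationOf(String fileName, String prefix) {
        if (!fileName.startsWith(prefix)) {
            return -1;
        }
        try {
            return Long.parseLong(fileName.substring(prefix.length()), Character.MAX_RADIX);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /**
     * Returns the file with the highest generation. Checkpoints are considered
     * only when includeCheckpoints is true (e.g. a reader opened for modification).
     */
    static Optional<String> latest(List<String> files, boolean includeCheckpoints) {
        String best = null;
        long bestGen = -1;
        for (String f : files) {
            long gen = generationOf(f, SNAPSHOT_PREFIX);
            if (gen < 0 && includeCheckpoints) {
                gen = generationOf(f, CHECKPOINT_PREFIX);
            }
            if (gen > bestGen) {
                bestGen = gen;
                best = f;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

With Ning's example (latest snapshot segments_5, latest checkpoint segmentsx_7), a search-only reader would get segments_5 from latest(files, false), while a reader opened for modification would get segmentsx_7 from latest(files, true).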

The only really "interesting" change is to the IndexFileDeleter: it
now must be more careful in how it figures out which index files are
safe to delete (this is the part I'm working on now).  I will
definitely test performance (with the new benchmarking suite!)  but I
don't expect any changes for the better or worse with just "explicit
commits".
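The core of the "safe to delete" question can be sketched as a set difference: a file is deletable only if no retained snapshot or checkpoint references it. This is a simplified illustration and not the actual IndexFileDeleter; the class name, the method, and the idea of passing in a ready-made map of retained segments files are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of deciding which index files are unreferenced. */
class DeletableFilesSketch {

    /**
     * @param allFiles every file currently in the index directory
     * @param retained each snapshot/checkpoint file that must stay readable,
     *                 mapped to the index files it references
     * @return files referenced by none of the retained segments files, sorted by name
     */
    static Set<String> deletable(Set<String> allFiles, Map<String, Set<String>> retained) {
        Set<String> live = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : retained.entrySet()) {
            live.add(e.getKey());      // the segments file itself stays
            live.addAll(e.getValue()); // and so does everything it points at
        }
        Set<String> result = new TreeSet<>(allFiles);
        result.removeAll(live);
        return result;
    }
}
```

The extra care comes from the fact that the latest snapshot and the latest checkpoint can reference different sets of segments, so both have to be consulted before anything is removed.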

The things that then become possible once you have explicit commits
should give us good potential performance improvements, error
recoverability, etc. in the future.  But that's the future and I'm
focusing on "now" :)

Mike
