OK, catching up here and trying to merge threads together; otherwise
I'm going to lose my mind!

Chuck Williams wrote:
>
> Ning Li wrote:
>>
>> If a reader can only open snapshots both for search and for
>> modification, I think another change is needed besides the ones
>> listed: assume the latest snapshot is segments_5 and the latest
>> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
>> snapshot segments_5, performs a few deletes and writes a new
>> checkpoint segmentsx_8. The summary file segmentsx_8 should include
>> the 2 new segments which are in segmentsx_7 but not in segments_5.
>> Such segments to include are easily identifiable only if they are not
>> merged with segments in the latest snapshot... All these won't be
>> necessary if a reader always opens the latest checkpoint for
>> modification, which will also support deletion of non-committed
>> documents.
>>
> This problem seems worse. I don't see how a reader and a writer can
> independently compute and write checkpoints. The adds in the writer
> don't just create new segments, they replace existing ones through
> merging. And the merging changes doc-ids by expunging deletes. It
> seems that all deletes must be based on the most recent checkpoint, or
> merging of checkpoints to create the next snapshot will be considerably
> more complex.

Good catch Ning! And, I agree, when a reader plans to make
modifications to the index, I think the best solution is to require
that the reader has opened the most recent "segments*_N" (be that a
snapshot or a checkpoint). Really a reader is actually a "writer" in
this context.

This means we need a way to open a reader against the most recent
checkpoint as well (I will add that).

This is very much consistent with how a reader now checks if it is
still current when someone first tries to change a del/norm: if it's
not still current (i.e., another writer has written a new segments_N
file) then an IOException is raised with "IndexReader out of date and
no longer valid for delete, undelete, or setNorm operations".
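
To make that check concrete, here is a minimal Java sketch of the idea.
It is only a model under stated assumptions, not Lucene code: Index,
Reader, latestGeneration, writeNext and tryDelete are hypothetical
names, and the sketch re-checks on every delete, whereas the real check
runs when someone first tries to change a del/norm.

```java
import java.io.IOException;

/** Hypothetical model of the "is this reader still current?" check. */
final class StaleReaderSketch {

    /** Stands in for the index directory; tracks N of the newest segments*_N file. */
    static final class Index {
        private long latestGeneration;

        Index(long latestGeneration) {
            this.latestGeneration = latestGeneration;
        }

        synchronized long latestGeneration() {
            return latestGeneration;
        }

        /** Simulates someone else writing a new segments*_N file. */
        synchronized long writeNext() {
            return ++latestGeneration;
        }
    }

    /** A reader that remembers which generation it opened. */
    static final class Reader {
        private final Index index;
        private final long openedGeneration;
        private int pendingDeletes = 0;

        Reader(Index index) {
            this.index = index;
            this.openedGeneration = index.latestGeneration();
        }

        /** Refuses the change if a newer segments*_N file appeared after open. */
        void deleteDocument(int docId) throws IOException {
            if (index.latestGeneration() != openedGeneration) {
                throw new IOException("IndexReader out of date and no longer valid "
                        + "for delete, undelete, or setNorm operations");
            }
            pendingDeletes++; // a real reader would mark docId as deleted here
        }

        int pendingDeletes() {
            return pendingDeletes;
        }
    }

    /** Returns true if the delete was accepted, false if the reader was stale. */
    static boolean tryDelete(Reader reader, int docId) {
        try {
            reader.deleteDocument(docId);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

In this model a reader opened at generation 5 can delete until someone
calls writeNext(); after that its deletes are refused, while a reader
opened afterwards (at generation 6) is accepted.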
I think with explicit commits that same requirement & check would
apply.

Chuck Williams wrote:
> My interest is transactions, not making doc-id's permanent.
> Specifically, the ability to ensure that a group of adds either all go
> into the index or none go into the index, and to ensure that if none go
> into the index that the index is not changed in any way.

Right, I see "explicit commits" as a very simple implementation to
provide a powerful base functionality to Lucene. This base
functionality can indeed enable or make easier/more performant many
neat things above it (the permanent docids discussion, Chuck's highly
performant ParallelWriter, delayed flushing of pending deletes, etc)
but I'd like to keep a clean separation and focus first only on making
the most minimal yet self-contained "explicit commits" work and then
separately build out on top of it. Progress not perfection!

Doron Cohen wrote:
> As a database application, to my understanding the (newly suggested)
> transaction support in Lucene is single tx. I can't see how multiple
> tx can be done within Lucene (and I don't think it should be
> done). Even if it was possible, I think indexing would become very
> inefficient. I think the motivation for adding (some) tx support is
> different, and tx support would be minimal, definitely not multiple
> tx.

Ning Li wrote:
> Lastly, hopefully the term "transaction" won't cause any confusion
> since this "explicit commit" is much simpler than database
> transaction where a database can guarantee the ACID properties for
> each of multiple concurrent transactions.

I agree "explicit commits" is in fact a reduced version of the more
general ACID transactions that relational DBs provide. I really don't
want to call it "transactions" for this reason: that label would
automatically oversell the capability, then only to later disappoint
our users. Always best to "under promise and over deliver" and the
label "transactions" would do just the reverse.
But yes explicit commits is basically a "single transaction".

> If I had a vote it would be +1 on the direction Michael has proposed,
> assuming it can be done robustly and without performance penalty.

I don't anticipate any performance issues. The implementation is so
amazingly trivial! The only index format change is a new name for
those segments_N files that were just the automatic checkpoints that
Lucene does. Otherwise the index format is unchanged. And then
additional logic for a reader/writer to decide which one of these to
read/write.

The only really "interesting" change is to the IndexFileDeleter: it
now must be more careful in how it figures out which index files are
safe to delete (this is the part I'm working on now).

I will definitely test performance (with the new benchmarking suite!)
but I don't expect any changes for the better or worse with just
"explicit commits". The things that then become possible once you
have explicit commits should give us good potential performance
improvements, error recoverability, etc. in the future. But that's
the future and I'm focusing on "now" :)

Mike
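
One rough way to picture the IndexFileDeleter problem mentioned above:
a file is only safe to delete if no snapshot or checkpoint that is
still in use references it. The Java sketch below illustrates that rule
under stated assumptions and is not the actual implementation; the
class name, the method name and the idea of passing each live
segments*_N file as a plain list of file names are all made up for
illustration.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: which index files does no live commit point reference? */
final class DeletableFilesSketch {

    /**
     * @param allFiles    every file currently in the index directory
     * @param liveCommits for each snapshot/checkpoint still in use, the
     *                    files it references (including its own segments*_N file)
     * @return files referenced by none of the live commit points, sorted by name
     */
    static Set<String> deletable(Collection<String> allFiles,
                                 Collection<? extends Collection<String>> liveCommits) {
        // Start from everything, then subtract whatever is still referenced.
        Set<String> result = new TreeSet<>(allFiles);
        for (Collection<String> referenced : liveCommits) {
            result.removeAll(referenced);
        }
        return result;
    }
}
```

For example, with files _1.cfs, _2.cfs, _3.cfs, segments_5 and
segmentsx_7 in the directory, where segments_5 references _1.cfs and
segmentsx_7 references _1.cfs and _3.cfs, only _2.cfs comes back as
deletable.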