Re: adding "explicit commits" to Lucene?

robert engels Mon, 15 Jan 2007 13:23:43 -0800

I think that you will find a much larger performance decrease indoing things this way - if the external resource is a db, or anynetworked accessed resource.

When even just a single document is changed in the Lucene index youcould have MILLIONS of changes to internal doc ids (if say an earlydocument was deleted).


Seems far better to store the external id in the Lucene index.

You will find that performance penalty of looking up the Lucenedocument by the external id (vs. the internal doc #), to be far lessthan the performance penalty of updating every document in theexternal index when the Lucene index is merged.

The only case I can see this would be of any benefit is if the Luceneindex RARELY if EVER changes - anything else, and you will have bigproblems.

Now, if the Lucene is changed to support point in time searching(basically never delete any index files), you might be able to dowhat you this. Just create a Directory only creating the segments upto that time.


Sounds VERY messy to me.

On Jan 15, 2007, at 3:12 PM, Doron Cohen wrote:

Also related is the request made several times in the list to beable tocontrol when docids are changing, for applications that need tomaintainsome mapping between external IDs to Lucene docs but for someperformance

reasons cannot afford to only count on storing external (DB) IDs in

Lucene's field. For instance, recent discussion "Making documentnumbers

persistent" in java-user.

So, an application controlled commit would allow an application not to

"experience" document numbering changes - no docid changes wouldaffect theapplication until a commit is issued. So the application would beable to

call optimize and then issue a commit, thereby exposing docid changes.

One disadvantage of controlling ids changes like this is thatsearch would

have to stale long behind index updates, unless optimize is called.

Therefore, - that's another issue of course - I am wondering ifthere mightbe interest in allowing applications to control whether deleteddocs are

allowed to be removed/squeezed-out or not.

Michael McCandless <[EMAIL PROTECTED]> wrote on 14/01/2007
13:36:34:

Team,

I've been struggling to find a clean solution for LUCENE-710, when I

thought of a simple addition to Lucene ("explicit commits") thatwould

I think resolve LUCENE-710 and would fix a few other outstanding
issues when readers are using a "live" index (being updated by a
writer).

The basic idea is to add an explicit "commit" operation to Lucene.

This is the same nice feature Solr has, but just a different
implementation (in Lucene core, in a single index, instead).  The
commit makes a "point in time" snapshot (term borrowed from Solr!)
available for searching.

The implementation is surprisingly simple (see below) and completely
backwards compatible.

I'd like to get some feedback on the idea/implementation.


Details...: right now, Lucene writes a new segments_N file at various

times: when a writer (or reader that's writing deletes/norms)needs toflush its pending changes to disk; when a writer merges segments;when

a writer is closed; multiple times during optimize/addIndexes; etc.

These times are not controllable / predictable to the developer using
Lucene.

A new reader always opens the last segments_N written, and, when a
reader uses isCurrent() to check whether it should re-open (the
suggested way), that method always returns false (meaning you should
re-open) if there are any new segments_N files.

So it's somewhat uncontrollable to the developer what state the index
is in when you [re-]open a reader.

People work around this today by adding logic above Lucene so thatthe

writer separately communicates to readers when is a good time to
refresh.  But with "explicit commits", readers could instead look
directly at the index and pick the right segments_N to refresh to.

I'm proposing that we separate the writing of a new segments_N file
into those writes that are done automatically by Lucene (I'll call

these "checkpoints") from meaningful (to the application) commitsthat

are done explicitly by the developer at known times (I'll call this
"committing a snapshot").  I would add a new boolean mode to

IndexWriter called "autoCommit", and a new public method "commit()" toIndexWriter and IndexReader (we'd have to rename the currentprotected

commit() in IndexReader)

When autoCommit is true, this means every write of a segments_N file
will be "commit a snapshot", meaning readers will then use it for
searching.  This will be the default and this is exactly how Lucene
behaves today.  So this change is completely backwards compatible.

When autoCommit is false, then when Lucene chooses to save a
segments_N file it's just a "checkpoint": a reader would not open or
re-open to the checkpoint.  This means the developer must then call
IndexWriter.commit() or IndexReader.commit() in order to "commit a
snapshot" at the right time, thereby telling readers that this
segments_N file is a valid one to switch to for searching.


The implementation is very simple (I have an initial coarse prototype
working with all but the last bullet):

   * If a segments_N file is just a checkpoint, it's named

"segmentsx_N" (note the added 'x'); if it's a snapshot, it'snamed

     "segments_N".  No other changes to the index format.

   * A reader by default opens the latest snapshot but can optionally
     open a specific N (segments_N) snapshot.

   * A writer by default starts from the most recent "checkpoint" but
     may also take a specific checkpoint or snapshot point N
     (segments_N) to start from (to allow rollback).

   * Change IndexReader.isCurrent() to see if there are any newer
     snapshots but disregard newer checkpoints.

* When a writer is in autoCommit=false mode, it always writesto the

     next segmentsx_N; else it writes to segments_N.

   * The commit() method would just write to the next segments_N file
     and return the N it had written (in case application needs to
     re-use it later).

* IndexFileDeleter would need to have a slightly smarter policywhenautoCommit=false, ie, "don't delete anything referenced byeitherthe past N snapshots or if the snapshot was obsoleted lessthan X

     minutes ago".


I think there are some compelling things this could solve:

   * The "delete then add" problem (really a special but very common
     case of general transactions):

Right now when you want to update a bunch of documents in aLuceneindex, it's best to open a reader, do a "batch delete", closethe

     reader, open a writer, do a "batch add", close the writer.  This
     is the suggested way.

     The open risk here is that a reader could refresh at any time
     during these operations, and find that a bunch of documents have
     been deleted but not yet added again.

Whereas, with autoCommit false you could do this entireoperation(batch delete then batch add), and then call the final commit() in

     the end, and readers would know not to re-open the index until
     that final commit() succeeded.

   * The "using too much disk space during optimize" problem:

     This came up on the user's list recently: if you aggressively
     refresh readers while optimize() is running, you can tie up much
     more disk space than you'd expect, because your readers are

holding open all the [possibly very large] intermediatesegments.


     Whereas, if autoCommit is false, then developer calls optimize()
     and then calls commit(), the readers would know not to re-open
     until optimize was complete.

   * More general transactions:

     It has come up a fair number of times how to make Lucene

transactional, either by itself ("do the following complexseries

     of index operations but if there is any failure, rollback to the
     start, and don't expose result to searcher until all operations
     are done") or as part of a larger transaction eg involving a
     relational database.

     EG, if you want to add a big set of documents to Lucene, but not

make them searchable until they are all added, or until aspecific

     time (eg Monday @ 9 AM), you can't do that easily today but it
     would be simple with explicit commits.

I believe this change would make transactions work correctlywith

     Lucene.

* LUCENE-710 ("implement point in time searching withoutrelying on

     filesystem semantics"), also known as "getting Lucene to work
     correctly over NFS".

I think this issue is nearly solved when autoCommit=false, aslongas we can adopt a shared policy on "when readers refresh" tomatchthe new deletion policy (described above). Basically, aslong as

     the deleter and readers are playing by the same "refresh rules"

and the writer gives the readers enough time to switch/warm,then

     the deleter should never delete something in use by a reader.



There are also some neat future things made possible:

* The "support deleteDocuments in IndexWriter" (LUCENE-565)feature

     could have a more efficient implementation (just like Solr) when
     autoCommit is false, because deletes don't need to be flushed

until commit() is called. Whereas, now, they must beaggressively

     flushed on each checkpoint.

* More generally, because "checkpoints" do not need to beusable by

     a reader/searcher, other neat optimizations might be possible.

     EG maybe the merge policy could be improved if it knows that
     certain segments are "just checkpoints" and are not involved in
     searching.

   * I could simplify the approach for my recent addIndexes changes

(LUCENE-702) to use this, instead of it's current approach(wish I

     had thought of this sooner: ugh!.).

   * A single index could hold many snapshots, and, we could enable a

reader to explicitly open against an older snapshot. EGmaybe you

     take weekly and a monthly snapshot because you sometimes want to
     go back and "run a search on last week's catalog".

Feedback?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: adding "explicit commits" to Lucene?

Reply via email to