If that is all you need, I think it is far simpler:
If you have an OID, then all that is required is to write the
operations (delete this OID, insert this document, etc.) to a
separate disk file.
Once the file is permanently on disk, it is simple to just keep
replaying the file until it succeeds.
This is what we do in our search server.
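Something like the following rough sketch, for example (the Op format and
the IndexOps callback are made up for illustration, not part of Lucene;
operations keyed by OID are idempotent, so replaying from the start is safe):

// Sketch of a durable operation log that is replayed until the index
// operations succeed. Op strings and IndexOps are illustrative only.
import java.io.*;

public class OpLog {
    private final File file;

    public OpLog(File file) { this.file = file; }

    /** Append an operation (e.g. "DELETE <oid>" or "INSERT <oid>") and force it to disk. */
    public synchronized void append(String op) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file, true)) {
            out.write((op + "\n").getBytes("UTF-8"));
            out.getFD().sync();              // operation is now permanently on disk
        }
    }

    /** Replay the log until every operation has been applied to the index. */
    public void replay(IndexOps index) throws InterruptedException {
        while (true) {
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                String op;
                while ((op = in.readLine()) != null) {
                    index.apply(op);         // delete this OID, insert this document, ...
                }
                return;                      // all operations succeeded
            } catch (IOException e) {
                Thread.sleep(1000);          // index unavailable; retry from the start
            }
        }
    }

    /** Abstraction over the actual Lucene calls; assumed, not part of Lucene. */
    public interface IndexOps {
        void apply(String op) throws IOException;
    }
}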
I am not completely familiar with ParallelReader, but in reading the
JavaDoc I don't see the benefit - since you have to write the
documents to both indexes anyway, why is it of any benefit to break
the document into multiple parts?
If you have OIDs available, what ParallelReader does can be
accomplished in a far simpler and more efficient manner - we have a
completely federated server implementation that was trivial, less
than 100 lines of code. We compute a hash from the OID, store the
document in a different index depending on the hash, then run the
query across all indexes in parallel, joining the results.
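Roughly like this (searchOne and the result merge are placeholders for the
real per-index Lucene calls, not Lucene API):

// Sketch of OID-hash partitioning across N indexes with a parallel fan-out
// query. partitionFor/searchOne are illustrative helpers.
import java.util.*;
import java.util.concurrent.*;

public class FederatedIndex {
    private final int numPartitions;

    public FederatedIndex(int numPartitions) { this.numPartitions = numPartitions; }

    /** Choose the index for a document based on a hash of its OID. */
    public int partitionFor(String oid) {
        return Math.abs(oid.hashCode() % numPartitions);
    }

    /** Run the same query against every partition in parallel and join the results. */
    public List<String> search(final String query) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numPartitions);
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
        for (int i = 0; i < numPartitions; i++) {
            final int partition = i;
            futures.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() throws Exception {
                    return searchOne(partition, query);   // per-partition Lucene search (assumed)
                }
            }));
        }
        List<String> merged = new ArrayList<String>();
        for (Future<List<String>> f : futures) {
            merged.addAll(f.get());                       // join results from all partitions
        }
        pool.shutdown();
        return merged;
    }

    /** Placeholder for the actual per-partition Lucene search. */
    protected List<String> searchOne(int partition, String query) {
        return Collections.emptyList();
    }
}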
On Jan 15, 2007, at 11:49 PM, Chuck Williams wrote:
My interest is transactions, not making doc-ids permanent.
Specifically, the ability to ensure that a group of adds either all
go into the index or none go into the index, and to ensure that if
none go into the index the index is not changed in any way.
I have UIDs, but they cannot ensure the latter property, i.e. they
cannot ensure side-effect-free rollbacks.
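For concreteness, the desired semantics look roughly like this
(TransactionalWriter is purely hypothetical, not a Lucene class; it only
illustrates the requirement):

// Hypothetical sketch of all-or-nothing adds with side-effect-free rollback.
import org.apache.lucene.document.Document;
import java.io.IOException;
import java.util.List;

interface TransactionalWriter {
    void begin() throws IOException;
    void addDocument(Document doc) throws IOException;  // buffered, not yet visible
    void commit() throws IOException;                    // all adds become visible atomically
    void rollback() throws IOException;                  // index left exactly as before begin()
}

class BatchAdd {
    static void addAll(TransactionalWriter writer, List<Document> batch) throws IOException {
        writer.begin();
        try {
            for (Document doc : batch) {
                writer.addDocument(doc);
            }
            writer.commit();
        } catch (IOException e) {
            writer.rollback();   // side-effect-free: the index is unchanged
            throw e;
        }
    }
}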
Yes, if you have no reliance on internal Lucene structures like
doc-ids and segments, then that shouldn't matter. But many
capabilities have such reliance for good reasons. E.g.,
ParallelReader, which is a public supported class in Lucene, requires
doc-id synchronization. There are similar good reasons for an
application to take advantage of doc-ids.
Lucene uses doc-ids in many of its APIs, so it is not surprising that
many applications rely on them, and I'm sure many misuse them, not
fully understanding the semantics and uncertainties of doc-id changes
due to merging segments with deletes.
Applications can use doc-ids for legitimate and beneficial purposes
while remaining semantically valid. Making such capabilities
efficient and robust in all cases is facilitated by application
control over when doc-ids and segment structure change, at a
granularity larger than the single Document.
If I had a vote it would be +1 on the direction Michael has proposed,
assuming it can be done robustly and without performance penalty.
Chuck
robert engels wrote on 01/15/2007 07:34 PM:
I honestly think that having a unique OID as an indexed field and
putting a layer on top of Lucene is the best solution to all of this.
It makes it almost trivial, and you can implement transaction
handling in a variety of ways.
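For example, a sketch of that OID layer (OidIndex is not a Lucene class,
and the field/updateDocument calls shown here follow the later Lucene 4+
style; the exact API differs by Lucene version):

// Every document carries an indexed "oid" field; updates and deletes are
// keyed by OID rather than by Lucene doc-id.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import java.io.IOException;

public class OidIndex {
    private final IndexWriter writer;

    public OidIndex(IndexWriter writer) { this.writer = writer; }

    /** Insert or replace the document identified by this OID. */
    public void put(String oid, Document doc) throws IOException {
        doc.add(new StringField("oid", oid, Field.Store.YES));
        writer.updateDocument(new Term("oid", oid), doc);   // delete-then-add keyed by OID
    }

    /** Delete the document identified by this OID. */
    public void delete(String oid) throws IOException {
        writer.deleteDocuments(new Term("oid", oid));
    }
}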
Attempting to make the doc ids "permanent" is a tough challenge,
considering the original design called for them to be "non-permanent".
It seems doubtful that you would not have some sort of primary key
anyway if you are this concerned about the transactional nature of
Lucene.
I vote -1 on all of this. I think it will detract from the simple and
efficient storage mechanism that Lucene uses.
On Jan 15, 2007, at 11:19 PM, Chuck Williams wrote:
Ning Li wrote on 01/15/2007 06:29 PM:
On 1/14/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
* The "support deleteDocuments in IndexWriter" (LUCENE-565)
feature
could have a more efficient implementation (just like Solr)
when
autoCommit is false, because deletes don't need to be flushed
until commit() is called. Whereas, now, they must be
aggressively
flushed on each checkpoint.
If a reader can only open snapshots both for search and for
modification, I think another change is needed besides the ones
listed: assume the latest snapshot is segments_5 and the latest
checkpoint is segmentsx_7 with 2 new segments, then a reader opens
snapshot segments_5, performs a few deletes and writes a new
checkpoint segmentsx_8. The summary file segmentsx_8 should include
the 2 new segments which are in segmentsx_7 but not in segments_5.
Such segments to include are easily identifiable only if they are not
merged with segments in the latest snapshot... All these won't be
necessary if a reader always opens the latest checkpoint for
modification, which will also support deletion of non-committed
documents.
This problem seems worse. I don't see how a reader and a writer can
independently compute and write checkpoints. The adds in the writer
don't just create new segments, they replace existing ones through
merging. And the merging changes doc-ids by expunging deletes. It
seems that all deletes must be based on the most recent checkpoint,
or merging of checkpoints to create the next snapshot will be
considerably more complex.
Chuck