I agree this is a long overdue feature... we need to get it into
Lucene somehow.

I like the Layers analogy... I think that will work well with Lucene's
transactional semantics, ie a prior commit point would continue to see
the index before the updates but new commit points would see the
updates.
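
A minimal sketch of that visibility rule, as a toy model only: the
class and method names (CommitPointSketch, set, commit, read) are
hypothetical and are not Lucene APIs. Each commit freezes a read-only
view, so a reader holding an older commit never sees later updates.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names are Lucene APIs.
final class CommitPointSketch {
    // Each commit is a frozen, read-only view of docID -> field value.
    private final List<Map<Integer, String>> commits = new ArrayList<>();
    // Changes made since the last commit; invisible to every existing commit.
    private final Map<Integer, String> pending = new HashMap<>();

    // Buffers an add or update until the next commit.
    void set(int docId, String value) {
        pending.put(docId, value);
    }

    // Publishes pending changes as a new commit and returns its generation.
    int commit() {
        Map<Integer, String> next = new HashMap<>();
        if (!commits.isEmpty()) {
            next.putAll(commits.get(commits.size() - 1));
        }
        next.putAll(pending);
        pending.clear();
        commits.add(Collections.unmodifiableMap(next));
        return commits.size() - 1;
    }

    // Reads a value as of the given commit generation.
    String read(int generation, int docId) {
        return commits.get(generation).get(docId);
    }
}
```

Copying the whole map on every commit is only for illustration; a
layered design would presumably store just the changed values per
commit.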

I think we would somehow want the new postings "layer" to be merged
cleanly under Docs/PositionsEnum, so that searching is unaffected --
ie the scorers just see a normal postings enum.
FieldCache would also just populate normally.  But somehow these
partial docs would have to not "count" as real docIDs... and the
normal merging of segments would coalesce these updates...
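
To make the masking and coalescing concrete, here is a rough sketch
under stated assumptions. It is a plain in-memory toy, and every name
in it (LayeredFieldSketch, updateField, merge, and so on) is
hypothetical rather than an existing Lucene API. An update goes into a
side layer keyed by the existing docID, reads resolve through the layer
first, and a merge folds the layer back into the base.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these names are Lucene APIs.
final class LayeredFieldSketch {
    // Original values for one field, keyed by docID.
    private final Map<Integer, String> base = new HashMap<>();
    // Update layer: newer values that mask the base for the same docID.
    private final Map<Integer, String> layer = new HashMap<>();

    void addDocument(int docId, String value) {
        base.put(docId, value);
    }

    // An update touches only the layer; no new docID is allocated.
    void updateField(int docId, String value) {
        if (!base.containsKey(docId)) {
            throw new IllegalArgumentException("unknown docID " + docId);
        }
        layer.put(docId, value);
    }

    // Readers see one merged view: the layer wins, else the base value.
    String value(int docId) {
        String masked = layer.get(docId);
        return masked != null ? masked : base.get(docId);
    }

    // Partial docs in the layer do not count as real documents.
    int docCount() {
        return base.size();
    }

    int pendingUpdates() {
        return layer.size();
    }

    // Coalesces the layer into the base, as a segment merge might.
    void merge() {
        base.putAll(layer);
        layer.clear();
    }
}
```

After addDocument(0, "draft") and updateField(0, "published"), value(0)
returns the updated value while docCount() stays at 1, and merge()
leaves the same visible value with no pending updates.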

Also: how would we handle stored fields & term vectors?

Mike

On Sat, Mar 27, 2010 at 7:25 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> First off, this is something I've had in my head for a long time, but don't
> have any code.
> As many of you know, one of the main things that vexes any search engine
> based on an inverted index is how to do fast updates of just one field w/o
> having to delete and re-add the whole document like we do today.   When I
> think about the whole update problem, I keep coming back to the notion of
> Photoshop (or any other real photo editing solution) Layers.  In a photo
> editing solution, when you want to hide/change a piece of a photo, it is
> considered best practice to add a layer over that part of the photo to be
> changed.  This way, the original photo is maintained and you don't have to
> worry about accidentally damaging the area you aren't interested in.  Thus,
> a layer is essentially a mask on the original photo. The analogy isn't quite
> the same here, but nevertheless...
>
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now.
> Then, when an update comes in on a particular field, we somehow mark in the
> index that the document in question is modified and then we add the new
> change onto the end of the index (just like we currently do when adding new
> docs, but this time it's just a doc w/ a single field). Then, when
> searching, we would, when scoring the affected documents, go to a secondary
> process that knew where to look up the incremental changes. As background
> merging takes place, these "disjoint" documents would be merged back
> together. We'd maybe even consider a "high update" merge scheduler that
> could more frequently handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes.  That is, is it
> something that goes in the posting list, or is it a side structure?  I think
> in the posting list would be too slow.  Also, perhaps it is worthwhile for
> people to indicate that a particular field is expected to be updated while
> others maintain their current format so as not to incur the penalty on each.
>
>  In a sense, the old field for that document is masked by the new field. I
> think, given proper index structure, that we maybe could make that marking
> of the old field fast (maybe it's a pointer to the new field, maybe it's
> just a bit indicating to go look in the "update" segment)
>
> On the search side, I think performance would still be maintained b/c even
> in high update envs. you aren't usually talking about more than a few
> thousand changes in a minute or two and the background merger would be
> responsible for keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it
> will spawn some questions/ideas that will help us work it out in a better
> way.
> At any rate, I think adding incr. field update capability would be a huge
> win for Lucene.
> -Grant
