[Late to this party, but thought I'd chime in]

I think this "layer" concept is right on. But I'm wondering about the life cycle of these layers. Do layers live forever? Or do they collapse at some point? (As I think was already pointed out, this would be like deletes, which are purged when segments are merged today.)
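To make the life-cycle question concrete, here is a rough sketch of one possible answer, where a layer lives only until the next merge. This is not Lucene code: `LayeredFieldStore` and every method on it are made-up names, and a plain in-memory map stands in for the index. It only illustrates the idea from the quoted proposal: updates are appended to a side structure, reads let the newest update mask the old field, and a merge folds the updates back into the base.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names exist in Lucene.
final class LayeredFieldStore {
    // One pending single-field update, appended after the original document.
    private static final class FieldUpdate {
        final int docId;
        final String field;
        final String value;

        FieldUpdate(int docId, String field, String value) {
            this.docId = docId;
            this.field = field;
            this.value = value;
        }
    }

    // Base layer: docId -> (field -> value), as written when the doc first came in.
    private final Map<Integer, Map<String, String>> base = new HashMap<>();
    // Update layer: append-only list of field changes that are not merged yet.
    private final List<FieldUpdate> layer = new ArrayList<>();

    void addDocument(int docId, Map<String, String> fields) {
        base.put(docId, new HashMap<>(fields));
    }

    // Records the change in the layer; the base document is left untouched.
    void updateField(int docId, String field, String value) {
        if (!base.containsKey(docId)) {
            throw new IllegalArgumentException("unknown document: " + docId);
        }
        layer.add(new FieldUpdate(docId, field, value));
    }

    // Read path: the newest pending update masks the base value, if there is one.
    String get(int docId, String field) {
        for (int i = layer.size() - 1; i >= 0; i--) {
            FieldUpdate u = layer.get(i);
            if (u.docId == docId && u.field.equals(field)) {
                return u.value;
            }
        }
        Map<String, String> doc = base.get(docId);
        return doc == null ? null : doc.get(field);
    }

    int pendingUpdates() {
        return layer.size();
    }

    // Merge: apply updates to the base in arrival order, then drop the layer.
    void merge() {
        for (FieldUpdate u : layer) {
            base.get(u.docId).put(u.field, u.value);
        }
        layer.clear();
    }

    public static void main(String[] args) {
        LayeredFieldStore store = new LayeredFieldStore();
        Map<String, String> fields = new HashMap<>();
        fields.put("title", "old title");
        store.addDocument(1, fields);
        store.updateField(1, "title", "new title");
        System.out.println(store.get(1, "title") + " / pending=" + store.pendingUpdates());
        store.merge();
        System.out.println(store.get(1, "title") + " / pending=" + store.pendingUpdates());
    }
}
```

In this reading the answer to "do layers live forever?" is no: the read cost grows with the number of pending updates, so something like the background merger would have to keep the layer short. Whether a real index would want layers to outlive a merge is exactly the open question.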
-Babak

On Sat, Mar 27, 2010 at 5:25 AM, Grant Ingersoll <gsing...@apache.org> wrote:

> First off, this is something I've had in my head for a long time, but don't
> have any code.
> As many of you know, one of the main things that vexes any search engine
> based on an inverted index is how to do fast updates of just one field w/o
> having to delete and re-add the whole document like we do today. When I
> think about the whole update problem, I keep coming back to the notion of
> Photoshop (or any other real photo editing solution) Layers. In a photo
> editing solution, when you want to hide/change a piece of a photo, it is
> considered best practice to add a layer over that part of the photo to be
> changed. This way, the original photo is maintained and you don't have to
> worry about accidentally damaging the area you aren't interested in. Thus,
> a layer is essentially a mask on the original photo. The analogy isn't quite
> the same here, but nevertheless...
>
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now.
> Then, when an update comes in on a particular field, we somehow mark in the
> index that the document in question is modified and then we add the new
> change onto the end of the index (just like we currently do when adding new
> docs, but this time it's just a doc w/ a single field). Then, when
> searching, we would, when scoring the affected documents, go to a secondary
> process that knew where to look up the incremental changes. As background
> merging takes place, these "disjoint" documents would be merged back
> together. We'd maybe even consider a "high update" merge scheduler that
> could more frequently handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes. That is, is it
> something that goes in the posting list, or is it a side structure? I think
> in the posting list would be too slow. Also, perhaps it is worthwhile for
> people to indicate that a particular field is expected to be updated while
> others maintain their current format so as not to incur the penalty on each.
>
> In a sense, the old field for that document is masked by the new field. I
> think, given proper index structure, that we maybe could make that marking
> of the old field fast (maybe it's a pointer to the new field, maybe it's
> just a bit indicating to go look in the "update" segment)
>
> On the search side, I think performance would still be maintained b/c even
> in high update envs. you aren't usually talking about more than a few
> thousand changes in a minute or two and the background merger would be
> responsible for keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it
> will spawn some questions/ideas that will help us work it out in a better
> way.
> At any rate, I think adding incr. field update capability would be a huge
> win for Lucene.
> -Grant