Of course, introducing the idea of updates also introduces the notion of a primary key, and there's probably an entirely separate discussion to be had around user-supplied vs. Lucene-generated keys.
That aside, the biggest concern for me here is the impact this is likely to have on search. Currently, queries such as "a:1 AND b:2" are streamed efficiently when evaluated because fields a and b have long postings lists, conveniently sorted in doc-id (insertion) order, that can be walked in sequence. If there are to be disjoint, partial docs, with updated contents arriving out of primary-key order, this is bound to either introduce costly disk seeks into the query process or require commit-time merges/sorts to preserve the doc-ordered postings lists needed to maintain search speed. Both of these strategies come at a significant cost.

Of course, some form of RAM-based value caching (allowing us to randomly look up the latest value for field b in doc x) is fast but probably only suited to small-scale deployments. It's probably worth thinking through the scenarios we want to cater for. Maybe a Digg-like scenario with users voting on document popularity *can* be catered for with RAM-based field caches because the data (count of votes) is small enough to cache?

Cheers,
Mark

On 27 Mar 2010, at 11:25, Grant Ingersoll wrote:

> First off, this is something I've had in my head for a long time, but don't
> have any code.
>
> As many of you know, one of the main things that vexes any search engine
> based on an inverted index is how to do fast updates of just one field w/o
> having to delete and re-add the whole document like we do today. When I
> think about the whole update problem, I keep coming back to the notion of
> Photoshop (or any other real photo editing solution) Layers. In a photo
> editing solution, when you want to hide/change a piece of a photo, it is
> considered best practice to add a layer over that part of the photo to be
> changed. This way, the original photo is maintained and you don't have to
> worry about accidentally damaging the area you aren't interested in. Thus, a
> layer is essentially a mask on the original photo.
> The analogy isn't quite the same here, but nevertheless...
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now.
> Then, when an update comes in on a particular field, we somehow mark in the
> index that the document in question is modified and then we add the new
> change onto the end of the index (just like we currently do when adding new
> docs, but this time it's just a doc w/ a single field). Then, when searching,
> we would, when scoring the affected documents, go to a secondary process that
> knew where to look up the incremental changes. As background merging takes
> place, these "disjoint" documents would be merged back together. We'd maybe
> even consider a "high update" merge scheduler that could more frequently
> handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes. That is, is it
> something that goes in the posting list, or is it a side structure? I think
> in the posting list would be too slow. Also, perhaps it is worthwhile for
> people to indicate that a particular field is expected to be updated while
> others maintain their current format so as not to incur the penalty on each.
> In a sense, the old field for that document is masked by the new field. I
> think, given proper index structure, that we maybe could make that marking of
> the old field fast (maybe it's a pointer to the new field, maybe it's just a
> bit indicating to go look in the "update" segment).
>
> On the search side, I think performance would still be maintained b/c even in
> high update envs. you aren't usually talking about more than a few thousand
> changes in a minute or two and the background merger would be responsible for
> keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it will
> spawn some questions/ideas that will help us work it out in a better way.
>
> At any rate, I think adding incr. field update capability would be a huge win
> for Lucene.
>
> -Grant
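
To make the trade-off in this thread concrete, here is a minimal in-memory sketch in Python of the "layers" idea and of the cost raised in the reply. It is not Lucene code and does not reflect Lucene's API or file formats; `LayeredIndex`, `intersect`, and every method name are hypothetical, and it assumes single-valued fields matched by exact value. Base postings stay in doc-id order, a field update masks the old value and appends to an overlay, and the overlay has to be re-sorted before the streaming AND can walk it, which is where the extra query-time or merge-time work shows up.

```python
from collections import defaultdict


def intersect(p1, p2):
    """Streaming AND: walk two doc-id-sorted postings lists in step."""
    i = j = 0
    hits = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            hits.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return hits


class LayeredIndex:
    """Toy model of base postings plus an update layer (hypothetical names)."""

    def __init__(self):
        self.base = defaultdict(list)     # (field, value) -> doc ids, ascending
        self.overlay = defaultdict(list)  # (field, value) -> doc ids, arrival order
        self.masked = defaultdict(set)    # field -> doc ids whose base value is hidden
        self.next_doc_id = 0

    def add_document(self, fields):
        """Index a whole document; doc ids are assigned in insertion order."""
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        for field, value in fields.items():
            self.base[(field, value)].append(doc_id)
        return doc_id

    def update_field(self, doc_id, field, value):
        """Mask the old value of one field and append the new one to the overlay."""
        for (f, _), docs in self.overlay.items():
            if f == field and doc_id in docs:
                docs.remove(doc_id)  # drop an earlier update to the same field
        self.masked[field].add(doc_id)
        self.overlay[(field, value)].append(doc_id)

    def postings(self, field, value):
        """Effective postings: unmasked base entries plus overlay entries."""
        hidden = self.masked[field]
        live = [d for d in self.base.get((field, value), []) if d not in hidden]
        # Overlay entries arrive out of doc-id order, so they must be sorted
        # before the result can be walked in sequence.
        return sorted(live + self.overlay.get((field, value), []))

    def search_and(self, term_a, term_b):
        """Evaluate a two-term AND query such as a:1 AND b:2."""
        return intersect(self.postings(*term_a), self.postings(*term_b))

    def merge(self):
        """Background merge: fold the overlay back into doc-ordered base postings."""
        keys = set(self.base) | set(self.overlay)
        merged = {key: self.postings(*key) for key in keys}
        self.base = defaultdict(list, {k: v for k, v in merged.items() if v})
        self.overlay.clear()
        self.masked.clear()
```

In this sketch the sort inside `postings` stands in for the query-time cost, and `merge` stands in for the background merge that keeps the number of disjoint documents low. A real implementation would have to do both against on-disk structures, which is exactly the open question above.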