Of course introducing the idea of updates also introduces the notion of a 
primary key and there's probably an entirely separate discussion to be had 
around user-supplied vs Lucene-generated keys.

That aside, the biggest concern for me here is the impact that this is likely 
to have on search -  currently queries such as "a:1 AND b:2" are streamed 
efficiently when evaluated because fields a and b have long postings lists 
conveniently sorted in doc-id insertion order that can be walked in sequence. 
If there are to be disjoint, partial  docs, with updated contents arriving 
out-of-primary-key-order this is bound to introduce costly disk seeks to the 
query process or require commit-time merges/sorts to preserve the doc-ordered 
posting lists needed to maintain search speed. Both of these strategies come at 
a reasonable cost. Of course some form of RAM-based value caching (allowing us 
to randomly look up the latest value for field b in doc x) is fast but probably 
only suited to small-scale deployments.

It's probably worth thinking through the scenarios we want to cater for. Maybe 
a Digg-like scenario with users voting on document popularity *can* be catered 
for with RAM-based field caches because the data (count of votes) is small 
enough to cache? 

Cheers,
Mark


On 27 Mar 2010, at 11:25, Grant Ingersoll wrote:

> First off, this is something I've had in my head for a long time, but don't 
> have any code.
> 
> As many of you know, one of the main things that vexes any search engine 
> based on an inverted index is how to do fast updates of just one field w/o 
> having to delete and re-add the whole document like we do today.   When I 
> think about the whole update problem, I keep coming back to the notion of 
> Photoshop (or any other real photo editing solution) Layers.  In a photo 
> editing solution, when you want to hide/change a piece of a photo, it is 
> considered best practice to add a layer over that part of the photo to be 
> changed.  This way, the original photo is maintained and you don't have to 
> worry about accidentally damaging the area you aren't interested in.  Thus, a 
> layer is essentially a mask on the original photo. The analogy isn't quite 
> the same here, but nevertheless...
> So, thinking out loud here and I'm not sure on the best wording of this: 
> 
> When a document first comes in, it is all in one place, just as it is now. 
> Then, when an update comes in on a particular field, we somehow mark in the 
> index that the document in question is modified and then we add the new 
> change onto the end of the index (just like we currently do when adding new 
> docs, but this time it's just a doc w/ a single field). Then, when searching, 
> we would, when scoring the affected documents, go to a secondary process that 
> knew where to look up the incremental changes. As background merging takes 
> place, these "disjoint" documents would be merged back together. We'd maybe 
> even consider a "high update" merge scheduler that could more frequently 
> handle these incremental merges.   
> 
> 
> I'm not sure where we would maintain the list of changes.  That is, is it 
> something that goes in the posting list, or is it a side structure.  I think 
> in the posting list would be to slow.  Also, perhaps it is worthwhile for 
> people to indicate that a particular field is expected to be updated while 
> others maintain their current format so as not to incur the penalty on each.
>  In a sense, the old field for that document is masked by the new field. I 
> think, given proper index structure, that we maybe could make that marking of 
> the old field fast (maybe it's a pointer to the new field, maybe it's just a 
> bit indicating to go look in the "update" segment)
> 
> On the search side, I think performance would still be maintained b/c even in 
> high update envs. you aren't usually talking about more than a few thousand 
> changes in a minute or two and the background merger would be responsible for 
> keeping the total number of disjoint documents low.
> 
> I realize there isn't a whole lot to go on here just yet, but perhaps it will 
> spawn some questions/ideas that will help us work it out in a better way.
> 
> At any rate, I think adding incr. field update capability would be a huge win 
> for Lucene.
> 
> -Grant

Reply via email to