On Mar 27, 2010, at 11:14 AM, Mark Harwood wrote: > Of course introducing the idea of updates also introduces the notion of a > primary key and there's probably an entirely separate discussion to be had > around user-supplied vs Lucene-generated keys.
Not sure I see that need. Can you explain your reasoning a bit more? > > That aside, the biggest concern for me here is the impact that this is likely > to have on search - currently queries such as "a:1 AND b:2" are streamed > efficiently when evaluated because fields a and b have long postings lists > conveniently sorted in doc-id insertion order that can be walked in sequence. > If there are to be disjoint, partial docs, with updated contents arriving > out-of-primary-key-order this is bound to introduce costly disk seeks to the > query process or require commit-time merges/sorts to preserve the doc-ordered > posting lists needed to maintain search speed. Both of these strategies come > at a reasonable cost. Of course some form of RAM-based value caching > (allowing us to randomly look up the latest value for field b in doc x) is > fast but probably only suited to small-scale deployments. Indeed, part of me thinks this is especially suited for flex indexing, where I can make a design time decision to pay the cost in exchange for high updates at the cost of potentially slower search. > > It's probably worth thinking through the scenarios we want to cater for. > Maybe a Digg-like scenario with users voting on document popularity *can* be > catered for with RAM-based field caches because the data (count of votes) is > small enough to cache? Agreed. Many social applications require updating one or two fields very frequently (popularity, ratings, votes, etc.) > > Cheers, > Mark > > > On 27 Mar 2010, at 11:25, Grant Ingersoll wrote: > >> First off, this is something I've had in my head for a long time, but don't >> have any code. >> >> As many of you know, one of the main things that vexes any search engine >> based on an inverted index is how to do fast updates of just one field w/o >> having to delete and re-add the whole document like we do today. When I >> think about the whole update problem, I keep coming back to the notion of >> Photoshop (or any other real photo editing solution) Layers. In a photo >> editing solution, when you want to hide/change a piece of a photo, it is >> considered best practice to add a layer over that part of the photo to be >> changed. This way, the original photo is maintained and you don't have to >> worry about accidentally damaging the area you aren't interested in. Thus, >> a layer is essentially a mask on the original photo. The analogy isn't quite >> the same here, but nevertheless... >> So, thinking out loud here and I'm not sure on the best wording of this: >> >> When a document first comes in, it is all in one place, just as it is now. >> Then, when an update comes in on a particular field, we somehow mark in the >> index that the document in question is modified and then we add the new >> change onto the end of the index (just like we currently do when adding new >> docs, but this time it's just a doc w/ a single field). Then, when >> searching, we would, when scoring the affected documents, go to a secondary >> process that knew where to look up the incremental changes. As background >> merging takes place, these "disjoint" documents would be merged back >> together. We'd maybe even consider a "high update" merge scheduler that >> could more frequently handle these incremental merges. >> >> >> I'm not sure where we would maintain the list of changes. That is, is it >> something that goes in the posting list, or is it a side structure. I think >> in the posting list would be to slow. Also, perhaps it is worthwhile for >> people to indicate that a particular field is expected to be updated while >> others maintain their current format so as not to incur the penalty on each. >> In a sense, the old field for that document is masked by the new field. I >> think, given proper index structure, that we maybe could make that marking >> of the old field fast (maybe it's a pointer to the new field, maybe it's >> just a bit indicating to go look in the "update" segment) >> >> On the search side, I think performance would still be maintained b/c even >> in high update envs. you aren't usually talking about more than a few >> thousand changes in a minute or two and the background merger would be >> responsible for keeping the total number of disjoint documents low. >> >> I realize there isn't a whole lot to go on here just yet, but perhaps it >> will spawn some questions/ideas that will help us work it out in a better >> way. >> >> At any rate, I think adding incr. field update capability would be a huge >> win for Lucene. >> >> -Grant >