On Mar 27, 2010, at 11:14 AM, Mark Harwood wrote:

> Of course introducing the idea of updates also introduces the notion of a 
> primary key and there's probably an entirely separate discussion to be had 
> around user-supplied vs Lucene-generated keys.

Not sure I see that need.  Can you explain your reasoning a bit more?

> 
> That aside, the biggest concern for me here is the impact that this is likely 
> to have on search -  currently queries such as "a:1 AND b:2" are streamed 
> efficiently when evaluated because fields a and b have long postings lists 
> conveniently sorted in doc-id insertion order that can be walked in sequence. 
> If there are to be disjoint, partial  docs, with updated contents arriving 
> out-of-primary-key-order this is bound to introduce costly disk seeks to the 
> query process or require commit-time merges/sorts to preserve the doc-ordered 
> posting lists needed to maintain search speed. Both of these strategies come 
> at a reasonable cost. Of course some form of RAM-based value caching 
> (allowing us to randomly look up the latest value for field b in doc x) is 
> fast but probably only suited to small-scale deployments.

Indeed, part of me thinks this is especially well suited to flex indexing, where 
I can make a design-time decision to accept potentially slower search in 
exchange for fast, frequent field updates.
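
To make that search-time concern concrete, here's a toy sketch (plain Java, 
nothing Lucene-specific, all names invented) of the doc-ordered AND walk Mark 
is describing and why out-of-order partial docs hurt it:

import java.util.ArrayList;
import java.util.List;

// Toy sketch: why "a:1 AND b:2" is cheap when both postings lists are sorted
// by docId -- each list is walked once, in step, with no random seeks.
class ConjunctionWalk {
  static List<Integer> intersect(int[] postingsA, int[] postingsB) {
    List<Integer> hits = new ArrayList<Integer>();
    int i = 0, j = 0;
    while (i < postingsA.length && j < postingsB.length) {
      if (postingsA[i] == postingsB[j]) {
        hits.add(postingsA[i]);
        i++;
        j++;
      } else if (postingsA[i] < postingsB[j]) {
        i++;
      } else {
        j++;
      }
    }
    // Out-of-order partial docs break the "sorted by docId" assumption this
    // loop relies on, which is where the seek/merge cost would come from.
    return hits;
  }
}

Paying some extra per-doc lookup cost only for fields someone has declared 
updatable is the kind of design-time decision I have in mind.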


> 
> It's probably worth thinking through the scenarios we want to cater for. 
> Maybe a Digg-like scenario with users voting on document popularity *can* be 
> catered for with RAM-based field caches because the data (count of votes) is 
> small enough to cache? 

Agreed.  Many social applications require updating one or two fields very 
frequently (popularity, ratings, votes, etc.)
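
For that narrow case the RAM-resident cache really is tiny; a rough sketch 
(made-up names, not an actual Lucene structure) is just an int per doc that 
updates mutate in place and scoring reads by docId:

// Toy sketch: a RAM-resident vote/popularity cache, one int per doc.
// Updates mutate it in place; scoring reads it at query time.
class VoteCountCache {
  private final int[] votes;           // indexed by docId

  VoteCountCache(int maxDoc) {
    this.votes = new int[maxDoc];
  }

  void addVote(int docId) {
    votes[docId]++;
  }

  int get(int docId) {
    return votes[docId];
  }
}

Back-of-the-envelope: an int per doc is about 40 MB of heap at 10 million 
docs, which is why this works for counts but probably not for arbitrary 
stored fields.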


> 
> Cheers,
> Mark
> 
> 
> On 27 Mar 2010, at 11:25, Grant Ingersoll wrote:
> 
>> First off, this is something I've had in my head for a long time, but don't 
>> have any code.
>> 
>> As many of you know, one of the main things that vexes any search engine 
>> based on an inverted index is how to do fast updates of just one field w/o 
>> having to delete and re-add the whole document like we do today.   When I 
>> think about the whole update problem, I keep coming back to the notion of 
>> Photoshop (or any other real photo editing solution) Layers.  In a photo 
>> editing solution, when you want to hide/change a piece of a photo, it is 
>> considered best practice to add a layer over that part of the photo to be 
>> changed.  This way, the original photo is maintained and you don't have to 
>> worry about accidentally damaging the area you aren't interested in.  Thus, 
>> a layer is essentially a mask on the original photo. The analogy isn't quite 
>> the same here, but nevertheless...
>> So, thinking out loud here (I'm not sure of the best wording for this): 
>> 
>> When a document first comes in, it is all in one place, just as it is now. 
>> Then, when an update comes in on a particular field, we somehow mark in the 
>> index that the document in question is modified, and we add the new change 
>> onto the end of the index (just like we currently do when adding new docs, 
>> but this time it's just a doc with a single field). Then, when scoring the 
>> affected documents at search time, we would go to a secondary process that 
>> knows where to look up the incremental changes. As background merging takes 
>> place, these "disjoint" documents would be merged back together. We'd maybe 
>> even consider a "high update" merge scheduler that could handle these 
>> incremental merges more frequently. 
>> 
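To expand a bit on what I was picturing there, here's a toy sketch (plain 
Java, every name invented, nothing here is a real Lucene API): updates get 
appended as small (docId, field, value) records, searches consult them for 
affected docs, and the background merge folds them back into whole documents.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the "layer" idea: a field update is appended like a tiny
// single-field doc; a background merge folds the layer back into the base.
class LayeredIndex {
  static class FieldUpdate {
    final int docId;
    final String field;
    final String value;
    FieldUpdate(int docId, String field, String value) {
      this.docId = docId;
      this.field = field;
      this.value = value;
    }
  }

  private final Map<Integer, Map<String, String>> baseDocs =
      new HashMap<Integer, Map<String, String>>();
  private final List<FieldUpdate> updateLayer = new ArrayList<FieldUpdate>();

  void addDocument(int docId, Map<String, String> fields) {
    baseDocs.put(docId, fields);
  }

  // An update is just appended, in arrival order, like adding a new doc.
  void updateField(int docId, String field, String value) {
    updateLayer.add(new FieldUpdate(docId, field, value));
  }

  // Read path: the newest layered value masks the original one.
  String getField(int docId, String field) {
    for (int i = updateLayer.size() - 1; i >= 0; i--) {
      FieldUpdate u = updateLayer.get(i);
      if (u.docId == docId && u.field.equals(field)) {
        return u.value;
      }
    }
    return baseDocs.get(docId).get(field);
  }

  // What the background merge would do: fold the layer into the base docs so
  // searches go back to reading plain, doc-ordered data.
  void mergeLayer() {
    for (FieldUpdate u : updateLayer) {
      baseDocs.get(u.docId).put(u.field, u.value);
    }
    updateLayer.clear();
  }
}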
>> 
>> I'm not sure where we would maintain the list of changes.  That is, is it 
>> something that goes in the posting list, or is it a side structure?  I think 
>> putting it in the posting list would be too slow.  Also, perhaps it is 
>> worthwhile to let people indicate that a particular field is expected to be 
>> updated, while other fields keep their current format so as not to incur the 
>> penalty on every field.  In a sense, the old field for that document is 
>> masked by the new field.  I think, given the proper index structure, we 
>> could make that marking of the old field fast (maybe it's a pointer to the 
>> new field, maybe it's just a bit indicating to go look in the "update" 
>> segment).
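
For a single updatable field I could see that side structure being as simple 
as a bit per doc plus a small map, something like this toy sketch (again, 
invented names, not a real codec format):

import java.util.HashMap;
import java.util.Map;

// Toy sketch of the "mask" idea for one updatable field: a bit per doc says
// "go look in the update side-structure", which holds the latest value.
class MaskedField {
  private final String[] originalValues;              // doc-ordered, as indexed
  private final boolean[] masked;                     // one bit per doc
  private final Map<Integer, String> sideUpdates =
      new HashMap<Integer, String>();                 // docId -> latest value

  MaskedField(String[] originalValues) {
    this.originalValues = originalValues;
    this.masked = new boolean[originalValues.length];
  }

  void update(int docId, String newValue) {
    masked[docId] = true;
    sideUpdates.put(docId, newValue);
  }

  String value(int docId) {
    return masked[docId] ? sideUpdates.get(docId) : originalValues[docId];
  }
}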
>> 
>> On the search side, I think performance would still be maintained because 
>> even in high-update environments you aren't usually talking about more than 
>> a few thousand changes in a minute or two, and the background merger would 
>> be responsible for keeping the total number of disjoint documents low.
>> 
>> I realize there isn't a whole lot to go on here just yet, but perhaps it 
>> will spawn some questions/ideas that will help us work it out in a better 
>> way.
>> 
>> At any rate, I think adding incremental field update capability would be a 
>> huge win for Lucene.
>> 
>> -Grant
> 

