Re: Incremental Field Updates
Good point. I meant the model at the document level, i.e. what milestones a document goes through in its life cycle.

Today:
  created --> deleted

With incremental updates:
  created --> update1 --> update2 --> deleted

What I'm trying to say is that this second, threaded sequence of state changes seems intuitively more fragile under concurrent scenarios. For example, in a lock-free design, the system would also have to anticipate the following sequence of events:
  created --> update1 --> deleted --> update2
and consider update2 a null op. I imagine there are other cases that I can't think of.

-Babak

On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless wrote:
> write once, plus the option to the app to keep multiple commit points
> around (by customizing the deletion policy).
>
> Actually order of operations / commits very much matters in Lucene today.
>
> Deletions are not idempotent: if you add a doc w/ term X, delete by
> term X, add a new doc with term X... that's very different than if you
> moved the delete op to the end. Ie the deletion only applies to the
> docs added before it.
>
> Mike
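The life cycle above can be sketched as a tiny state machine. This is only a toy model in Python, not Lucene code; the class name DocState and its methods are made up for illustration. It shows one way a late update2, arriving after the delete, could be treated as a null op.

```python
from dataclasses import dataclass, field


@dataclass
class DocState:
    """Toy model of one document's life cycle (hypothetical, not Lucene code)."""
    fields: dict = field(default_factory=dict)
    deleted: bool = False
    applied: list = field(default_factory=list)

    def update(self, name, changes):
        # An update that arrives after the delete is treated as a null op.
        if self.deleted:
            return False
        self.fields.update(changes)
        self.applied.append(name)
        return True

    def delete(self):
        # Deleting an already deleted document changes nothing.
        self.deleted = True


# created --> update1 --> deleted --> update2
doc = DocState({"title": "v0"})
doc.update("update1", {"title": "v1"})
doc.delete()
late = doc.update("update2", {"title": "v2"})
print(late, doc.fields["title"], doc.applied)  # False v1 ['update1']
```

A real lock-free design would also need to decide how to order the events in the first place; the sketch assumes they arrive already serialized.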
Re: Incremental Field Updates
write once, plus the option to the app to keep multiple commit points around (by customizing the deletion policy).

Actually, order of operations / commits very much matters in Lucene today.

Deletions are not idempotent: if you add a doc w/ term X, delete by term X, then add a new doc with term X... that's very different than if you moved the delete op to the end. I.e. the deletion only applies to the docs added before it.

Mike

On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang wrote:
> Sure. Because of the write once principle. But at some cost
> (duplicated data). I was just agreeing that it would not be a good
> idea to bake in version-ing by keeping the layers around forever in a
> merged index; I wasn't keying in on transactions per se.
>
> Speaking of transactions: I'm not sure if we should worry about this
> much yet, but with "updates" the order of the transaction commits
> seems important. I think commit order is less important today in
> Lucene because its model supports only 2 types of events: document
> creation--which only happens once, and document deletion, which is
> idempotent. What do you think? Will commits have to be ordered if we
> introduce updates? Or does the onus of maintaining order fall on the
> application?
>
> -Babak
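The add / delete-by-term example above can be replayed with a toy operation log. This is a sketch in Python, not Lucene code; replay and the tuple format are assumptions made for illustration. It only demonstrates that moving the delete changes which documents survive.

```python
def replay(ops):
    """Apply ("add", term) / ("delete", term) ops in order; return live docs."""
    live = []       # list of (doc_id, term)
    next_id = 0
    for kind, term in ops:
        if kind == "add":
            live.append((next_id, term))
            next_id += 1
        elif kind == "delete":
            # A delete-by-term only affects docs added before it.
            live = [(d, t) for d, t in live if t != term]
    return live


delete_in_middle = [("add", "X"), ("delete", "X"), ("add", "X")]
delete_at_end = [("add", "X"), ("add", "X"), ("delete", "X")]

print(replay(delete_in_middle))  # [(1, 'X')]
print(replay(delete_at_end))     # []
```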
Re: Incremental Field Updates
Sure. Because of the write once principle. But at some cost (duplicated data). I was just agreeing that it would not be a good idea to bake in version-ing by keeping the layers around forever in a merged index; I wasn't keying in on transactions per se.

Speaking of transactions: I'm not sure if we should worry about this much yet, but with "updates" the order of the transaction commits seems important. I think commit order is less important today in Lucene because its model supports only 2 types of events: document creation--which only happens once, and document deletion, which is idempotent. What do you think? Will commits have to be ordered if we introduce updates? Or does the onus of maintaining order fall on the application?

-Babak

On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless wrote:
> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang wrote:
>>> I think they get merged in by the merger, ideally in the background.
>>
>> That sounds sensible. (In other words, we wont concern ourselves with
>> roll backs--something possible while a "layer" is still around.)
>
> Actually roll backs would still be very possible even if layers are merged.
>
> Ie, one could keep multiple commits around, and the older commits
> would still be referring to the old postings + layers, keeping them
> alive.
>
> Lucene would still be transactional with such an approach.
>
> Mike
Re: Incremental Field Updates
On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang wrote:
>> I think they get merged in by the merger, ideally in the background.
>
> That sounds sensible. (In other words, we wont concern ourselves with
> roll backs--something possible while a "layer" is still around.)

Actually roll backs would still be very possible even if layers are merged.

I.e., one could keep multiple commits around, and the older commits would still be referring to the old postings + layers, keeping them alive.

Lucene would still be transactional with such an approach.

Mike
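A rough way to picture "older commits keep the old postings + layers alive" is a toy index where each commit pins the files it references. This is a Python sketch under stated assumptions: ToyIndex, keep_commits and the file names are hypothetical, and the "keep the last N commits" rule merely stands in for a customized deletion policy.

```python
class ToyIndex:
    """Toy model of commit points that pin the files they reference."""

    def __init__(self, keep_commits):
        self.keep_commits = keep_commits  # hypothetical policy: keep last N commits
        self.commits = []                 # each commit is a frozenset of file names
        self.files = set()

    def commit(self, file_names):
        self.files |= set(file_names)
        self.commits.append(frozenset(file_names))
        self.commits = self.commits[-self.keep_commits:]
        # Only files that no retained commit references can be removed.
        referenced = set().union(*self.commits)
        self.files &= referenced


index = ToyIndex(keep_commits=2)
index.commit({"postings_0", "layer_1"})  # base postings plus one update layer
index.commit({"merged_2"})               # the merger folded the layer in
print(sorted(index.files))  # ['layer_1', 'merged_2', 'postings_0']

index.commit({"merged_2", "layer_3"})    # oldest commit falls out of the window
print(sorted(index.files))  # ['layer_3', 'merged_2']
```

While the first commit is retained, a rollback to it is still possible because its files are still on disk, even though the layer has already been merged in the newer commit.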
Re: Incremental Field Updates
Grant,

Reading your post a 3rd time, I see my "suggestion" is in fact the approach you describe. Sorry for being redundant.

-Babak
Re: Incremental Field Updates
> I think they get merged in by the merger, ideally in the background.

That sounds sensible. (In other words, we won't concern ourselves with roll backs--something possible while a "layer" is still around.)

I've been thinking about this problem also. One approach discussed earlier on these mailing lists has been to somehow maintain a parallel index of the updatable fields in such a way that the docIds of the parallel index remain in sync with the "master" index. Mike McCandless and I were discussing some variants of this approach a few months back:

http://markmail.org/message/uifz5v37k6qxxhvz?q=%22incremental+document+field+update%22+site:markmail%2Eorg&page=1&refer=ipebtbf24y7rleps

That approach involved the concept of mapping (chaining, if you will) internal docIds to view ids. That docId mapping concept sounds analogous to the layer concept we are discussing now.

I now think the parallel index approach may not be such a great idea after all: it simply pushes the problem to the edge--the slave index. If we can solve the update problem in the slave index, I reason, then shouldn't we also be able to solve the same update problem in the master index (and thereby remove the necessity of maintaining a (user-level) parallel index in the first place)?

Which seems to align with the approach being discussed here.

I imagine the "layers" being discussed here are somehow threaded by docId. That is, given a docId, you can quickly find its "layers." If so, then the docId mapping idea may be one way to thread these layers. (A logical document would be constructed by a chain of docIds, each overriding the previous for each field it defines (or deletes).) Such a construction would have to be "merge-aware" (perhaps using machinery similar to that used in LUCENE-1879) in order that it may maintain the docId chain.

What do you think?

On Fri, Apr 2, 2010 at 4:56 AM, Grant Ingersoll wrote:
> On Apr 2, 2010, at 2:50 AM, Babak Farhang wrote:
>
>> [Late to this party, but thought I'd chime in]
>>
>> I think this "layer" concept is right on. But I'm wondering about the
>> life cycle of these layers. Do layers live forever? Or do they
>> collapse at some point? (Like, as I think was already pointed out,
>> deletes are when segments are merged today.)
>
> I think they get merged in by the merger, ideally in the background.
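The docId chain idea can be sketched as follows. This is a minimal Python illustration, not Lucene code: resolve, the DELETED sentinel and the sample docIds are all assumptions. A logical document is built by walking the chain oldest first, letting each docId override or delete the fields it mentions.

```python
DELETED = object()  # sentinel: this link in the chain deletes the field


def resolve(chain, segments):
    """Build the logical document for a chain of docIds, oldest first."""
    logical = {}
    for doc_id in chain:
        for name, value in segments[doc_id].items():
            if value is DELETED:
                logical.pop(name, None)
            else:
                logical[name] = value
    return logical


segments = {
    7: {"title": "Lucene", "price": "10", "tag": "old"},  # original document
    42: {"price": "9"},                                   # first partial update
    58: {"tag": DELETED},                                 # second update drops a field
}
print(resolve([7, 42, 58], segments))  # {'title': 'Lucene', 'price': '9'}
```

Keeping the chain [7, 42, 58] valid when segments are merged and docIds are renumbered is the "merge-aware" part, which the sketch does not attempt.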
Re: Incremental Field Updates
On Apr 2, 2010, at 2:50 AM, Babak Farhang wrote:
> [Late to this party, but thought I'd chime in]
>
> I think this "layer" concept is right on. But I'm wondering about the
> life cycle of these layers. Do layers live forever? Or do they
> collapse at some point? (Like, as I think was already pointed out,
> deletes are when segments are merged today.)

I think they get merged in by the merger, ideally in the background.
Re: Incremental Field Updates
[Late to this party, but thought I'd chime in]

I think this "layer" concept is right on. But I'm wondering about the life cycle of these layers. Do layers live forever? Or do they collapse at some point? (Like, as I think was already pointed out, deletes are when segments are merged today.)

-Babak

On Sat, Mar 27, 2010 at 5:25 AM, Grant Ingersoll wrote:
> First off, this is something I've had in my head for a long time, but don't
> have any code.
> As many of you know, one of the main things that vexes any search engine
> based on an inverted index is how to do fast updates of just one field w/o
> having to delete and re-add the whole document like we do today. When I
> think about the whole update problem, I keep coming back to the notion of
> Photoshop (or any other real photo editing solution) Layers. In a photo
> editing solution, when you want to hide/change a piece of a photo, it is
> considered best practice to add a layer over that part of the photo to be
> changed. This way, the original photo is maintained and you don't have to
> worry about accidentally damaging the area you aren't interested in. Thus,
> a layer is essentially a mask on the original photo. The analogy isn't quite
> the same here, but nevertheless...
>
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now.
> Then, when an update comes in on a particular field, we somehow mark in the
> index that the document in question is modified and then we add the new
> change onto the end of the index (just like we currently do when adding new
> docs, but this time it's just a doc w/ a single field). Then, when
> searching, we would, when scoring the affected documents, go to a secondary
> process that knew where to look up the incremental changes. As background
> merging takes place, these "disjoint" documents would be merged back
> together. We'd maybe even consider a "high update" merge scheduler that
> could more frequently handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes. That is, is it
> something that goes in the posting list, or is it a side structure? I think
> in the posting list would be too slow. Also, perhaps it is worthwhile for
> people to indicate that a particular field is expected to be updated while
> others maintain their current format so as not to incur the penalty on each.
>
> In a sense, the old field for that document is masked by the new field. I
> think, given proper index structure, that we maybe could make that marking
> of the old field fast (maybe it's a pointer to the new field, maybe it's
> just a bit indicating to go look in the "update" segment)
>
> On the search side, I think performance would still be maintained b/c even
> in high update envs. you aren't usually talking about more than a few
> thousand changes in a minute or two and the background merger would be
> responsible for keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it
> will spawn some questions/ideas that will help us work it out in a better
> way.
> At any rate, I think adding incr. field update capability would be a huge
> win for Lucene.
> -Grant
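The masking idea in the quoted proposal (a per-document marker that says "go look in the update segment", with a background merge that folds changes back in) could look roughly like this. It is a Python toy under stated assumptions: LayeredSegment and its method names are invented here, and real postings, scoring and segment files are not modelled at all.

```python
class LayeredSegment:
    """Toy base segment with a side structure holding field updates."""

    def __init__(self, docs):
        self.docs = docs      # doc_id -> {field: value}, written once
        self.updated = set()  # per-doc marker: "go look in the update layer"
        self.updates = {}     # side structure: doc_id -> {field: new value}

    def update_field(self, doc_id, name, value):
        # The base document is left untouched; the change is appended on the side.
        self.updated.add(doc_id)
        self.updates.setdefault(doc_id, {})[name] = value

    def get(self, doc_id, name):
        # The new field masks the old one when the marker is set.
        if doc_id in self.updated and name in self.updates[doc_id]:
            return self.updates[doc_id][name]
        return self.docs[doc_id][name]

    def merge(self):
        # Background merge: fold the "disjoint" pieces back together.
        for doc_id, changes in self.updates.items():
            self.docs[doc_id].update(changes)
        self.updated.clear()
        self.updates.clear()


seg = LayeredSegment({0: {"title": "a", "price": "10"}, 1: {"title": "b", "price": "20"}})
seg.update_field(0, "price", "8")
before = (seg.get(0, "price"), seg.docs[0]["price"])
seg.merge()
after = (seg.get(0, "price"), seg.docs[0]["price"], len(seg.updated))
print(before, after)  # ('8', '10') ('8', '8', 0)
```

The lookup cost grows with the number of unmerged updates, which is why the proposal leans on the merger to keep the count of disjoint documents low.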
Re: Incremental Field Updates
On Mar 29, 2010, at 10:11 AM, mark harwood wrote:
>> Of course, but what about the Lucene doc id doesn't provide that?
>
> The question being how you determine the correct doc id to use in the first
> place (especially when they are known to be volatile) - the current answer is
> to use a stable identifier term which your app holds in the index, AKA a
> primary key.
> To support single-doc updates, app developers currently have to:
> a) allocate keys uniquely
> b) ensure they do not store >1 document with the same key.
>
> My suggestion was, these being fundamental requirements to supporting updates,
> Lucene could, as a convenience, provide some support for this in its API -
> in the same way a database typically does.

I don't think Lucene needs a primary key. I don't see why this number can't be determined in the usual ways.

> Earwin has perhaps extended your (and my) original thinking to incorporate
> set-based updates (a single set of values applied to many documents which
> match a query).
> His proposal (correct me if I'm wrong, Earwin) is that single and set-based
> changes could both be supported by a single
> IndexWriter.updateDocuments(query, changedFields) type method.
> The benefit of this scheme is that we are providing a simple method, re-using
> established concepts (Queries for document selection) but this does not
> change the fact that many users will still need to use primary keys for
> single-doc updates and they have to assume responsibility for a) and b) above.

Hmmm, this sounds like the Parallel Incr. Indexing Busch has put up in a patch.

> On reflection, I guess these responsibilities are not too tough.
> a) is catered for by the fact that Lucene is not typically the master data
> store (yet!) and filesystem/webserver/database datasources where document
> content is sourced usually have the responsibility to allocate some form of
> unique identifier in the form of URLs, database keys or filenames which can
> be used. Also, b) is not too hard to handle in app code if you always use the
> IndexWriter.updateDocument(term,doc) method for inserts.
>
> Cheers,
> Mark

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Incremental Field Updates
> Of course, but what about the Lucene doc id doesn't provide that?

The question being how you determine the correct doc id to use in the first place (especially when they are known to be volatile) - the current answer is to use a stable identifier term which your app holds in the index, AKA a primary key.

To support single-doc updates, app developers currently have to:
a) allocate keys uniquely
b) ensure they do not store >1 document with the same key.

My suggestion was, these being fundamental requirements to supporting updates, Lucene could, as a convenience, provide some support for this in its API - in the same way a database typically does.

Earwin has perhaps extended your (and my) original thinking to incorporate set-based updates (a single set of values applied to many documents which match a query). His proposal (correct me if I'm wrong, Earwin) is that single and set-based changes could both be supported by a single IndexWriter.updateDocuments(query, changedFields) type method. The benefit of this scheme is that we are providing a simple method, re-using established concepts (Queries for document selection), but this does not change the fact that many users will still need to use primary keys for single-doc updates and they have to assume responsibility for a) and b) above.

On reflection, I guess these responsibilities are not too tough. a) is catered for by the fact that Lucene is not typically the master data store (yet!) and filesystem/webserver/database datasources where document content is sourced usually have the responsibility to allocate some form of unique identifier in the form of URLs, database keys or filenames which can be used. Also, b) is not too hard to handle in app code if you always use the IndexWriter.updateDocument(term,doc) method for inserts.

Cheers,
Mark

From: Grant Ingersoll
To: java-dev@lucene.apache.org
Sent: Mon, 29 March, 2010 13:11:56
Subject: Re: Incremental Field Updates

On Mar 29, 2010, at 2:26 AM, Mark Harwood wrote:
>>> Of course introducing the idea of updates also introduces the notion of a
>>> primary key and there's probably an entirely separate discussion to be had
>>> around user-supplied vs Lucene-generated keys.
>>
>> Not sure I see that need. Can you explain your reasoning a bit more?
>
> If you want to update a document you need a way of expressing *which*
> document you are updating.

Of course, but what about the Lucene doc id doesn't provide that?
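Point b) above relies on the delete-then-add behaviour of IndexWriter.updateDocument(term, doc). The effect on key uniqueness can be shown with a toy writer. This is a Python sketch, not the Lucene API: ToyWriter and its method signatures are made up, and only the "delete by key term, then add" behaviour is modelled.

```python
class ToyWriter:
    """Toy writer contrasting plain adds with update-by-key-term."""

    def __init__(self):
        self.docs = []

    def add_document(self, doc):
        # A plain add never checks for an existing key.
        self.docs.append(dict(doc))

    def update_document(self, key_field, key, doc):
        # Delete every doc carrying the key term, then add the new doc.
        self.docs = [d for d in self.docs if d.get(key_field) != key]
        self.add_document(doc)


plain = ToyWriter()
plain.add_document({"id": "url-1", "title": "first"})
plain.add_document({"id": "url-1", "title": "second"})

keyed = ToyWriter()
keyed.update_document("id", "url-1", {"id": "url-1", "title": "first"})
keyed.update_document("id", "url-1", {"id": "url-1", "title": "second"})

print(len(plain.docs), len(keyed.docs))  # 2 1
```

Using the update call for every insert is what keeps at most one document per key; the index itself enforces nothing.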
Re: AW: Incremental Field Updates
On 2010-03-29 15:11, Uwe Goetzke wrote:
> They filed this as a patent, too:
> http://www.freepatentsonline.com/y2009/0228528.html

... which is not granted yet, right? It's a patent application. Besides, I live in the EU ;)

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
AW: Incremental Field Updates
They filed this as a patent, too:
http://www.freepatentsonline.com/y2009/0228528.html

Regards
Uwe Goetzke

-----Original Message-----
From: Andrzej Bialecki [mailto:a...@getopt.org]
Sent: Monday, 29 March 2010 14:50
To: java-dev@lucene.apache.org
Subject: Re: Incremental Field Updates

On 2010-03-29 12:26, Michael McCandless wrote:
> I agree this is a long overdue feature... we need to get it into
> Lucene somehow.
>
> I like the Layers analogy... I think that will work well with Lucene's
> transactional semantics, ie a prior commit point would continue to see
> the index before the updates but new commit points would see the
> updates.

I'm coming late to this discussion ... are you guys familiar with this paper? It seems to describe the same model of incremental field-level updates, and the algo operates on internal Lucene ids:

http://portal.acm.org/citation.cfm?id=1458171
Re: Incremental Field Updates
On 2010-03-29 12:26, Michael McCandless wrote:
> I agree this is a long overdue feature... we need to get it into
> Lucene somehow.
>
> I like the Layers analogy... I think that will work well with Lucene's
> transactional semantics, ie a prior commit point would continue to see
> the index before the updates but new commit points would see the
> updates.

I'm coming late to this discussion ... are you guys familiar with this paper? It seems to describe the same model of incremental field-level updates, and the algo operates on internal Lucene ids:

http://portal.acm.org/citation.cfm?id=1458171

--
Best regards,
Andrzej Bialecki
Re: Incremental Field Updates
On Mar 29, 2010, at 2:26 AM, Mark Harwood wrote:
>>> Of course introducing the idea of updates also introduces the notion of a
>>> primary key and there's probably an entirely separate discussion to be had
>>> around user-supplied vs Lucene-generated keys.
>>
>> Not sure I see that need. Can you explain your reasoning a bit more?
>
> If you want to update a document you need a way of expressing *which*
> document you are updating.

Of course, but what about the Lucene doc id doesn't provide that?
Re: Incremental Field Updates
>> Who ever said that some_condition should point to a unique document?
>
> My assumption was, for now, we were still talking about the simpler case of
> updating a single document.
> If we extend the discussion to support set-based updates it's worth
> considering the common requirements for updating sets:
> a) update values can be non-constants such as "reduce price of all products
> in ski-wear dept by 10%".
> b) the criteria to define the set can be most usefully expressed as a query
> rather than mandating a single term e.g. "set published:false on all docs in
> last week's date range"
>
> That feels like too much functionality to consider adding right now but I can
> see a much more basic solution is possible which supports single and simple
> set based updates.

I must be missing something :)

a) We're not a freaking database, why the constant attempts to compare ourselves to it / mimic some functionality?
b) The criteria to define the set of deleted documents can already be expressed as a query - IndexWriter.deleteDocuments(query).

So what I am offering is to preserve the way to point at the docs we want to see deleted, and to allow partial modifications on them. Thus we add new and exciting functionality, while introducing zero new concepts. Profit?

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
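The proposal above (select documents with a query, then apply partial modifications) can be illustrated with a toy function. This is a Python sketch under stated assumptions: update_documents is a hypothetical name echoing the IndexWriter.updateDocuments(query, changedFields) idea from the thread, which is a proposal and not an existing API, and a plain predicate stands in for a Query.

```python
def update_documents(docs, matches, changed_fields):
    """Apply changed_fields to every doc the query predicate selects."""
    count = 0
    for doc in docs:
        if matches(doc):
            doc.update(changed_fields)
            count += 1
    return count


docs = [
    {"id": "1", "dept": "ski-wear", "published": "true"},
    {"id": "2", "dept": "ski-wear", "published": "true"},
    {"id": "3", "dept": "shoes", "published": "true"},
]

# Single-doc update: the "query" happens to match one unique key.
one = update_documents(docs, lambda d: d["id"] == "3", {"published": "false"})
# Bulk update: the same call with a broader query.
many = update_documents(docs, lambda d: d["dept"] == "ski-wear", {"published": "false"})
print(one, many)  # 1 2
```

Whether the query matches one document or many is up to the caller, which is the "zero new concepts" argument in a nutshell.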
Re: Incremental Field Updates
> Who ever said that some_condition should point to a unique document?

My assumption was, for now, we were still talking about the simpler case of updating a single document.

If we extend the discussion to support set-based updates it's worth considering the common requirements for updating sets:
a) update values can be non-constants such as "reduce price of all products in ski-wear dept by 10%".
b) the criteria to define the set can be most usefully expressed as a query rather than mandating a single term e.g. "set published:false on all docs in last week's date range"

That feels like too much functionality to consider adding right now but I can see a much more basic solution is possible which supports single and simple set based updates.

----- Original Message -----
From: Earwin Burrfoot
To: java-dev@lucene.apache.org
Sent: Mon, 29 March, 2010 11:05:39
Subject: Re: Incremental Field Updates

>> Variant d) sounds most logical? And enables all sorts of fun stuff.
>
> So the duplicate-key docs can have different values for initial-insert fields
> but partial updates will cause sharing of a common field value?
> And subsequent same-key doc inserts do or don't share these previous
> "partial-update" values?
>
> Sounds like a complex model for users to understand let alone code support
> for.
> Everyone gets primary keys though.

What you say IS complex. Sharing? Bleargh.

But everyone digs "update qweqwe set field=value where some_condition". Who ever said that some_condition should point to a unique document? It could, if you wish it so. Or you can do bulk updates if that's what you need. Very flexible and no need to introduce any new concepts.
Re: Incremental Field Updates
I agree this is a long overdue feature... we need to get it into Lucene somehow.

I like the Layers analogy... I think that will work well with Lucene's transactional semantics, ie a prior commit point would continue to see the index before the updates but new commit points would see the updates.

I think we would somehow want the new postings "layer" to be written so that it can be merged cleanly under Docs/PositionsEnum? So that searching is unaffected -- ie the scorers just see a normal postings enum. FieldCache would also just populate normally.

But somehow these partial docs would have to not "count" as real docIDs... and the normal merging of segments would coalesce these updates...

Also: how would we handle stored fields & term vectors?

Mike

On Sat, Mar 27, 2010 at 7:25 AM, Grant Ingersoll wrote:
> First off, this is something I've had in my head for a long time, but don't have any code.
>
> As many of you know, one of the main things that vexes any search engine based on an inverted index is how to do fast updates of just one field w/o having to delete and re-add the whole document like we do today. When I think about the whole update problem, I keep coming back to the notion of Photoshop (or any other real photo editing solution) Layers. In a photo editing solution, when you want to hide/change a piece of a photo, it is considered best practice to add a layer over that part of the photo to be changed. This way, the original photo is maintained and you don't have to worry about accidentally damaging the area you aren't interested in. Thus, a layer is essentially a mask on the original photo. The analogy isn't quite the same here, but nevertheless...
>
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now. Then, when an update comes in on a particular field, we somehow mark in the index that the document in question is modified and then we add the new change onto the end of the index (just like we currently do when adding new docs, but this time it's just a doc w/ a single field). Then, when searching, we would, when scoring the affected documents, go to a secondary process that knew where to look up the incremental changes. As background merging takes place, these "disjoint" documents would be merged back together. We'd maybe even consider a "high update" merge scheduler that could more frequently handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes. That is, is it something that goes in the posting list, or is it a side structure? I think putting it in the posting list would be too slow. Also, perhaps it is worthwhile for people to indicate that a particular field is expected to be updated while others maintain their current format so as not to incur the penalty on each.
>
> In a sense, the old field for that document is masked by the new field. I think, given proper index structure, that we maybe could make that marking of the old field fast (maybe it's a pointer to the new field, maybe it's just a bit indicating to go look in the "update" segment)
>
> On the search side, I think performance would still be maintained b/c even in high update envs. you aren't usually talking about more than a few thousand changes in a minute or two and the background merger would be responsible for keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it will spawn some questions/ideas that will help us work it out in a better way.
>
> At any rate, I think adding incr. field update capability would be a huge win for Lucene.
>
> -Grant
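
A minimal sketch of the commit-point behaviour described in this message (a prior commit keeps seeing the old values, a newer commit sees the update layers). It is plain Python for illustration only, not Lucene code: `LayeredField`, `update`, and `value_at` are hypothetical names, and stamping each layer with a commit generation is an assumption about how visibility might be tracked.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredField:
    """Hypothetical model: write-once base values plus append-only update layers."""
    base: dict                                   # doc_id -> value at the first commit
    layers: list = field(default_factory=list)   # [(commit_gen, {doc_id: value})], oldest first

    def update(self, commit_gen, doc_id, value):
        # Never rewrite the base or an older layer; only append a new layer.
        self.layers.append((commit_gen, {doc_id: value}))

    def value_at(self, commit_gen, doc_id):
        # A reader opened on commit_gen sees only layers written at or before it;
        # the newest visible layer masks everything underneath.
        for gen, changes in reversed(self.layers):
            if gen <= commit_gen and doc_id in changes:
                return changes[doc_id]
        return self.base.get(doc_id)


colour = LayeredField(base={0: "red", 1: "blue"})
colour.update(commit_gen=1, doc_id=0, value="green")
colour.update(commit_gen=2, doc_id=0, value="black")

old_view = colour.value_at(0, 0)   # reader on the original commit
new_view = colour.value_at(2, 0)   # reader on the latest commit
```

Because nothing is overwritten, a reader holding an older commit generation is unaffected by later updates, which is the property that would keep rollback possible.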
Re: Incremental Field Updates
>>Variant d) sounds most logical? And enables all sorts of fun stuff.
>
> So the duplicate-key docs can have different values for initial-insert fields but partial updates will cause sharing of a common field value?
> And subsequent same-key doc inserts do or don't share these previous "partial-update" values?
>
> Sounds like a complex model for users to understand let alone code support for.
> Everyone gets primary keys though.

What you say IS complex. Sharing? Bleargh.
But everyone digs "update qweqwe set field=value where some_condition".
Who ever said that some_condition should point to a unique document?
It could, if you wish it so. Or you can do bulk updates if that's what you need.
Very flexible and no need to introduce any new concepts.

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
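
The "update ... set field=value where some_condition" semantics can be sketched as follows. This is a toy in-memory model, not a proposal for a Lucene API: `update_where` and the dict-based documents are hypothetical, and the condition is an arbitrary predicate rather than a Lucene query.

```python
def update_where(docs, condition, field, value):
    """Set field=value on every document matching condition (variant d).

    Returns the number of documents updated. Nothing here requires the
    condition to identify a single document.
    """
    updated = 0
    for doc in docs:
        if condition(doc):
            doc[field] = value
            updated += 1
    return updated


docs = [
    {"key": "a", "status": "draft"},
    {"key": "a", "status": "draft"},   # duplicate key: allowed, both get updated
    {"key": "b", "status": "draft"},
]
count = update_where(docs, lambda d: d["key"] == "a", "status", "live")
```

If the condition happens to match exactly one document, this degenerates into a primary-key update; otherwise it is a bulk update.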
Re: Incremental Field Updates
>Variant d) sounds most logical? And enables all sorts of fun stuff.

So the duplicate-key docs can have different values for initial-insert fields but partial updates will cause sharing of a common field value?
And subsequent same-key doc inserts do or don't share these previous "partial-update" values?

Sounds like a complex model for users to understand let alone code support for.
Everyone gets primary keys though.

----- Original Message -----
From: Earwin Burrfoot
To: java-dev@lucene.apache.org
Sent: Mon, 29 March, 2010 10:14:24
Subject: Re: Incremental Field Updates

>>If someone needs this, it can be built over lucene, without
>>introducing it as a core feature and needlessly complicating things.
>
> I think with any partial-update feature the *absence* of primary key support would "needlessly complicate things":
> If Lucene is not capable of performing duplicate detection on insert (because it has no notion of a primary key field), we need to be prepared for the situation where we have duplicate-key docs in the index.
> What then happens when Grant wants to do a "partial update" as opposed to the existing full-update semantics which first deletes all documents containing the supplied term (always a form of primary key)?
> Which document instance gets "partially updated"? We either:
> a) throw a "duplicate" error (which ideally should have happened back at dup insert time)
> b) Choose one of the documents to "partially update" and keep the duplicate(s)
> c) Choose one of the documents to "partially update" and delete the duplicate(s)
> d) "Partially update" all of the duplicate(s)
> All less than ideal.

Variant d) sounds most logical? And enables all sorts of fun stuff.

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
Re: Incremental Field Updates
>>If someone needs this, it can be built over lucene, without
>>introducing it as a core feature and needlessly complicating things.
>
> I think with any partial-update feature the *absence* of primary key support would "needlessly complicate things":
> If Lucene is not capable of performing duplicate detection on insert (because it has no notion of a primary key field), we need to be prepared for the situation where we have duplicate-key docs in the index.
> What then happens when Grant wants to do a "partial update" as opposed to the existing full-update semantics which first deletes all documents containing the supplied term (always a form of primary key)?
> Which document instance gets "partially updated"? We either:
> a) throw a "duplicate" error (which ideally should have happened back at dup insert time)
> b) Choose one of the documents to "partially update" and keep the duplicate(s)
> c) Choose one of the documents to "partially update" and delete the duplicate(s)
> d) "Partially update" all of the duplicate(s)
> All less than ideal.

Variant d) sounds most logical? And enables all sorts of fun stuff.

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
Re: Incremental Field Updates
>I can delete by lucene-generated docId.

Which users used to have to find by first coding a primary-key-term search.
Delete by term removed this step to make life easier.

>If someone needs this, it can be built over lucene, without
>introducing it as a core feature and needlessly complicating things.

I think with any partial-update feature the *absence* of primary key support would "needlessly complicate things":
If Lucene is not capable of performing duplicate detection on insert (because it has no notion of a primary key field), we need to be prepared for the situation where we have duplicate-key docs in the index.
What then happens when Grant wants to do a "partial update" as opposed to the existing full-update semantics which first deletes all documents containing the supplied term (always a form of primary key)?
Which document instance gets "partially updated"? We either:
a) throw a "duplicate" error (which ideally should have happened back at dup insert time)
b) Choose one of the documents to "partially update" and keep the duplicate(s)
c) Choose one of the documents to "partially update" and delete the duplicate(s)
d) "Partially update" all of the duplicate(s)
All less than ideal.

I know we are schema-averse with Lucene (and I value that) but surely any partial update feature has to start with a strongly maintained notion of document identity as a foundation?
Rather than "needless complexity" I'd argue this was "needed rigour" and actually simplifies the user's job if Lucene can do the duplicate-key-on-insert check automatically rather than relying on ropy application code and dealing with any failures in that.
Of course primary keys are not mandatory. You only use them when you need this behaviour - just like in SQL.

Cheers
Mark
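
A rough sketch of the optional duplicate-key-on-insert check argued for above. Everything in it is hypothetical (`KeyedIndex`, `DuplicateKeyError`, `partial_update`); it models the idea with an in-memory dict and says nothing about how Lucene would implement the lookup.

```python
class DuplicateKeyError(Exception):
    """Raised at insert time, i.e. variant a) moved to where the duplicate is created."""


class KeyedIndex:
    def __init__(self, key_field=None):
        self.key_field = key_field   # None: no primary key, duplicates are accepted
        self.docs = []
        self._by_key = {}            # key value -> position in self.docs

    def add(self, doc):
        if self.key_field is not None:
            key = doc[self.key_field]
            if key in self._by_key:
                raise DuplicateKeyError(key)
            self._by_key[key] = len(self.docs)
        self.docs.append(doc)

    def partial_update(self, key, field, value):
        # With an enforced key there is exactly one target document,
        # so the choice between variants a) to d) never arises.
        if self.key_field is None:
            raise ValueError("partial_update needs a primary key field")
        self.docs[self._by_key[key]][field] = value


index = KeyedIndex(key_field="id")
index.add({"id": "doc-1", "votes": 0})
index.partial_update("doc-1", "votes", 5)
```

The key is opt-in: constructing `KeyedIndex()` without a `key_field` keeps the current behaviour, where duplicates are simply appended.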
Re: Incremental Field Updates
> Of course introducing the idea of updates also introduces the notion of a
> primary key and there's probably an entirely separate discussion to be had
> around user-supplied vs Lucene-generated keys.
Not sure I see that need. Can you explain your reasoning a bit more?
>>> If you want to update a document you need a way of expressing *which*
>>> document you are updating.
>> This already works somehow for 'deleting' documents?
> Yes, the convention being user-supplied keys.

I can delete by lucene-generated docId. It's too volatile to be database-style PK, but nonetheless.

> The question posed is if we add another use case where keys are required do we want to turn this existing informal convention
> into more formalized support the way databases do eg duplicate key checks on insert, auto-inc primary key generators.

If someone needs this, it can be built over lucene, without introducing it as a core feature and needlessly complicating things.

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
Re: Incremental Field Updates
On 29 Mar 2010, at 07:45, Earwin Burrfoot wrote:

>>>> Of course introducing the idea of updates also introduces the notion of a primary key and there's probably an entirely separate discussion to be had around user-supplied vs Lucene-generated keys.
>>> Not sure I see that need. Can you explain your reasoning a bit more?
>> If you want to update a document you need a way of expressing *which* document you are updating.
> This already works somehow for 'deleting' documents?

Yes, the convention being user-supplied keys.
The question posed is if we add another use case where keys are required do we want to turn this existing informal convention into more formalized support the way databases do eg duplicate key checks on insert, auto-inc primary key generators.

> --
> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
Re: Incremental Field Updates
>>> Of course introducing the idea of updates also introduces the notion of a
>>> primary key and there's probably an entirely separate discussion to be had
>>> around user-supplied vs Lucene-generated keys.
>> Not sure I see that need. Can you explain your reasoning a bit more?
> If you want to update a document you need a way of expressing *which*
> document you are updating.

This already works somehow for 'deleting' documents?

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
Re: Incremental Field Updates
>> Of course introducing the idea of updates also introduces the notion of a primary key and there's probably an entirely separate discussion to be had around user-supplied vs Lucene-generated keys.
> Not sure I see that need. Can you explain your reasoning a bit more?

If you want to update a document you need a way of expressing *which* document you are updating.

Cheers
Mark
Re: Incremental Field Updates
On Mar 27, 2010, at 11:14 AM, Mark Harwood wrote:

> Of course introducing the idea of updates also introduces the notion of a primary key and there's probably an entirely separate discussion to be had around user-supplied vs Lucene-generated keys.

Not sure I see that need. Can you explain your reasoning a bit more?

> That aside, the biggest concern for me here is the impact that this is likely to have on search - currently queries such as "a:1 AND b:2" are streamed efficiently when evaluated because fields a and b have long postings lists conveniently sorted in doc-id insertion order that can be walked in sequence. If there are to be disjoint, partial docs, with updated contents arriving out-of-primary-key-order this is bound to introduce costly disk seeks to the query process or require commit-time merges/sorts to preserve the doc-ordered posting lists needed to maintain search speed. Both of these strategies come at a reasonable cost. Of course some form of RAM-based value caching (allowing us to randomly look up the latest value for field b in doc x) is fast but probably only suited to small-scale deployments.

Indeed, part of me thinks this is especially suited for flex indexing, where I can make a design-time decision to accept potentially slower search in exchange for a high update rate.

> It's probably worth thinking through the scenarios we want to cater for. Maybe a Digg-like scenario with users voting on document popularity *can* be catered for with RAM-based field caches because the data (count of votes) is small enough to cache?

Agreed. Many social applications require updating one or two fields very frequently (popularity, ratings, votes, etc.)

> Cheers,
> Mark
>
> On 27 Mar 2010, at 11:25, Grant Ingersoll wrote:
>
>> First off, this is something I've had in my head for a long time, but don't have any code.
>>
>> As many of you know, one of the main things that vexes any search engine based on an inverted index is how to do fast updates of just one field w/o having to delete and re-add the whole document like we do today. When I think about the whole update problem, I keep coming back to the notion of Photoshop (or any other real photo editing solution) Layers. In a photo editing solution, when you want to hide/change a piece of a photo, it is considered best practice to add a layer over that part of the photo to be changed. This way, the original photo is maintained and you don't have to worry about accidentally damaging the area you aren't interested in. Thus, a layer is essentially a mask on the original photo. The analogy isn't quite the same here, but nevertheless...
>>
>> So, thinking out loud here and I'm not sure on the best wording of this:
>>
>> When a document first comes in, it is all in one place, just as it is now. Then, when an update comes in on a particular field, we somehow mark in the index that the document in question is modified and then we add the new change onto the end of the index (just like we currently do when adding new docs, but this time it's just a doc w/ a single field). Then, when searching, we would, when scoring the affected documents, go to a secondary process that knew where to look up the incremental changes. As background merging takes place, these "disjoint" documents would be merged back together. We'd maybe even consider a "high update" merge scheduler that could more frequently handle these incremental merges.
>>
>> I'm not sure where we would maintain the list of changes. That is, is it something that goes in the posting list, or is it a side structure? I think putting it in the posting list would be too slow. Also, perhaps it is worthwhile for people to indicate that a particular field is expected to be updated while others maintain their current format so as not to incur the penalty on each.
>>
>> In a sense, the old field for that document is masked by the new field. I think, given proper index structure, that we maybe could make that marking of the old field fast (maybe it's a pointer to the new field, maybe it's just a bit indicating to go look in the "update" segment)
>>
>> On the search side, I think performance would still be maintained b/c even in high update envs. you aren't usually talking about more than a few thousand changes in a minute or two and the background merger would be responsible for keeping the total number of disjoint documents low.
>>
>> I realize there isn't a whole lot to go on here just yet, but perhaps it will spawn some questions/ideas that will help us work it out in a better way.
>>
>> At any rate, I think adding incr. field update capability would be a huge win for Lucene.
>>
>> -Grant
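
The Digg-like vote-count case might look roughly like this if the counts live in RAM beside the index. This is only an illustration: `VoteCache` is a hypothetical name, the one-int-per-docID array is an assumption, and the `log1p` boost is an arbitrary choice rather than anything Lucene does.

```python
import math
from array import array


class VoteCache:
    """Hypothetical RAM-resident vote counts, one small integer per docID."""

    def __init__(self, max_doc):
        self.votes = array("i", [0]) * max_doc   # updated in place, no reindexing

    def vote(self, doc_id, delta=1):
        self.votes[doc_id] += delta

    def rerank(self, hits):
        """hits: list of (doc_id, text_score); returns them ordered by boosted score."""
        def boosted(hit):
            doc_id, text_score = hit
            # Assumed boost: dampen the vote count so text relevance still matters.
            return text_score * (1.0 + math.log1p(max(0, self.votes[doc_id])))
        return sorted(hits, key=boosted, reverse=True)


cache = VoteCache(max_doc=3)
hits = [(0, 1.0), (1, 0.9)]
for _ in range(10):
    cache.vote(1)
reranked = cache.rerank(hits)   # doc 1 now outranks doc 0
```

An update here is a single array write, which is why this approach is fast; the cost is that the array grows with the number of documents, which is the small-scale caveat raised in the quoted text.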
Re: Incremental Field Updates
Of course introducing the idea of updates also introduces the notion of a primary key and there's probably an entirely separate discussion to be had around user-supplied vs Lucene-generated keys.

That aside, the biggest concern for me here is the impact that this is likely to have on search - currently queries such as "a:1 AND b:2" are streamed efficiently when evaluated because fields a and b have long postings lists conveniently sorted in doc-id insertion order that can be walked in sequence.
If there are to be disjoint, partial docs, with updated contents arriving out-of-primary-key-order this is bound to introduce costly disk seeks to the query process or require commit-time merges/sorts to preserve the doc-ordered posting lists needed to maintain search speed.
Both of these strategies come at a reasonable cost. Of course some form of RAM-based value caching (allowing us to randomly look up the latest value for field b in doc x) is fast but probably only suited to small-scale deployments.

It's probably worth thinking through the scenarios we want to cater for. Maybe a Digg-like scenario with users voting on document popularity *can* be catered for with RAM-based field caches because the data (count of votes) is small enough to cache?

Cheers,
Mark

On 27 Mar 2010, at 11:25, Grant Ingersoll wrote:

> First off, this is something I've had in my head for a long time, but don't have any code.
>
> As many of you know, one of the main things that vexes any search engine based on an inverted index is how to do fast updates of just one field w/o having to delete and re-add the whole document like we do today. When I think about the whole update problem, I keep coming back to the notion of Photoshop (or any other real photo editing solution) Layers. In a photo editing solution, when you want to hide/change a piece of a photo, it is considered best practice to add a layer over that part of the photo to be changed. This way, the original photo is maintained and you don't have to worry about accidentally damaging the area you aren't interested in. Thus, a layer is essentially a mask on the original photo. The analogy isn't quite the same here, but nevertheless...
>
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now. Then, when an update comes in on a particular field, we somehow mark in the index that the document in question is modified and then we add the new change onto the end of the index (just like we currently do when adding new docs, but this time it's just a doc w/ a single field). Then, when searching, we would, when scoring the affected documents, go to a secondary process that knew where to look up the incremental changes. As background merging takes place, these "disjoint" documents would be merged back together. We'd maybe even consider a "high update" merge scheduler that could more frequently handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes. That is, is it something that goes in the posting list, or is it a side structure? I think putting it in the posting list would be too slow. Also, perhaps it is worthwhile for people to indicate that a particular field is expected to be updated while others maintain their current format so as not to incur the penalty on each.
>
> In a sense, the old field for that document is masked by the new field. I think, given proper index structure, that we maybe could make that marking of the old field fast (maybe it's a pointer to the new field, maybe it's just a bit indicating to go look in the "update" segment)
>
> On the search side, I think performance would still be maintained b/c even in high update envs. you aren't usually talking about more than a few thousand changes in a minute or two and the background merger would be responsible for keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it will spawn some questions/ideas that will help us work it out in a better way.
>
> At any rate, I think adding incr. field update capability would be a huge win for Lucene.
>
> -Grant
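
To make the concern about "a:1 AND b:2" concrete, here is a small sketch of how a conjunction can be streamed over two doc-id-sorted postings lists. It uses plain Python lists in place of on-disk postings, and `intersect` is a hypothetical helper; the skip-ahead via `bisect_left` stands in for whatever skipping the real postings format provides.

```python
from bisect import bisect_left


def intersect(postings_a, postings_b):
    """Return doc IDs present in both lists.

    Both inputs must be sorted ascending by doc ID. Each cursor only ever
    moves forward, which is what makes the walk sequential.
    """
    out = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        a, b = postings_a[i], postings_b[j]
        if a == b:
            out.append(a)
            i += 1
            j += 1
        elif a < b:
            # Skip list a forward to the first entry >= b.
            i = bisect_left(postings_a, b, i + 1)
        else:
            # Skip list b forward to the first entry >= a.
            j = bisect_left(postings_b, a, j + 1)
    return out


matches = intersect([1, 3, 5, 8, 13], [2, 3, 8, 9, 13, 21])
```

The algorithm is only correct when both lists are in doc-id order. Partial documents appended out of order would break that invariant unless they are re-sorted or merged before search, which is the cost being pointed out.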
Incremental Field Updates
First off, this is something I've had in my head for a long time, but don't have any code.

As many of you know, one of the main things that vexes any search engine based on an inverted index is how to do fast updates of just one field w/o having to delete and re-add the whole document like we do today. When I think about the whole update problem, I keep coming back to the notion of Photoshop (or any other real photo editing solution) Layers. In a photo editing solution, when you want to hide/change a piece of a photo, it is considered best practice to add a layer over that part of the photo to be changed. This way, the original photo is maintained and you don't have to worry about accidentally damaging the area you aren't interested in. Thus, a layer is essentially a mask on the original photo. The analogy isn't quite the same here, but nevertheless...

So, thinking out loud here and I'm not sure on the best wording of this:

When a document first comes in, it is all in one place, just as it is now. Then, when an update comes in on a particular field, we somehow mark in the index that the document in question is modified and then we add the new change onto the end of the index (just like we currently do when adding new docs, but this time it's just a doc w/ a single field). Then, when searching, we would, when scoring the affected documents, go to a secondary process that knew where to look up the incremental changes. As background merging takes place, these "disjoint" documents would be merged back together. We'd maybe even consider a "high update" merge scheduler that could more frequently handle these incremental merges.

I'm not sure where we would maintain the list of changes. That is, is it something that goes in the posting list, or is it a side structure? I think putting it in the posting list would be too slow. Also, perhaps it is worthwhile for people to indicate that a particular field is expected to be updated while others maintain their current format so as not to incur the penalty on each.

In a sense, the old field for that document is masked by the new field. I think, given proper index structure, that we maybe could make that marking of the old field fast (maybe it's a pointer to the new field, maybe it's just a bit indicating to go look in the "update" segment)

On the search side, I think performance would still be maintained b/c even in high update envs. you aren't usually talking about more than a few thousand changes in a minute or two and the background merger would be responsible for keeping the total number of disjoint documents low.

I realize there isn't a whole lot to go on here just yet, but perhaps it will spawn some questions/ideas that will help us work it out in a better way.

At any rate, I think adding incr. field update capability would be a huge win for Lucene.

-Grant
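
The mark-append-merge idea in this message could be modelled along these lines. It is a toy sketch under stated assumptions, not a design for Lucene: `Segment`, `update_field`, `get`, and `merge` are hypothetical names, the "bit" is a Python set, and the appended single-field docs are kept in a simple list.

```python
class Segment:
    """Hypothetical segment: write-once docs plus an append-only update log."""

    def __init__(self, docs):
        self.docs = docs         # doc_id -> {field: value}; never rewritten in place
        self.updated = set()     # the "bit": doc_ids that have pending changes
        self.update_log = []     # appended single-field docs: (doc_id, field, value)

    def update_field(self, doc_id, field, value):
        self.updated.add(doc_id)
        self.update_log.append((doc_id, field, value))

    def get(self, doc_id, field):
        # Only flagged docs pay for the secondary lookup; the newest entry masks older ones.
        if doc_id in self.updated:
            for d, f, v in reversed(self.update_log):
                if d == doc_id and f == field:
                    return v
        return self.docs[doc_id].get(field)

    def merge(self):
        """Background merge: fold pending updates into a fresh segment with an empty log."""
        merged = {d: dict(fields) for d, fields in self.docs.items()}
        for d, f, v in self.update_log:
            merged[d][f] = v
        return Segment(merged)


seg = Segment({0: {"title": "intro", "votes": 1}, 1: {"title": "faq", "votes": 4}})
seg.update_field(0, "votes", 2)
seg.update_field(0, "votes", 3)
compacted = seg.merge()
```

In this model the per-lookup cost grows with the length of the update log, which is why keeping the number of pending "disjoint" documents low through frequent merging matters.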