subject:"Incremental Field Updates"

Re: Incremental Field Updates

2010-04-08 Thread Babak Farhang

Good point. I meant the model at the document level: i.e. what
milestones does a document go through in its life cycle. Today:

created --> deleted

With incremental updates:

created --> update1 --> update2 --> deleted

I think what I'm trying to say is that this second threaded sequence
of state changes seems intuitively more fragile under concurrent
scenarios.  So for example, in a lock-free design, the system would
also have to anticipate the following sequence of events:

created --> update1 --> deleted --> update2

and treat update2 as a no-op.  I imagine there are other such cases
that I haven't thought of yet.
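
To make the out-of-order sequence above concrete, here is a minimal sketch
in Python. It is a toy model only, not Lucene code: the DocLifecycle class
and its methods are hypothetical names, and it assumes the simplest possible
policy, namely that an update arriving after the delete is silently dropped.

```python
from dataclasses import dataclass, field


@dataclass
class DocLifecycle:
    """Hypothetical per-document state: field values plus a deleted flag."""
    fields: dict = field(default_factory=dict)
    deleted: bool = False

    def update(self, changes: dict) -> bool:
        """Apply a field update; return False if it arrived after a delete."""
        if self.deleted:
            return False  # late update: treated as a no-op
        self.fields.update(changes)
        return True

    def delete(self) -> None:
        self.deleted = True


# created --> update1 --> deleted --> update2
doc = DocLifecycle({"title": "v0"})
applied = [doc.update({"title": "v1"})]      # update1 is applied
doc.delete()                                 # deleted
applied.append(doc.update({"title": "v2"}))  # update2 arrives late, ignored
```

A real lock-free design would also need to decide what "late" means across
threads; the sketch only shows the state check itself.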

-Babak


On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless wrote:
> write once, plus the option to the app to keep multiple commit points
> around (by customizing the deletion policy).
>
> Actually order of operations / commits very much matters in Lucene today.
>
> Deletions are not idempotent: if you add a doc with term X, delete by
> term X, then add a new doc with term X, the result is very different
> than if you moved the delete op to the end.  I.e., the deletion only
> applies to the docs added before it.
>
> Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Re: Incremental Field Updates

2010-04-06 Thread Michael McCandless

write once, plus the option to the app to keep multiple commit points
around (by customizing the deletion policy).

Actually, the order of operations / commits very much matters in Lucene
today.

Deletions are not idempotent: if you add a doc with term X, delete by
term X, then add a new doc with term X, the result is very different
than if you moved the delete op to the end.  I.e., the deletion only
applies to the docs added before it.
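
A small sketch of that ordering effect, as a toy model in Python rather than
real Lucene calls. The apply_ops helper and the (op, term) tuples are
hypothetical; the only behaviour assumed is the one described above, that a
delete-by-term affects only documents added before it.

```python
def apply_ops(ops):
    """Replay (op, term) pairs against a toy index; return surviving doc ids."""
    live = {}      # doc id -> term carried by that doc
    next_id = 0
    for op, term in ops:
        if op == "add":
            live[next_id] = term
            next_id += 1
        elif op == "delete":
            # A delete-by-term only hits docs that were added before it.
            live = {d: t for d, t in live.items() if t != term}
    return sorted(live)


# add X, delete X, add X: the second doc survives.
delete_in_middle = apply_ops([("add", "X"), ("delete", "X"), ("add", "X")])
# add X, add X, delete X: nothing survives.
delete_at_end = apply_ops([("add", "X"), ("add", "X"), ("delete", "X")])
```

The same three operations leave one document or none depending only on where
the delete sits.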

Mike

On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang  wrote:
> Sure. Because of the write-once principle.  But at some cost
> (duplicated data). I was just agreeing that it would not be a good
> idea to bake in versioning by keeping the layers around forever in a
> merged index; I wasn't keying in on transactions per se.
>
> Speaking of transactions: I'm not sure if we should worry about this
> much yet, but with "updates" the order of the transaction commits
> seems important. I think commit order is less important today in
> Lucene because its model supports only 2 types of events: document
> creation, which only happens once, and document deletion, which is
> idempotent.  What do you think? Will commits have to be ordered if we
> introduce updates?  Or does the onus of maintaining order fall on the
> application?
>
> -Babak

Re: Incremental Field Updates

2010-04-04 Thread Babak Farhang

Sure. Because of the write-once principle.  But at some cost
(duplicated data). I was just agreeing that it would not be a good
idea to bake in versioning by keeping the layers around forever in a
merged index; I wasn't keying in on transactions per se.

Speaking of transactions: I'm not sure if we should worry about this
much yet, but with "updates" the order of the transaction commits
seems important. I think commit order is less important today in
Lucene because its model supports only 2 types of events: document
creation, which only happens once, and document deletion, which is
idempotent.  What do you think? Will commits have to be ordered if we
introduce updates?  Or does the onus of maintaining order fall on the
application?

-Babak

On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless wrote:
> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang  wrote:
>>> I think they get merged in by the merger, ideally in the background.
>>
>> That sounds sensible. (In other words, we won't concern ourselves with
>> rollbacks--something possible while a "layer" is still around.)
>
> Actually, rollbacks would still be very possible even if layers are merged.
>
> I.e., one could keep multiple commits around, and the older commits
> would still be referring to the old postings + layers, keeping them
> alive.
>
> Lucene would still be transactional with such an approach.
>
> Mike

Re: Incremental Field Updates

2010-04-03 Thread Michael McCandless

On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang  wrote:
>> I think they get merged in by the merger, ideally in the background.
>
> That sounds sensible. (In other words, we won't concern ourselves with
> rollbacks--something possible while a "layer" is still around.)

Actually, rollbacks would still be very possible even if layers are merged.

I.e., one could keep multiple commits around, and the older commits
would still be referring to the old postings + layers, keeping them
alive.

Lucene would still be transactional with such an approach.
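
A rough sketch of why older commits keep the old postings and layers alive,
written as a toy Python model. ToyIndexFiles, the commit names and the file
names are all made up for illustration; the only assumption is that files are
write-once and stay on disk while any retained commit point references them.

```python
class ToyIndexFiles:
    """Toy model: each commit point references a set of immutable files."""

    def __init__(self):
        self.commits = {}  # commit name -> frozenset of file names

    def commit(self, name, files):
        self.commits[name] = frozenset(files)

    def drop_commit(self, name):
        del self.commits[name]

    def live_files(self):
        """Files still referenced by at least one retained commit."""
        live = set()
        for files in self.commits.values():
            live |= files
        return live


index = ToyIndexFiles()
index.commit("commit_1", {"seg0.postings", "seg0.layer1"})  # before the merge
index.commit("commit_2", {"seg1.merged"})                   # after the merge
before = index.live_files()     # old postings + layer are still reachable
index.drop_commit("commit_1")   # the deletion policy discards the old commit
after = index.live_files()      # only the merged segment remains
```

Rolling back in this model just means reopening commit_1 while it is still
retained.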

Mike


Re: Incremental Field Updates

2010-04-02 Thread Babak Farhang

Grant,

Reading your post a 3rd time, I see my "suggestion" is in fact the
approach you describe.  Sorry for being redundant.

-Babak


Re: Incremental Field Updates

2010-04-02 Thread Babak Farhang

> I think they get merged in by the merger, ideally in the background.

That sounds sensible. (In other words, we won't concern ourselves with
rollbacks--something possible while a "layer" is still around.)

I've been thinking about this problem also. One approach discussed
earlier on these mailing lists has been to somehow maintain a parallel
index of the updatable fields in such a way that the docIds
of the parallel index remain in sync with the "master" index. Mike
McCandless and I were discussing some variants of this approach a few
months back:
http://markmail.org/message/uifz5v37k6qxxhvz?q=%22incremental+document+field+update%22+site:markmail%2Eorg&page=1&refer=ipebtbf24y7rleps
That approach involved the concept of mapping (chaining, if you will)
internal docIds to view ids.  That docId mapping concept sounds
analogous to this layer concept we are discussing now.

I now think the parallel index approach may not be such a great idea
after all: it simply pushes the problem to the edge--the slave index.
If we can solve the update problem in the slave index, I reason, then
shouldn't we also be able to solve the same update problem in the
master index (and thereby remove the necessity of maintaining a
(user-level) parallel index in the first place)?

That seems to align with the approach being discussed here.

I imagine the "layers" being discussed here are somehow threaded by
docId. That is, given a docId, you can quickly find its "layers."  If
so, then the docId mapping idea may be one way to thread these layers.
(A logical document would be constructed by a chain of docIds, each
overriding the previous for each field it defines (or deletes).  Such
a construction would have to be "merge-aware" (perhaps using machinery
similar to that used in LUCENE-1879) in order that it may maintain the
docId chain.)
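
To illustrate the docId chain idea, a minimal sketch in Python follows. It is
only one possible reading of the paragraph above: resolve, the _DELETED
sentinel and the example docIds are hypothetical, and the chain is assumed to
be given oldest first.

```python
_DELETED = object()  # sentinel marking a field removed by a layer


def resolve(chain, segments):
    """Build the logical document for a chain of docIds, oldest first."""
    logical = {}
    for doc_id in chain:
        for name, value in segments[doc_id].items():
            if value is _DELETED:
                logical.pop(name, None)   # a later layer deletes the field
            else:
                logical[name] = value     # a later layer overrides the field
    return logical


# Hypothetical stored (partial) documents, keyed by internal docId.
segments = {
    3: {"title": "Lucene", "price": "10", "tag": "old"},  # original doc
    17: {"price": "12"},                                   # layer 1
    42: {"tag": _DELETED},                                 # layer 2
}
logical_doc = resolve([3, 17, 42], segments)
```

Keeping the chain valid across segment merges (when docIds are renumbered) is
the "merge-aware" part, which this sketch does not attempt.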

What do you think?

On Fri, Apr 2, 2010 at 4:56 AM, Grant Ingersoll  wrote:
>
> On Apr 2, 2010, at 2:50 AM, Babak Farhang wrote:
>
>> [Late to this party, but thought I'd chime in]
>>
>> I think this "layer" concept is right on.  But I'm wondering about the
>> life cycle of these layers.  Do layers live forever? Or do they
>> collapse at some point? (As I think was already pointed out, deletes
>> collapse this way when segments are merged today.)
>
> I think they get merged in by the merger, ideally in the background.

Re: Incremental Field Updates

2010-04-02 Thread Grant Ingersoll


On Apr 2, 2010, at 2:50 AM, Babak Farhang wrote:

> [Late to this party, but thought I'd chime in]
> 
> I think this "layer" concept is right on.  But I'm wondering about the
> life cycle of these layers.  Do layers live forever? Or do they
> collapse at some point? (As I think was already pointed out, deletes
> collapse this way when segments are merged today.)

I think they get merged in by the merger, ideally in the background.


Re: Incremental Field Updates

2010-04-01 Thread Babak Farhang

[Late to this party, but thought I'd chime in]

I think this "layer" concept is right on.  But I'm wondering about the
life cycle of these layers.  Do layers live forever? Or do they
collapse at some point? (As I think was already pointed out, deletes
collapse this way when segments are merged today.)

-Babak

On Sat, Mar 27, 2010 at 5:25 AM, Grant Ingersoll  wrote:
> First off, this is something I've had in my head for a long time, but don't
> have any code.
> As many of you know, one of the main things that vexes any search engine
> based on an inverted index is how to do fast updates of just one field w/o
> having to delete and re-add the whole document like we do today.   When I
> think about the whole update problem, I keep coming back to the notion of
> Photoshop (or any other real photo editing solution) Layers.  In a photo
> editing solution, when you want to hide/change a piece of a photo, it is
> considered best practice to add a layer over that part of the photo to be
> changed.  This way, the original photo is maintained and you don't have to
> worry about accidentally damaging the area you aren't interested in.  Thus,
> a layer is essentially a mask on the original photo. The analogy isn't quite
> the same here, but nevertheless...
>
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now.
> Then, when an update comes in on a particular field, we somehow mark in the
> index that the document in question is modified and then we add the new
> change onto the end of the index (just like we currently do when adding new
> docs, but this time it's just a doc w/ a single field). Then, when
> searching, we would, when scoring the affected documents, go to a secondary
> process that knew where to look up the incremental changes. As background
> merging takes place, these "disjoint" documents would be merged back
> together. We'd maybe even consider a "high update" merge scheduler that
> could more frequently handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes.  That is, is it
> something that goes in the posting list, or is it a side structure?  I think
> in the posting list would be too slow.  Also, perhaps it is worthwhile for
> people to indicate that a particular field is expected to be updated while
> others maintain their current format so as not to incur the penalty on each.
>
>  In a sense, the old field for that document is masked by the new field. I
> think, given proper index structure, that we maybe could make that marking
> of the old field fast (maybe it's a pointer to the new field, maybe it's
> just a bit indicating to go look in the "update" segment)
>
> On the search side, I think performance would still be maintained b/c even
> in high update envs. you aren't usually talking about more than a few
> thousand changes in a minute or two and the background merger would be
> responsible for keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it
> will spawn some questions/ideas that will help us work it out in a better
> way.
> At any rate, I think adding incr. field update capability would be a huge
> win for Lucene.
> -Grant
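
The flow in Grant's proposal quoted above (append a single-field doc, record
it in a side structure, consult that structure at read time, fold everything
back together on merge) can be sketched roughly as below. This is a toy
Python model under those assumptions; LayeredIndex and its methods are
hypothetical names, and it says nothing about how postings would be stored.

```python
class LayeredIndex:
    """Toy model of append-only field updates with a side structure."""

    def __init__(self):
        self.docs = []      # appended documents (full or single-field)
        self.updates = {}   # side structure: base doc id -> update doc ids

    def add(self, fields):
        self.docs.append(dict(fields))
        return len(self.docs) - 1

    def update_field(self, doc_id, name, value):
        # The change is appended like a new doc holding just one field.
        update_id = self.add({name: value})
        self.updates.setdefault(doc_id, []).append(update_id)

    def read(self, doc_id):
        # Later updates mask the older value of the same field.
        view = dict(self.docs[doc_id])
        for update_id in self.updates.get(doc_id, []):
            view.update(self.docs[update_id])
        return view

    def merge(self):
        """Fold every update into its base doc and drop the partial docs."""
        partial = {u for ids in self.updates.values() for u in ids}
        self.docs = [self.read(d) for d in range(len(self.docs))
                     if d not in partial]
        self.updates = {}


index = LayeredIndex()
doc = index.add({"title": "ski jacket", "price": "100"})
index.update_field(doc, "price", "90")
before_merge = (len(index.docs), index.read(doc))
index.merge()
after_merge = (len(index.docs), index.read(0))
```

Before the merge the index holds two physical docs for one logical doc; after
it, one. Doc ids are renumbered by the merge, as they are in segment merges
today.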


Re: Incremental Field Updates

2010-03-30 Thread Grant Ingersoll


On Mar 29, 2010, at 10:11 AM, mark harwood wrote:

> >Of course, but what about the Lucene doc id doesn't provide that?
> 
> The question being how you determine the correct doc id to use in the first 
> place (especially when they are known to be volatile) - the current answer is 
> to use a stable identifier term which your app holds in the index, AKA a 
> primary key. 
> To support single-doc updates, app developers currently have to:
> a) allocate keys uniquely
> b) ensure they do not store >1 document with the same key.
> 
> My suggestion was that, these being fundamental requirements for supporting 
> updates, Lucene could, as a convenience, provide some support for this in its 
> API - in the same way a database typically does.

I don't think Lucene needs a primary key.  I don't see why this number can't be 
determined in the usual ways.

> 
> Earwin has perhaps extended your (and my) original thinking to incorporate 
> set-based updates (a single set of values applied to many documents which 
> match a query).
> His proposal (correct me if I'm wrong, Earwin) is that single and set-based 
> changes could both be supported by a single 
> IndexWriter.updateDocuments(query, changedFields) type method.
> The benefit of this scheme is that we are providing a simple method, re-using 
> established concepts (Queries for document selection) but this does not 
> change the fact that many users will still need to use primary keys for 
> single-doc updates and they have to assume responsibility for a) and b) above.

Hmmm, this sounds like the Parallel Incremental Indexing that Busch has put up
in a patch.

> 
> On reflection, I guess these responsibilities are not too tough.
> a) is catered for by the fact that Lucene is not typically the master data 
> store (yet!) and filesystem/webserver/database datasources where document 
> content is sourced  usually have the responsibility to allocate some form of 
> unique identifier in the form of URLs, database keys or filenames which can 
> be used. Also, b) is not too hard to handle in app code if you always use the 
> IndexWriter.updateDocument(term,doc) method for inserts.
> 
> 
> Cheers,
> Mark

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search

Re: Incremental Field Updates

2010-03-29 Thread mark harwood

>Of course, but what about the Lucene doc id doesn't provide that?

The question being how you determine the correct doc id to use in the first 
place (especially when they are known to be volatile) - the current answer is to 
use a stable identifier term which your app holds in the index, AKA a primary 
key. 
To support single-doc updates, app developers currently have to:
a) allocate keys uniquely
b) ensure they do not store >1 document with the same key.

My suggestion was that, these being fundamental requirements for supporting 
updates, Lucene could, as a convenience, provide some support for this in its 
API - in the same way a database typically does.

Earwin has perhaps extended your (and my) original thinking to incorporate 
set-based updates (a single set of values applied to many documents which match 
a query).
His proposal (correct me if I'm wrong, Earwin) is that single and set-based 
changes could both be supported by a single IndexWriter.updateDocuments(query, 
changedFields) type method.
The benefit of this scheme is that we are providing a simple method, re-using 
established concepts (Queries for document selection) but this does not change 
the fact that many users will still need to use primary keys for single-doc 
updates and they have to assume responsibility for a) and b) above.

On reflection, I guess these responsibilities are not too tough.
a) is catered for by the fact that Lucene is not typically the master data 
store (yet!), and the filesystem/webserver/database datasources from which 
document content is sourced usually have the responsibility to allocate some 
form of unique identifier (URLs, database keys or filenames) which can be 
used. Also, b) is not too hard to handle in app code if you always use the 
IndexWriter.updateDocument(term,doc) method for inserts.
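
As an illustration of point b), here is a toy Python model of why routing
every insert through an update-by-key call keeps keys unique. KeyedIndex is a
hypothetical stand-in, not the Lucene API; it assumes update-by-key behaves
as "delete every doc with this key, then add the new doc".

```python
class KeyedIndex:
    """Toy index where one field plays the role of a primary key."""

    def __init__(self, key_field="id"):
        self.key_field = key_field
        self.docs = []

    def add_document(self, doc):
        self.docs.append(dict(doc))   # plain add: no uniqueness check

    def update_document(self, key, doc):
        # Delete every doc carrying this key, then add the new one.
        self.docs = [d for d in self.docs if d.get(self.key_field) != key]
        self.docs.append(dict(doc))

    def count(self, key):
        return sum(1 for d in self.docs if d.get(self.key_field) == key)


plain = KeyedIndex()
plain.add_document({"id": "url-1", "body": "v1"})
plain.add_document({"id": "url-1", "body": "v2"})   # duplicate key slips in

keyed = KeyedIndex()
keyed.update_document("url-1", {"id": "url-1", "body": "v1"})
keyed.update_document("url-1", {"id": "url-1", "body": "v2"})
```

With plain adds the app ends up with two docs for one key; with
update-by-key there is always at most one.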

Cheers,
Mark


Re: AW: Incremental Field Updates

2010-03-29 Thread Andrzej Bialecki


On 2010-03-29 15:11, Uwe Goetzke wrote:
> They filed this as a patent, too:
> http://www.freepatentsonline.com/y2009/0228528.html

.. which is not granted yet, right? It's a patent application. Besides, 
I live in the EU ;)



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



AW: Incremental Field Updates

2010-03-29 Thread Uwe Goetzke

They filed this as a patent, too:
http://www.freepatentsonline.com/y2009/0228528.html

Regards

Uwe Goetzke

-----Original Message-----
From: Andrzej Bialecki [mailto:a...@getopt.org] 
Sent: Monday, 29 March 2010 14:50
To: java-dev@lucene.apache.org
Subject: Re: Incremental Field Updates

On 2010-03-29 12:26, Michael McCandless wrote:
> I agree this is a long overdue feature... we need to get it into
> Lucene somehow.
>
> I like the Layers analogy... I think that will work well with Lucene's
> transactional semantics, ie a prior commit point would continue to see
> the index before the updates but new commit points would see the
> updates.

I'm coming late to this discussion ... are you guys familiar with this 
paper? It seems to describe the same model of incremental field-level 
updates, and the algo operates on internal Lucene ids:

http://portal.acm.org/citation.cfm?id=1458171



Re: Incremental Field Updates

2010-03-29 Thread Andrzej Bialecki


On 2010-03-29 12:26, Michael McCandless wrote:
> I agree this is a long overdue feature... we need to get it into
> Lucene somehow.
>
> I like the Layers analogy... I think that will work well with Lucene's
> transactional semantics, ie a prior commit point would continue to see
> the index before the updates but new commit points would see the
> updates.


I'm coming late to this discussion ... are you guys familiar with this 
paper? It seems to describe the same model of incremental field-level 
updates, and the algo operates on internal Lucene ids:


http://portal.acm.org/citation.cfm?id=1458171


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Incremental Field Updates

2010-03-29 Thread Grant Ingersoll


On Mar 29, 2010, at 2:26 AM, Mark Harwood wrote:

>>> Of course introducing the idea of updates also introduces the notion of a 
>>> primary key and there's probably an entirely separate discussion to be had 
>>> around user-supplied vs Lucene-generated keys.
>> 
>> Not sure I see that need.  Can you explain your reasoning a bit more?
> 
> If you want to update a document you need a way of expressing *which* 
> document you are updating.

Of course, but what about the Lucene doc id doesn't provide that?

Re: Incremental Field Updates

2010-03-29 Thread Earwin Burrfoot

>>Who ever said that some_condition should point to a unique document?
>
> My assumption was, for now, we were still talking about the simpler case of 
> updating a single document.
> If we extend the discussion to support set-based updates it's worth 
> considering the common requirements for updating sets:
> a)  update values can be non-constants such as "reduce price of all products 
> in ski-wear dept by 10%".
> b)  the criteria to define the set can be most usefully expressed as a query 
> rather than mandating a single term e.g. "set published:false on all docs in 
> last week's date range"
>
> That feels like too much functionality to consider adding right now but I can 
> see a much more basic solution is possible which supports single and simple 
> set based updates.

I must be missing something :)
a) We're not a freaking database, why the constant attempts to compare
ourselves to it / mimic some functionality?
b) The criteria to define the set of deleted documents can already be
expressed as a query - IndexWriter.deleteDocuments(query).

So what I am offering is to preserve the way we point at the docs we
want to see deleted, and allow partial modifications on them as well.
Thus we add new and exciting functionality, while introducing zero new
concepts. Profit?
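
A small sketch of that symmetry, as a toy Python model. delete_documents
mirrors the existing select-by-query delete; update_documents is the
hypothetical counterpart being proposed, not an existing Lucene method. A
"query" here is assumed to be just a predicate over a doc.

```python
def delete_documents(docs, query):
    """Keep only the docs the query does not match."""
    return [d for d in docs if not query(d)]


def update_documents(docs, query, changed_fields):
    """Hypothetical counterpart: patch the docs the query matches."""
    return [{**d, **changed_fields} if query(d) else d for d in docs]


def is_ski_wear(doc):
    return doc["dept"] == "ski-wear"


docs = [
    {"id": 1, "dept": "ski-wear", "published": True},
    {"id": 2, "dept": "ski-wear", "published": True},
    {"id": 3, "dept": "books", "published": True},
]
# The same selector drives either a bulk delete or a bulk partial update.
updated = update_documents(docs, is_ski_wear, {"published": False})
remaining = delete_documents(docs, is_ski_wear)
```

Whether the query matches one doc or many is up to the caller, which is the
point being made: no primary-key concept is needed in the API itself.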

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: Incremental Field Updates

2010-03-29 Thread mark harwood



>Who ever said that some_condition should point to a unique document?

My assumption was, for now, we were still talking about the simpler case of 
updating a single document.
If we extend the discussion to support set-based updates it's worth considering 
the common requirements for updating sets:
a)  update values can be non-constants such as "reduce price of all products in 
ski-wear dept by 10%".
b)  the criteria to define the set can be most usefully expressed as a query 
rather than mandating a single term e.g. "set published:false on all docs in 
last week's date range"

That feels like too much functionality to consider adding right now but I can 
see a much more basic solution is possible which supports single and simple set 
based updates.






----- Original Message -----
From: Earwin Burrfoot 
To: java-dev@lucene.apache.org
Sent: Mon, 29 March, 2010 11:05:39
Subject: Re: Incremental Field Updates

>>Variant d) sounds most logical? And enables all sorts of fun stuff.
>
> So the duplicate-key docs can have different values for initial-insert fields 
> but partial updates will cause sharing of  a common field value?
> And subsequent same-key doc inserts do or don't share these previous 
> "partial-update" values?
>
> Sounds like a complex model for users to understand let alone code support 
> for.
> Everyone gets primary keys though.

What you say IS complex. Sharing? Bleargh.

But everyone digs "update qweqwe set field=value where some_condition".
Who ever said that some_condition should point to a unique document?
It could, if you wish it so. Or you can do bulk updates if that's what
you need. Very flexible and no need to introduce any new concepts.



Re: Incremental Field Updates

2010-03-29 Thread Michael McCandless

I agree this is a long overdue feature... we need to get it into
Lucene somehow.

I like the Layers analogy... I think that will work well with Lucene's
transactional semantics, i.e. a prior commit point would continue to see
the index before the updates but new commit points would see the
updates.

I think we would somehow want the newly written postings "layer" to be
cleanly merged under Docs/PositionsEnum?  So that searching is
unaffected -- i.e. the scorers just see a normal postings enum.
FieldCache would also just populate normally.  But somehow these
partial docs would have to not "count" as real docIDs... and the
normal merging of segments would coalesce these updates...
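
One way to picture "merged under the enum" is sketched below in Python. This
is a guess at the shape of the idea, not a design: merged_postings and its
three inputs are hypothetical, and it assumes the layer records matches
against the original docIDs, so partial docs never surface as docIDs of their
own.

```python
import heapq


def merged_postings(base, layer, masked):
    """Yield docIDs for one term in increasing order, as a scorer expects.

    base   -- sorted docIDs from the original postings for the term
    layer  -- sorted docIDs whose updated field value contains the term
    masked -- docIDs whose field was rewritten, so base entries are stale
    """
    surviving = (d for d in base if d not in masked)
    previous = None
    for doc_id in heapq.merge(surviving, layer):
        if doc_id != previous:   # guard against emitting a docID twice
            yield doc_id
        previous = doc_id


# Term "red": doc 4 was updated and no longer contains it,
# doc 5 was updated and now does.
result = list(merged_postings(base=[1, 4, 7, 9], layer=[5], masked={4, 5}))
```

The consumer sees a single sorted stream of docIDs and does not need to know
that a layer exists.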

Also: how would we handle stored fields & term vectors?

Mike

On Sat, Mar 27, 2010 at 7:25 AM, Grant Ingersoll  wrote:
> First off, this is something I've had in my head for a long time, but don't
> have any code.
> As many of you know, one of the main things that vexes any search engine
> based on an inverted index is how to do fast updates of just one field w/o
> having to delete and re-add the whole document like we do today.   When I
> think about the whole update problem, I keep coming back to the notion of
> Photoshop (or any other real photo editing solution) Layers.  In a photo
> editing solution, when you want to hide/change a piece of a photo, it is
> considered best practice to add a layer over that part of the photo to be
> changed.  This way, the original photo is maintained and you don't have to
> worry about accidentally damaging the area you aren't interested in.  Thus,
> a layer is essentially a mask on the original photo. The analogy isn't quite
> the same here, but nevertheless...
>
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now.
> Then, when an update comes in on a particular field, we somehow mark in the
> index that the document in question is modified and then we add the new
> change onto the end of the index (just like we currently do when adding new
> docs, but this time it's just a doc w/ a single field). Then, when
> searching, we would, when scoring the affected documents, go to a secondary
> process that knew where to look up the incremental changes. As background
> merging takes place, these "disjoint" documents would be merged back
> together. We'd maybe even consider a "high update" merge scheduler that
> could more frequently handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes.  That is, is it
> something that goes in the posting list, or is it a side structure.  I think
> in the posting list would be too slow.  Also, perhaps it is worthwhile for
> people to indicate that a particular field is expected to be updated while
> others maintain their current format so as not to incur the penalty on each.
>
>  In a sense, the old field for that document is masked by the new field. I
> think, given proper index structure, that we maybe could make that marking
> of the old field fast (maybe it's a pointer to the new field, maybe it's
> just a bit indicating to go look in the "update" segment)
>
> On the search side, I think performance would still be maintained b/c even
> in high update envs. you aren't usually talking about more than a few
> thousand changes in a minute or two and the background merger would be
> responsible for keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it
> will spawn some questions/ideas that will help us work it out in a better
> way.
> At any rate, I think adding incr. field update capability would be a huge
> win for Lucene.
> -Grant


Re: Incremental Field Updates

2010-03-29 Thread Earwin Burrfoot

>>Variant d) sounds most logical? And enables all sorts of fun stuff.
>
> So the duplicate-key docs can have different values for initial-insert fields 
> but partial updates will cause sharing of  a common field value?
> And subsequent same-key doc inserts do or don't share these previous 
> "partial-update" values?
>
> Sounds like a complex model for users to understand let alone code support 
> for.
> Everyone gets primary keys though.

What you say IS complex. Sharing? Bleargh.

But everyone digs "update qweqwe set field=value where some_condition".
Who ever said that some_condition should point to a unique document?
It could, if you wish it so. Or you can do bulk updates if that's what
you need. Very flexible and no need to introduce any new concepts.
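
A rough sketch of what such a condition-driven update could look like, over a toy in-memory store rather than a real index. Everything here (`ToyStore`, `update`, the map-based documents) is a made-up illustration and not Lucene API; the point is only that the condition may match zero, one or many documents, and uniqueness is left to the caller.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy in-memory store, purely illustrative; none of these names are Lucene API.
final class ToyStore {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) {
        docs.add(new HashMap<>(doc)); // defensive, mutable copy
    }

    // "update ... set field=value where condition": the condition may match
    // zero, one, or many documents. Returns how many documents were updated.
    int update(Predicate<Map<String, String>> condition, String field, String value) {
        int updated = 0;
        for (Map<String, String> doc : docs) {
            if (condition.test(doc)) {
                doc.put(field, value);
                updated++;
            }
        }
        return updated;
    }

    String get(int index, String field) {
        return docs.get(index).get(field);
    }
}
```

Calling `update(d -> "1".equals(d.get("id")), ...)` behaves like a primary-key update if ids happen to be unique, while a looser condition gives a bulk update with the same call.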


--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: Incremental Field Updates

2010-03-29 Thread mark harwood

>Variant d) sounds most logical? And enables all sorts of fun stuff.

So the duplicate-key docs can have different values for initial-insert fields 
but partial updates will cause sharing of  a common field value?
And subsequent same-key doc inserts do or don't share these previous 
"partial-update" values?

Sounds like a complex model for users to understand let alone code support for.
Everyone gets primary keys though.




----- Original Message -----
From: Earwin Burrfoot 
To: java-dev@lucene.apache.org
Sent: Mon, 29 March, 2010 10:14:24
Subject: Re: Incremental Field Updates

>>If someone needs this, it can be built over lucene, without
>>introducing it as a core feature and needlessly complicating things.
>
> I think with any partial-update feature the *absence* of primary key support 
> would  "needlessly complicate things":
> If Lucene is not capable of performing duplicate detection on insert (because 
> it has no notion of a primary key field), we need to be prepared for the 
> situation where we have duplicate-key docs in the index.
> What then happens when Grant wants to do a "partial update" as opposed to the 
> existing full-update semantics which first deletes all documents containing 
> the supplied term (always a form of primary key)?
> Which document instance gets "partially updated"? We either:
> a) throw a "duplicate" error (which ideally should have happened back at dup 
> insert time)
> b) Choose one of the documents to "partially update" and keep the duplicate(s)
> c) Choose one of the documents to "partially update" and delete the 
> duplicate(s)
> d) "Partially update" all of the duplicate(s)
> All less than ideal.

Variant d) sounds most logical? And enables all sorts of fun stuff.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: Incremental Field Updates

2010-03-29 Thread Earwin Burrfoot

>>If someone needs this, it can be built over lucene, without
>>introducing it as a core feature and needlessly complicating things.
>
> I think with any partial-update feature the *absence* of primary key support 
> would  "needlessly complicate things":
> If Lucene is not capable of performing duplicate detection on insert (because 
> it has no notion of a primary key field), we need to be prepared for the 
> situation where we have duplicate-key docs in the index.
> What then happens when Grant wants to do a "partial update" as opposed to the 
> existing full-update semantics which first deletes all documents containing 
> the supplied term (always a form of primary key)?
> Which document instance gets "partially updated"? We either:
> a) throw a "duplicate" error (which ideally should have happened back at dup 
> insert time)
> b) Choose one of the documents to "partially update" and keep the duplicate(s)
> c) Choose one of the documents to "partially update" and delete the 
> duplicate(s)
> d) "Partially update" all of the duplicate(s)
> All less than ideal.

Variant d) sounds most logical? And enables all sorts of fun stuff.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: Incremental Field Updates

2010-03-29 Thread mark harwood

>I can delete by lucene-generated docId. 


Which users used to have to find by first coding a primary-key-term search. 
Delete by term removed this step to make life easier.


>If someone needs this, it can be built over lucene, without
>introducing it as a core feature and needlessly complicating things.

I think with any partial-update feature the *absence* of primary key support 
would  "needlessly complicate things":
If Lucene is not capable of performing duplicate detection on insert (because 
it has no notion of a primary key field), we need to be prepared for the 
situation where we have duplicate-key docs in the index.
What then happens when Grant wants to do a "partial update" as opposed to the 
existing full-update semantics which first deletes all documents containing the 
supplied term (always a form of primary key)? 
Which document instance gets "partially updated"? We either:
a) throw a "duplicate" error (which ideally should have happened back at dup 
insert time)
b) Choose one of the documents to "partially update" and keep the duplicate(s)
c) Choose one of the documents to "partially update" and delete the duplicate(s)
d) "Partially update" all of the duplicate(s)
All less than ideal.
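
To pin down what the four options would actually do, here is a small sketch over an in-memory list of documents. It is only an illustration under my own assumptions: the key lives in a field called "id", "one of the documents" is taken to mean the first match, and `PartialUpdater` and `Policy` are hypothetical names, not anything in Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: the four duplicate-handling options a) to d) as policies.
final class PartialUpdater {
    enum Policy { FAIL, UPDATE_FIRST_KEEP_REST, UPDATE_FIRST_DELETE_REST, UPDATE_ALL }

    // Sets field=value on docs whose "id" field equals key, resolving
    // duplicate-key docs according to the chosen policy.
    static void partialUpdate(List<Map<String, String>> docs, String key,
                              String field, String value, Policy policy) {
        List<Map<String, String>> matches = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (key.equals(doc.get("id"))) {
                matches.add(doc);
            }
        }
        if (matches.isEmpty()) {
            return; // nothing to update
        }
        if (matches.size() > 1 && policy == Policy.FAIL) {
            throw new IllegalStateException("duplicate key: " + key); // option a)
        }
        if (policy == Policy.UPDATE_ALL) {
            for (Map<String, String> doc : matches) {
                doc.put(field, value); // option d)
            }
            return;
        }
        matches.get(0).put(field, value); // options b) and c), or a single match
        if (policy == Policy.UPDATE_FIRST_DELETE_REST) {
            for (int i = 1; i < matches.size(); i++) {
                Map<String, String> duplicate = matches.get(i);
                docs.removeIf(d -> d == duplicate); // option c): drop by identity
            }
        }
    }
}
```

Even in this toy form, options b) and c) depend on an arbitrary notion of "first", which is part of why none of them feels ideal without a real primary key.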

I know we are schema-averse with Lucene (and I value that) but surely any 
partial update feature has to start with a strongly maintained notion of 
document identity as a foundation?
Rather than "needless complexity" I'd argue this was "needed rigour" and 
actually simplifies the user's job if Lucene can do the duplicate-key-on-insert 
check automatically rather than relying on ropy application code and dealing 
with any failures in that.
Of course primary keys are not mandatory. You only use them when you need this 
behaviour - just like in SQL.

Cheers
Mark






Re: Incremental Field Updates

2010-03-29 Thread Earwin Burrfoot

>>>>> Of course introducing the idea of updates also introduces the notion of a
>>>>> primary key and there's probably an entirely separate discussion to be had
>>>>> around user-supplied vs Lucene-generated keys.
>>>> Not sure I see that need.  Can you explain your reasoning a bit more?
>>> If you want to update a document you need a way of expressing *which*
>>> document you are updating.
>> This already works somehow for 'deleting' documents?
> Yes, the convention being user-supplied keys.

I can delete by lucene-generated docId. It's too volatile to be
database-style PK, but nonetheless.

> The question posed is if we add another use case where keys are required do 
> we want to turn this existing informal convention
> into more formalized support the way databases do eg duplicate key checks on 
> insert, auto-inc primary key generators.
If someone needs this, it can be built over lucene, without
introducing it as a core feature and needlessly complicating things.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: Incremental Field Updates

2010-03-29 Thread Mark Harwood



On 29 Mar 2010, at 07:45, Earwin Burrfoot  wrote:

>>>> Of course introducing the idea of updates also introduces the notion of a
>>>> primary key and there's probably an entirely separate discussion to be had
>>>> around user-supplied vs Lucene-generated keys.
>>> Not sure I see that need.  Can you explain your reasoning a bit more?
>> If you want to update a document you need a way of expressing *which*
>> document you are updating.
>
> This already works somehow for 'deleting' documents?


Yes, the convention being user-supplied keys. The question posed is if we add 
another use case where keys are required do we want to turn this existing 
informal convention into more formalized support the way databases do eg 
duplicate key checks on insert, auto-inc primary key generators. 




-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: Incremental Field Updates

2010-03-28 Thread Earwin Burrfoot

>>> Of course introducing the idea of updates also introduces the notion of a
>>> primary key and there's probably an entirely separate discussion to be had
>>> around user-supplied vs Lucene-generated keys.
>> Not sure I see that need.  Can you explain your reasoning a bit more?
> If you want to update a document you need a way of expressing *which*
> document you are updating.

This already works somehow for 'deleting' documents?

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: Incremental Field Updates

2010-03-28 Thread Mark Harwood



>> Of course introducing the idea of updates also introduces the notion of a
>> primary key and there's probably an entirely separate discussion to be had
>> around user-supplied vs Lucene-generated keys.
>
> Not sure I see that need.  Can you explain your reasoning a bit more?


If you want to update a document you need a way of expressing *which* document 
you are updating.

Cheers 
Mark

Re: Incremental Field Updates

2010-03-28 Thread Grant Ingersoll


On Mar 27, 2010, at 11:14 AM, Mark Harwood wrote:

> Of course introducing the idea of updates also introduces the notion of a 
> primary key and there's probably an entirely separate discussion to be had 
> around user-supplied vs Lucene-generated keys.

Not sure I see that need.  Can you explain your reasoning a bit more?

> 
> That aside, the biggest concern for me here is the impact that this is likely 
> to have on search -  currently queries such as "a:1 AND b:2" are streamed 
> efficiently when evaluated because fields a and b have long postings lists 
> conveniently sorted in doc-id insertion order that can be walked in sequence. 
> If there are to be disjoint, partial  docs, with updated contents arriving 
> out-of-primary-key-order this is bound to introduce costly disk seeks to the 
> query process or require commit-time merges/sorts to preserve the doc-ordered 
> posting lists needed to maintain search speed. Both of these strategies come 
> at a reasonable cost. Of course some form of RAM-based value caching 
> (allowing us to randomly look up the latest value for field b in doc x) is 
> fast but probably only suited to small-scale deployments.

Indeed, part of me thinks this is especially suited for flex indexing, where I
can make a design-time decision to accept potentially slower search in exchange
for a high update rate.


> 
> It's probably worth thinking through the scenarios we want to cater for. 
> Maybe a Digg-like scenario with users voting on document popularity *can* be 
> catered for with RAM-based field caches because the data (count of votes) is 
> small enough to cache? 

Agreed.  Many social applications require updating one or two fields very 
frequently (popularity, ratings, votes, etc.)


> 
> Cheers,
> Mark
> 
> 
> On 27 Mar 2010, at 11:25, Grant Ingersoll wrote:
> 
>> First off, this is something I've had in my head for a long time, but don't 
>> have any code.
>> 
>> As many of you know, one of the main things that vexes any search engine 
>> based on an inverted index is how to do fast updates of just one field w/o 
>> having to delete and re-add the whole document like we do today.   When I 
>> think about the whole update problem, I keep coming back to the notion of 
>> Photoshop (or any other real photo editing solution) Layers.  In a photo 
>> editing solution, when you want to hide/change a piece of a photo, it is 
>> considered best practice to add a layer over that part of the photo to be 
>> changed.  This way, the original photo is maintained and you don't have to 
>> worry about accidentally damaging the area you aren't interested in.  Thus, 
>> a layer is essentially a mask on the original photo. The analogy isn't quite 
>> the same here, but nevertheless...
>> So, thinking out loud here and I'm not sure on the best wording of this: 
>> 
>> When a document first comes in, it is all in one place, just as it is now. 
>> Then, when an update comes in on a particular field, we somehow mark in the 
>> index that the document in question is modified and then we add the new 
>> change onto the end of the index (just like we currently do when adding new 
>> docs, but this time it's just a doc w/ a single field). Then, when 
>> searching, we would, when scoring the affected documents, go to a secondary 
>> process that knew where to look up the incremental changes. As background 
>> merging takes place, these "disjoint" documents would be merged back 
>> together. We'd maybe even consider a "high update" merge scheduler that 
>> could more frequently handle these incremental merges.   
>> 
>> 
>> I'm not sure where we would maintain the list of changes.  That is, is it 
>> something that goes in the posting list, or is it a side structure.  I think 
>> in the posting list would be too slow.  Also, perhaps it is worthwhile for
>> people to indicate that a particular field is expected to be updated while 
>> others maintain their current format so as not to incur the penalty on each.
>>  In a sense, the old field for that document is masked by the new field. I 
>> think, given proper index structure, that we maybe could make that marking 
>> of the old field fast (maybe it's a pointer to the new field, maybe it's 
>> just a bit indicating to go look in the "update" segment)
>> 
>> On the search side, I think performance would still be maintained b/c even 
>> in high update envs. you aren't usually talking about more than a few 
>> thousand changes in a minute or two and the background merger would be 
>> responsible for keeping the total number of disjoint documents low.
>> 
>> I realize there isn't a whole lot to go on here just yet, but perhaps it 
>> will spawn some questions/ideas that will help us work it out in a better 
>> way.
>> 
>> At any rate, I think adding incr. field update capability would be a huge 
>> win for Lucene.
>> 
>> -Grant
>

Re: Incremental Field Updates

2010-03-27 Thread Mark Harwood

Of course introducing the idea of updates also introduces the notion of a 
primary key and there's probably an entirely separate discussion to be had 
around user-supplied vs Lucene-generated keys.

That aside, the biggest concern for me here is the impact that this is likely 
to have on search -  currently queries such as "a:1 AND b:2" are streamed 
efficiently when evaluated because fields a and b have long postings lists 
conveniently sorted in doc-id insertion order that can be walked in sequence. 
If there are to be disjoint, partial  docs, with updated contents arriving 
out-of-primary-key-order this is bound to introduce costly disk seeks to the 
query process or require commit-time merges/sorts to preserve the doc-ordered 
posting lists needed to maintain search speed. Both of these strategies come at 
a reasonable cost. Of course some form of RAM-based value caching (allowing us 
to randomly look up the latest value for field b in doc x) is fast but probably 
only suited to small-scale deployments.
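
For anyone less familiar with why doc-ordered postings matter here, the following is a minimal sketch of how a conjunction like "a:1 AND b:2" can be evaluated in one sequential pass when both lists are sorted by doc id. It is a plain two-pointer intersection over int arrays, not Lucene's actual scorer code, and `PostingsIntersect` is a name I made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Two-pointer intersection of doc-id lists. It is only correct because both
// inputs are sorted ascending, which is exactly the property that
// out-of-order partial docs would break.
final class PostingsIntersect {
    static List<Integer> and(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                hits.add(a[i]); // doc appears in both lists
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++; // advance whichever list is behind
            } else {
                j++;
            }
        }
        return hits;
    }
}
```

Each list is read front to back exactly once, with no seeking backwards. If updated values for field b arrived in a different doc order, this single pass would no longer work without a sort or merge step first.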

It's probably worth thinking through the scenarios we want to cater for. Maybe 
a Digg-like scenario with users voting on document popularity *can* be catered 
for with RAM-based field caches because the data (count of votes) is small 
enough to cache? 
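
As a back-of-the-envelope illustration of the RAM-based option, a per-document vote counter could be as simple as one int per doc id. This is a sketch under the assumption that doc ids stay stable for the lifetime of the cache (which is not true across Lucene segment merges); `VoteCache` is a hypothetical name, not an existing class.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical RAM-resident vote counter: one int per doc id.
// Assumes doc ids do not change while the cache is alive.
final class VoteCache {
    private final AtomicIntegerArray votes;

    VoteCache(int maxDoc) {
        votes = new AtomicIntegerArray(maxDoc); // all counts start at zero
    }

    // Record one vote and return the new count; safe under concurrent voters.
    int vote(int docId) {
        return votes.incrementAndGet(docId);
    }

    int get(int docId) {
        return votes.get(docId);
    }
}
```

At 4 bytes per document, 10 million documents cost about 40 MB for the array, which is why a small numeric field like a vote count looks cacheable while arbitrary updated text would not be.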

Cheers,
Mark


On 27 Mar 2010, at 11:25, Grant Ingersoll wrote:

> First off, this is something I've had in my head for a long time, but don't 
> have any code.
> 
> As many of you know, one of the main things that vexes any search engine 
> based on an inverted index is how to do fast updates of just one field w/o 
> having to delete and re-add the whole document like we do today.   When I 
> think about the whole update problem, I keep coming back to the notion of 
> Photoshop (or any other real photo editing solution) Layers.  In a photo 
> editing solution, when you want to hide/change a piece of a photo, it is 
> considered best practice to add a layer over that part of the photo to be 
> changed.  This way, the original photo is maintained and you don't have to 
> worry about accidentally damaging the area you aren't interested in.  Thus, a 
> layer is essentially a mask on the original photo. The analogy isn't quite 
> the same here, but nevertheless...
> So, thinking out loud here and I'm not sure on the best wording of this: 
> 
> When a document first comes in, it is all in one place, just as it is now. 
> Then, when an update comes in on a particular field, we somehow mark in the 
> index that the document in question is modified and then we add the new 
> change onto the end of the index (just like we currently do when adding new 
> docs, but this time it's just a doc w/ a single field). Then, when searching, 
> we would, when scoring the affected documents, go to a secondary process that 
> knew where to look up the incremental changes. As background merging takes 
> place, these "disjoint" documents would be merged back together. We'd maybe 
> even consider a "high update" merge scheduler that could more frequently 
> handle these incremental merges.   
> 
> 
> I'm not sure where we would maintain the list of changes.  That is, is it 
> something that goes in the posting list, or is it a side structure.  I think 
> in the posting list would be too slow.  Also, perhaps it is worthwhile for
> people to indicate that a particular field is expected to be updated while 
> others maintain their current format so as not to incur the penalty on each.
>  In a sense, the old field for that document is masked by the new field. I 
> think, given proper index structure, that we maybe could make that marking of 
> the old field fast (maybe it's a pointer to the new field, maybe it's just a 
> bit indicating to go look in the "update" segment)
> 
> On the search side, I think performance would still be maintained b/c even in 
> high update envs. you aren't usually talking about more than a few thousand 
> changes in a minute or two and the background merger would be responsible for 
> keeping the total number of disjoint documents low.
> 
> I realize there isn't a whole lot to go on here just yet, but perhaps it will 
> spawn some questions/ideas that will help us work it out in a better way.
> 
> At any rate, I think adding incr. field update capability would be a huge win 
> for Lucene.
> 
> -Grant

Incremental Field Updates

2010-03-27 Thread Grant Ingersoll

First off, this is something I've had in my head for a long time, but don't 
have any code.

As many of you know, one of the main things that vexes any search engine based 
on an inverted index is how to do fast updates of just one field w/o having to 
delete and re-add the whole document like we do today.   When I think about the 
whole update problem, I keep coming back to the notion of Photoshop (or any 
other real photo editing solution) Layers.  In a photo editing solution, when 
you want to hide/change a piece of a photo, it is considered best practice to 
add a layer over that part of the photo to be changed.  This way, the original 
photo is maintained and you don't have to worry about accidentally damaging the 
area you aren't interested in.  Thus, a layer is essentially a mask on the 
original photo. The analogy isn't quite the same here, but nevertheless...
So, thinking out loud here and I'm not sure on the best wording of this: 

When a document first comes in, it is all in one place, just as it is now. 
Then, when an update comes in on a particular field, we somehow mark in the 
index that the document in question is modified and then we add the new change 
onto the end of the index (just like we currently do when adding new docs, but 
this time it's just a doc w/ a single field). Then, when searching, we would, 
when scoring the affected documents, go to a secondary process that knew where 
to look up the incremental changes. As background merging takes place, these 
"disjoint" documents would be merged back together. We'd maybe even consider a 
"high update" merge scheduler that could more frequently handle these 
incremental merges.   


I'm not sure where we would maintain the list of changes.  That is, is it 
something that goes in the posting list, or is it a side structure.  I think in 
the posting list would be too slow.  Also, perhaps it is worthwhile for people
to indicate that a particular field is expected to be updated while others 
maintain their current format so as not to incur the penalty on each.
 In a sense, the old field for that document is masked by the new field. I 
think, given proper index structure, that we maybe could make that marking of 
the old field fast (maybe it's a pointer to the new field, maybe it's just a 
bit indicating to go look in the "update" segment)
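
Since there is no code yet, here is a very rough sketch of the "bit plus update segment" variant, just to make the idea concrete. It is a toy under my own assumptions: a single field, values held in memory, and made-up names (`MaskedField`, `update`, `merge`) that do not correspond to anything in Lucene.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Toy model of one updatable field: original values, a per-doc bit saying
// "go look in the update segment", and a side structure holding the updates.
final class MaskedField {
    private final String[] base;                                  // original values by doc id
    private final BitSet updated;                                 // mask over the base values
    private final Map<Integer, String> updateSegment = new HashMap<>();

    MaskedField(String[] values) {
        base = values.clone();
        updated = new BitSet(base.length);
    }

    // Record an update without touching the original value.
    void update(int docId, String value) {
        updateSegment.put(docId, value);
        updated.set(docId);
    }

    // Readers check the bit first and only pay the extra lookup when it is set.
    String read(int docId) {
        return updated.get(docId) ? updateSegment.get(docId) : base[docId];
    }

    // Background merge: fold updates back into the base and clear the mask.
    void merge() {
        for (Map.Entry<Integer, String> e : updateSegment.entrySet()) {
            base[e.getKey()] = e.getValue();
        }
        updateSegment.clear();
        updated.clear();
    }

    int pendingUpdates() {
        return updated.cardinality();
    }
}
```

Reads return the same values before and after `merge()`; the merge only shrinks the number of "disjoint" documents, which is the quantity a "high update" merge scheduler would try to keep low.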

On the search side, I think performance would still be maintained b/c even in 
high update envs. you aren't usually talking about more than a few thousand 
changes in a minute or two and the background merger would be responsible for 
keeping the total number of disjoint documents low.

I realize there isn't a whole lot to go on here just yet, but perhaps it will 
spawn some questions/ideas that will help us work it out in a better way.

At any rate, I think adding incr. field update capability would be a huge win 
for Lucene.

-Grant
