> > When I update a field, I > want to update all of it, not just part of it. No? >
Well ... might be. But the most common case for us are fields to which we want to add data or remove. For example ACLs - you don't want to replace the entire field w/ the document, but simply to add/remove access for certain people/groups. Same goes for "social" fields, like tags, ratings, bookmarks etc. - the granularity of the update is to associate/dissociate a particular value w/ the field + doc, and not update the entire field. Shai On Sun, May 9, 2010 at 10:31 AM, Babak Farhang <farh...@gmail.com> wrote: > > No, actually, you can update index-only fields also. It all depends on > the > > operation that you ask to do on the fields. > > I love the level of control this provides, but again, I was talking at > the user level. > > > If you want to e.g. remove an entire field w/ such update operation, then > it > > becomes more expensive > > That's the typical usage scenario, I imagine. When I update a field, I > want to update all of it, not just part of it. No? > > (The lower level semantics of twiddling with the postings is poorly > understood by the typical user..) > > > We didn't > > face such a scenario though, and I think it's probably a rare one? > > As rare as any time you want to update an indexed-only field. Not a > serious limitation (but perhaps worth noting?) > Perhaps at the API level, you provide an updateIndexedOnlyField that > takes the old value as well as the new value for the field. > > Anyway, I think your approach would be a great addition to the > toolkit. Would love to see it even in rough cut form :)) > > -Babak > > On Sun, May 9, 2010 at 12:49 AM, Shai Erera <ser...@gmail.com> wrote: > > No, actually, you can update index-only fields also. It all depends on > the > > operation that you ask to do on the fields. For example, if the query to > > execute is something like "update all documents w/ tags:ibm -> remove > terms > > t1, t2, t3 and add terms t4, t5", then the result of such request would > > dissociate t1-3 from those docs that answer tags:ibm and associate them > w/ > > t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the > > following posting updates will be done: > > t1: -1, -4, -10 > > t2: -1, -4, -10 > > t3: -1, -4, -10 > > t4, 1, 4, 10 > > t5, 1, 4, 10 > > (in addition to whatever other updates that are associated with those > > postings). > > > > At search time, if you search for "t1 OR t2", then the regular t1 and t2 > > postings will be merged on-the-fly w/ the updated ones to remove docs 1, > 4, > > 10. > > > > If you want to e.g. remove an entire field w/ such update operation, then > it > > becomes more expensive, but in general you'd need to iterate over the > > field's terms and dissociate the documents from all the terms. We didn't > > face such a scenario though, and I think it's probably a rare one? > > > > Shai > > > > On Sun, May 9, 2010 at 9:39 AM, Babak Farhang <farh...@gmail.com> wrote: > >> > >> Shai, > >> > >> I think this is an interesting approach. I can see how I could > >> [incrementally] update a stored, indexed field this way, but I don't > >> understand how I could update an indexed-only field. Here's why: for a > >> stored (and indexed) field, I can always determine what terms to > >> remove ('-') from the postings, but for an indexed-only field I'd have > >> no [practical] way to know.. > >> > >> So under this approach, I'm thinking at a user level, a Lucene field > >> would be updateable only if it's at least stored. > >> > >> Is that right? > >> > >> -Babak > >> > >> On Wed, May 5, 2010 at 11:17 AM, Shai Erera <ser...@gmail.com> wrote: > >> > Yes Mike - I don't know yet if two MPs will be used, one for the > stacked > >> > segments and one for the general segments (which will include the > >> > stacked > >> > ones as well) .. feels like one MP should be enough, but this can be > >> > decided > >> > on as we progress. > >> > > >> > This approach allows you to update every term in every already indexed > >> > field, as well as add terms to already indexed fields ... and add > >> > totally > >> > new fields, with lots of text in them. So e.g. there are two neat use > >> > cases > >> > that come to mind: > >> > 1) Allow users to rate search results, favor them etc. > >> > 2) Or even comment them, > >> > I think Google offers the 2nd. Both translate into updating the search > >> > result's already indexed document w/ the new rating, comment etc. w/o > >> > needing to reindex the document. > >> > > >> > I will try to find perf results numbers. It's been long time ago, hope > >> > all > >> > the documents are still where they were :). Indeed, for terms like > ACLs, > >> > it > >> > means that each query had to merge dozens of postings lists. For that > I > >> > implemented an alternative solution, which uses a payload-like > structure > >> > that registers for each document the list of ACLs that are associated > >> > with > >> > it (not as strings, it was more efficient). Then, if the query > included > >> > dozens of such terms, I created a filter-like object which for every > >> > matching document by the query checked if it matches the ACLs list of > >> > the > >> > document. This is usually slower, because the ACLs themselves don't > >> > drive > >> > the query, which means more matches will be found. That was a tradeoff > >> > which > >> > one could configure based on the number of such terms in the query, > the > >> > number of updated terms etc. > >> > > >> > But in essence you're right - if the solution is generic to cover any > >> > term, > >> > we cannot use such payload-based feature. We might need to merge the > >> > stacked > >> > segments more frequently. This is pending perf testing though. > >> > > >> > Shai > >> > > >> > On Wed, May 5, 2010 at 6:54 PM, Michael McCandless > >> > <luc...@mikemccandless.com> wrote: > >> >> > >> >> Catching up here :) > >> >> > >> >> This is great stuff Shai -- I like the notion of "negative" postings, > >> >> that "subtract" docs from previous generations as you iterate them. > >> >> > >> >> And I like the term "stacked segments". This fits very well with > >> >> Lucene's write-once approach, ie, a writer can at any time stack a > new > >> >> segment (writes to new files) "over" an old segment, like the layers > >> >> in photoshop. A reader merges all stacks on-the-fly when reading. > >> >> > >> >> And the merge policy now picks from 2 dimensions right? Ie it may > >> >> want to simply consolidate stacks on an old segment but otherwise not > >> >> merge that segment (eg for very large segments that have accumulated > >> >> updates), and normal merging would of course consolidate all stacks > >> >> for all segments merged. > >> >> > >> >> Wouldn't this approach conceivably allow you to alter single terms > >> >> within a single field (we'd have to figure out how we'd expose the > API > >> >> for this)? EG if I appended some text to an already-indexed field? > >> >> (In addition to adding a new field to an already indexed doc, or > >> >> replacing an indexed field on a previously indexed doc). > >> >> > >> >> Did you have any hard perf numbers? Merge sorting N streams is > >> >> surprisingly costly... we may need/want to have a reader pre-merge > >> >> (using up RAM) any "long tail" of stacked segments as long as they > are > >> >> small enough... > >> >> > >> >> This sounds great!! > >> >> > >> >> Mike > >> >> > >> >> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <ser...@gmail.com> > wrote: > >> >> > Hi, > >> >> > > >> >> > WARNING: following email is a bit long, but I think is worth the > >> >> > reading > >> >> > :) > >> >> > > >> >> > I would like to describe my implementation of incremental field > >> >> > updates > >> >> > in Juru (the former search library I've worked on in IBM). I will > >> >> > later > >> >> > discuss how I think it can be implemented in Lucene. > >> >> > > >> >> > The motivation/requirement came from a doc management system which > >> >> > used > >> >> > Juru as its search component. The system included document > libraries > >> >> > where users could create files and upload documents. A user could > >> >> > belong > >> >> > to any number of libraries and complex ACLs model was used (down to > >> >> > the > >> >> > level of the file). ACLs and Folders were modeled as categories in > >> >> > the > >> >> > index (boolean-like terms). Files and folders could be moved around > >> >> > and > >> >> > access to a library/folder/file could be granted/revoked at any > given > >> >> > time. Therefore, such updates usually affected hundreds (and > >> >> > thousands) > >> >> > of documents. Overall, the index managed millions of documents, > tens > >> >> > of > >> >> > thousands of libraries and hundreds of thousands of ACLs (large > >> >> > organization :)). To get a rough understanding on the number of > such > >> >> > updates - every several minutes, tens of thousands of documents > were > >> >> > updated due to such changes only (in addition to the regular > content > >> >> > updates). > >> >> > > >> >> > We were asked to support requests in the following form: "update > all > >> >> > docs > >> >> > that match <criteria> --> do <operation>" where: > >> >> > * <criteria> was "a single doc", "docs belonging to a category" and > >> >> > "docs > >> >> > belonging to a set of categories". > >> >> > * <operation> was "add categories NEW" (NEW might not even exist in > >> >> > the > >> >> > index yet, or already associated w/ the document), "remove > categories > >> >> > OLD" > >> >> > (irregardless if the docs were already associated w/ OLD or not) > and > >> >> > "remove all OLD and add all NEW". > >> >> > > >> >> > The solution I've implemented to support these requests turned out > to > >> >> > actually allow you to update every term (!) in the index: suppose > >> >> > that > >> >> > you have a table, which recorded tuples like <docid, term, +/->. > The > >> >> > record <1, "ibm", '+'> means that doc 1 is associated w/ the term > >> >> > "ibm", > >> >> > and the record <1, "hp", '-'> means that the same document is not > >> >> > associated w/ the word "hp". Then, you could very easily ask for > all > >> >> > documents that are assoicated w/ "hp", and the result would not > >> >> > include > >> >> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal id > >> >> > the > >> >> > document received. > >> >> > > >> >> > I've kept two types of postings in the index: regular and updates. > >> >> > Taking the above examples, "ibm" regular posting looked like > <"ibm", > >> >> > 1, > >> >> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm", > >> >> > +2, > >> >> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly > for > >> >> > "hp". > >> >> > > >> >> > During search time, when a query with the word "ibm" was submitted, > I > >> >> > create a virtual posting which reads from both the regular and the > >> >> > updates, and merges them on the fly according to the +/- signs. > Since > >> >> > both postings are sorted in ascending order, the merge is very > >> >> > efficient, and query time is hardly affected. > >> >> > > >> >> > Those postings are merged from time to time in a process that is > >> >> > similar > >> >> > to how Lucene works today, which keeps the update postings > relatively > >> >> > small and manageable. > >> >> > > >> >> > Now here comes the fun part - how I think it can be implemented in > >> >> > Lucene ! > >> >> > > >> >> > To be honest, this sat on my TODO list for a very long time and > only > >> >> > a > >> >> > couple of months ago I figured out how to implement it in Lucene. > The > >> >> > main difficulty I had was around the difference between the > >> >> > write-once > >> >> > policy in Juru and Lucene - in Lucene, once a segment is written, > it > >> >> > cannot be changed. BUT, I've only recently realized that this isn't > >> >> > exactly true, because deleted docs do change existing segments. The > >> >> > deletes are kept in a separate file to the segment (.del) and have > >> >> > their > >> >> > own generation. Deletes, as I understood then, and Grant helped me > >> >> > term > >> >> > them better, can be defined as "Stacked Segments" - they add data > to > >> >> > a > >> >> > segment, which from time to time are integrated into the segment > >> >> > (unlike > >> >> > Photoshop Layers, but my understanding of Photoshop is limited). > And > >> >> > the > >> >> > Lucene engine knows how to combine the two, giving precedence to > the > >> >> > deletes. > >> >> > > >> >> > By introducing an "Updates Stacked Segment", we can encode postings > >> >> > w/ > >> >> > the '+'/'-' signs, and when TermDocs/Positions is requested, we can > >> >> > create a variation which merges the two lists. When segments are > >> >> > merged, > >> >> > the updates will be merged into the regular postings (just like > >> >> > deletes) > >> >> > and thus will be gone. In addition, this plays very nicely with > >> >> > readers > >> >> > that are currently reading the index, as well as we can have > >> >> > generations > >> >> > for the updates - really like deletes ! > >> >> > > >> >> > I think that Lucene's architecture allows for such a solution very > >> >> > cleanly and nicely (and I believe flex makes it even easier). We > can > >> >> > (later, after you've digested the idea) discuss whether this should > >> >> > be > >> >> > built into the current IW, or an extension like UpdateableIW. The > API > >> >> > I've been thinking about should really be like deletes, allowing to > >> >> > update docs based on Term or Query. I defer the API discussion for > >> >> > later > >> >> > for now. > >> >> > > >> >> > As for stored fields, this was a real challenge to support in Juru, > >> >> > but > >> >> > I think that w/ "Stacked Segments" and Lucene's architecture, this > >> >> > should > >> >> > be much easier - adding stacked stored fields ... > >> >> > > >> >> > As you've noticed, the update postings are not DGap encoded, and > sign > >> >> > needs to be preserved. While I haven't implemented it in Juru, I > >> >> > think > >> >> > that perhaps this can be improved by keeping the '-' and '+' lists > >> >> > separated. We will need to register somewhere which came before > which > >> >> > because order matters a lot here (and I'm not talking about > >> >> > concurrency > >> >> > - simple update instructions order). I have some idea how this can > be > >> >> > achieved, but I refrain from describing it now, to not make this > >> >> > email > >> >> > even longer :). > >> >> > > >> >> > I've mentioned that this approach can be applied to any term and > not > >> >> > just categories under some circumstances. Basically, as soon as you > >> >> > update a term, its DF is no longer true, unless you are able to > take > >> >> > the > >> >> > updates into account. We can defer the discussion on that, but > >> >> > clearly > >> >> > for many fields, incrementally update them should not affect > >> >> > precision, > >> >> > as they're not used for that type of scoring ... Maybe, by keeping > >> >> > separate '+' and '-' lists we can compute statistics precisely. And > I > >> >> > haven't given much thought yet to how this and Mike's flex scoring > >> >> > will > >> >> > be integrated. > >> >> > > >> >> > BTW, a word on Parallel Indexing - the two are completely > orthogonal. > >> >> > Once PI is introduced, one can index all the updateable fields in a > >> >> > dedicated slice, for perhaps improving search performance for > slices > >> >> > that are not updateable (not involving code which attempts to read > >> >> > and > >> >> > merge update and regular lists on the fly). Also, incremental field > >> >> > updates support all of PI's scenarios, even though some will be > done > >> >> > more efficiently w/ PI. But this too is a matter for a separate > >> >> > discussion :). > >> >> > > >> >> > That's it ! I believe I've given you all the details I have about > the > >> >> > approach and high level proposed solution for Lucene. Perhaps some > >> >> > details slipped my mind, but if you ask the right questions, I'm > sure > >> >> > I'll be able to answer them :). I would like to emphasize that > since > >> >> > this was already implemented (in Juru) - this is more than just a > "I > >> >> > think this approach can work" proposal ... > >> >> > > >> >> > I would appreciate your comments on this. I would like to start > >> >> > implementing it soon, and so as a first step, please share your > >> >> > comments > >> >> > on the overall approach. I'll then write a more detailed > description > >> >> > on > >> >> > how I think to impl it in Lucene (been spending some time on that), > >> >> > and > >> >> > we can have more detailed (and fun) discussions on the low level > >> >> > details. > >> >> > > >> >> > Shai > >> >> > > >> >> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <farh...@gmail.com> > >> >> > wrote: > >> >> >> > >> >> >> Good point. I meant the model at the document level: i.e. what > >> >> >> milestones does a document go through in its life cycle. Today: > >> >> >> > >> >> >> created --> deleted > >> >> >> > >> >> >> With incremental updates: > >> >> >> > >> >> >> created --> update1 --> update2 --> deleted > >> >> >> > >> >> >> I think what I'm trying to say is that this second threaded > sequence > >> >> >> of state changes seems intuitively more fragile under concurrent > >> >> >> scenarios. So for example, in a lock-free design, the system > would > >> >> >> also have to anticipate the following sequence of events: > >> >> >> > >> >> >> created --> update1 --> deleted --> update2 > >> >> >> > >> >> >> and consider update2 a null op. I'm imagining there are other > cases > >> >> >> that I can't think of.. > >> >> >> > >> >> >> -Babak > >> >> >> > >> >> >> > >> >> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless > >> >> >> <luc...@mikemccandless.com> wrote: > >> >> >> > write once, plus the option to the app to keep multiple commit > >> >> >> > points > >> >> >> > around (by customizing the deletion policy). > >> >> >> > > >> >> >> > Actually order of operations / commits very much matters in > Lucene > >> >> >> > today. > >> >> >> > > >> >> >> > Deletions are not idempotent: if you add a doc w/ term X, delete > >> >> >> > by > >> >> >> > term X, add a new doc with term X... that's very different than > if > >> >> >> > you > >> >> >> > moved the delete op to the end. Ie the deletion only applies to > >> >> >> > the > >> >> >> > docs added before it. > >> >> >> > > >> >> >> > Mike > >> >> >> > > >> >> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang < > farh...@gmail.com> > >> >> >> > wrote: > >> >> >> >> Sure. Because of the write once principle. But at some cost > >> >> >> >> (duplicated data). I was just agreeing that it would not be a > >> >> >> >> good > >> >> >> >> idea to bake in version-ing by keeping the layers around > forever > >> >> >> >> in > >> >> >> >> a > >> >> >> >> merged index; I wasn't keying in on transactions per se. > >> >> >> >> > >> >> >> >> Speaking of transactions: I'm not sure if we should worry about > >> >> >> >> this > >> >> >> >> much yet, but with "updates" the order of the transaction > commits > >> >> >> >> seems important. I think commit order is less important today > in > >> >> >> >> Lucene because its model supports only 2 types of events: > >> >> >> >> document > >> >> >> >> creation--which only happens once, and document deletion, which > >> >> >> >> is > >> >> >> >> idempotent. What do you think? Will commits have to be ordered > >> >> >> >> if > >> >> >> >> we > >> >> >> >> introduce updates? Or does the onus of maintaining order fall > on > >> >> >> >> the > >> >> >> >> application? > >> >> >> >> > >> >> >> >> -Babak > >> >> >> >> > >> >> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless > >> >> >> >> <luc...@mikemccandless.com> wrote: > >> >> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang > >> >> >> >>> <farh...@gmail.com> > >> >> >> >>> wrote: > >> >> >> >>>>> I think they get merged in by the merger, ideally in the > >> >> >> >>>>> background. > >> >> >> >>>> > >> >> >> >>>> That sounds sensible. (In other words, we wont concern > >> >> >> >>>> ourselves > >> >> >> >>>> with > >> >> >> >>>> roll backs--something possible while a "layer" is still > >> >> >> >>>> around.) > >> >> >> >>> > >> >> >> >>> Actually roll backs would still be very possible even if > layers > >> >> >> >>> are > >> >> >> >>> merged. > >> >> >> >>> > >> >> >> >>> Ie, one could keep multiple commits around, and the older > >> >> >> >>> commits > >> >> >> >>> would still be referring to the old postings + layers, keeping > >> >> >> >>> them > >> >> >> >>> alive. > >> >> >> >>> > >> >> >> >>> Lucene would still be transactional with such an approach. > >> >> >> >>> > >> >> >> >>> Mike > >> >> >> >>> > >> >> >> >>> > >> >> >> >>> > >> >> >> >>> > --------------------------------------------------------------------- > >> >> >> >>> To unsubscribe, e-mail: > java-dev-unsubscr...@lucene.apache.org > >> >> >> >>> For additional commands, e-mail: > java-dev-h...@lucene.apache.org > >> >> >> >>> > >> >> >> >>> > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > --------------------------------------------------------------------- > >> >> >> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > >> >> >> >> For additional commands, e-mail: > java-dev-h...@lucene.apache.org > >> >> >> >> > >> >> >> >> > >> >> >> > > >> >> >> > > >> >> >> > > --------------------------------------------------------------------- > >> >> >> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > >> >> >> > For additional commands, e-mail: > java-dev-h...@lucene.apache.org > >> >> >> > > >> >> >> > > >> >> >> > >> >> >> > >> >> >> > --------------------------------------------------------------------- > >> >> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > >> >> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org > >> >> >> > >> >> > > >> >> > > >> >> > >> >> --------------------------------------------------------------------- > >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > >> >> For additional commands, e-mail: dev-h...@lucene.apache.org > >> >> > >> > > >> > > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > >> For additional commands, e-mail: dev-h...@lucene.apache.org > >> > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > >