[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments

Shai Erera (JIRA) Thu, 26 Jul 2012 08:17:36 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423139#comment-13423139
 ]


Shai Erera commented on LUCENE-4258:
------------------------------------

There is more to it than just the referenced email. I've had a couple of 
discussions in the past about this with various people (and it is my fault that 
I didn't write them down and share them with the rest of you) -- I'll try to 
summarize a more detailed proposal below:

*API*
Add an updateFields method which takes a Constraint and OP (eventually, it 
might replace today's updateDocument):
* Constraint defines 'which documents' should be updated, and follows today's 
deleteDocument API (takes Term, Query and arrays of each)
* OP defines the actual update to do on those documents: 
** It has a TYPE, with 3 values (At least for now):
**# REPLACE_DOC -- replaces an entire document (essentially what updateDocument 
is today)
**# UPDATE_FIELD -- incrementally update a field
**# REPLACE_FIELD -- replaces a field entirely
** In addition, it takes a Field[] (or Iterable) to remove/add.
** In light of the recent changes to IndexableField and friends, perhaps what 
it should take is a concrete UpdateField with a boolean specifying whether to 
add/remove its content. Suggestions are welcome !
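
To make the API shape above concrete, here is a minimal sketch in Java. Every name in it (UpdateType, FieldChange, UpdateOp, TermConstraint, UpdateLog) is hypothetical and none of them exist in Lucene; the stand-in writer only records what was requested, and the Query variant of the Constraint is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout: none of these types exist in Lucene.
enum UpdateType { REPLACE_DOC, UPDATE_FIELD, REPLACE_FIELD }

// One field-level change: field name, value, and whether to add or remove it.
final class FieldChange {
    final String field;
    final String value;
    final boolean add; // true = add the value, false = remove it

    FieldChange(String field, String value, boolean add) {
        this.field = field;
        this.value = value;
        this.add = add;
    }
}

// The OP from the proposal: a type plus the field changes to apply.
final class UpdateOp {
    final UpdateType type;
    final List<FieldChange> changes;

    UpdateOp(UpdateType type, List<FieldChange> changes) {
        this.type = type;
        this.changes = new ArrayList<>(changes);
    }
}

// The Constraint, reduced to a single term (assumption: Query support omitted).
final class TermConstraint {
    final String field;
    final String text;

    TermConstraint(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

// Stand-in for the writer: it only records the requested updates as strings.
final class UpdateLog {
    final List<String> entries = new ArrayList<>();

    void updateFields(TermConstraint constraint, UpdateOp op) {
        for (FieldChange c : op.changes) {
            entries.add(op.type + " where " + constraint.field + ":" + constraint.text
                    + " " + (c.add ? "+" : "-") + c.field + ":" + c.value);
        }
    }
}
```

With this shape, "add ACL:SHAI to the document with id 42" would be a single updateFields call carrying one FieldChange, rather than a full re-index of the document.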

*Implementation*
The idea is to create StackedSegments, which, well, stack on top of current 
segments. The inspiration came from deletes, which can be viewed as a segment 
stacked on an existing segment, that marks which documents are deleted.

Following those semantics, a segment could be composed of these files:
* Layer 1: _0.prx, _0.fnm, _0.fdt ...
* Layer 2: _0_1.prx, _0_1.fdt (no updates to .fnm) -- override/merge info from 
layer 1
* Layer 3: _0_2.prx  -- override/merge info from layer 2
* Layer 4: _0_1.del -- deletes are *always* the last layer, regardless of 
their 'layer id' -- _0_1.del overrides everything, even _0_100.prx.
** And they can be stacked on themselves as today, e.g. _0_2.del etc.
I believe that we'll need an UpdateCodec or something ... this is the part of 
the internal API that we still need to understand better. Help from folks like 
you, Robert, will be greatly appreciated!
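
The layering rule above (update layers in generation order, deletes always last) can be sketched as a plain sort over file names. This is only an illustration: the StackedLayers class is hypothetical, and the file-name parsing assumes exactly the _segment_generation.ext pattern used in the list above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: orders the files of one stacked segment for application.
final class StackedLayers {
    // Generation parsed from a name like "_0_2.prx"; the base "_0.prx" counts as 0.
    static int generation(String fileName) {
        String stem = fileName.substring(0, fileName.lastIndexOf('.'));
        String[] parts = stem.split("_"); // "_0_2" -> ["", "0", "2"]
        return parts.length > 2 ? Integer.parseInt(parts[2]) : 0;
    }

    static boolean isDeletes(String fileName) {
        return fileName.endsWith(".del");
    }

    // Update layers first, ascending by generation; deletes last, whatever their id.
    static List<String> applicationOrder(List<String> files) {
        List<String> ordered = new ArrayList<>(files);
        ordered.sort((a, b) -> {
            int byKind = Boolean.compare(isDeletes(a), isDeletes(b));
            if (byKind != 0) return byKind;
            int byGen = Integer.compare(generation(a), generation(b));
            return byGen != 0 ? byGen : a.compareTo(b);
        });
        return ordered;
    }
}
```

In this sketch _0_1.del sorts after _0_100.prx even though its generation number is lower, which is the "deletes override everything" behavior described above.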

Two options to encode the posting lists:
* field:value --> +1, -5, +8, +12, -17 ... (simple, but cannot be encoded 
efficiently)
* Separate posting lists for additions and removals:
*# +field:value --> 1, 8, 12
*# -field:value --> 5, 17
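
As a rough illustration of the two encodings, the sketch below applies each one to a base posting list and arrives at the same result. The StackedPostings class is hypothetical, doc IDs are held in plain collections instead of a real postings format, and the signed variant assumes doc IDs start at 1 so that the sign is unambiguous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: merges a stacked update layer into a base posting list.
final class StackedPostings {
    // Option 1: a single signed list, e.g. +1, -5, +8 (positive = add, negative = remove).
    // Assumes doc IDs >= 1; doc 0 could not carry a sign in this representation.
    static List<Integer> applySigned(List<Integer> base, int[] signedDeltas) {
        TreeSet<Integer> docs = new TreeSet<>(base);
        for (int delta : signedDeltas) {
            if (delta > 0) {
                docs.add(delta);
            } else {
                docs.remove(-delta);
            }
        }
        return new ArrayList<>(docs); // ascending doc IDs
    }

    // Option 2: two plain ascending lists, one for additions and one for removals.
    static List<Integer> applySplit(List<Integer> base, int[] added, int[] removed) {
        TreeSet<Integer> docs = new TreeSet<>(base);
        for (int d : added) {
            docs.add(d);
        }
        for (int d : removed) {
            docs.remove(d);
        }
        return new ArrayList<>(docs); // ascending doc IDs
    }
}
```

The split form keeps each list a plain ascending sequence of doc IDs, which is the shape that delta encoding of postings normally relies on.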

Ideally, the way incremental updates will be applied will follow how deletes 
are applied today:
* An update always applies to *all* documents that are flushed
* And to all documents currently in the RAM buffer
* But never to documents that are indexed later
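
One way to picture these three rules is with a single sequence counter shared by documents and updates, sketched below. This is not how IndexWriter tracks deletes; the UpdateTimeline class is hypothetical, and constraint matching is left out so that only the ordering rule is visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: documents and updates draw from one shared sequence counter.
final class UpdateTimeline {
    private long nextSeq = 0;
    private final List<Long> docSeqs = new ArrayList<>();

    // Indexes a document (flushed or still in RAM makes no difference here).
    long addDocument() {
        long seq = nextSeq++;
        docSeqs.add(seq);
        return seq;
    }

    // Issues an update and returns the sequence numbers of the documents it reaches:
    // everything indexed before the update, nothing indexed after it.
    List<Long> update() {
        long updateSeq = nextSeq++;
        List<Long> reached = new ArrayList<>();
        for (long docSeq : docSeqs) {
            if (docSeq < updateSeq) {
                reached.add(docSeq);
            }
        }
        return reached;
    }
}
```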

Again, this is an internal detail, and I'd appreciate it if someone could point 
us to where that happens in the code today (now with concurrent flushing). 
I remember PackedDeletes existed at some point; has that changed?

If it's a new Codec, then SegmentReader may not even need to change ...

The REPLACE_FIELD OP is tricky ... perhaps it's like how deletes are 
materialized on disk -- as a sparse bit vector that marks the documents that 
are no longer associated with it ...
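
A minimal sketch of that bit-vector idea, using java.util.BitSet in place of an on-disk format. The ReplaceFieldMask class and its parameter names are hypothetical; the assumption is that, for one term of the replaced field, the effective postings are the base postings minus the masked documents, plus whatever the stacked layer adds.

```java
import java.util.BitSet;

// Hypothetical helper: combines base postings, a replace mask, and stacked postings.
final class ReplaceFieldMask {
    // basePostings:    docs containing the term in the base layer
    // replacedDocs:    docs whose field was replaced (their base values no longer count)
    // stackedPostings: docs containing the term in the stacked layer
    static BitSet effectiveDocs(BitSet basePostings, BitSet replacedDocs, BitSet stackedPostings) {
        BitSet result = (BitSet) basePostings.clone(); // leave the inputs untouched
        result.andNot(replacedDocs);                   // drop base entries of replaced docs
        result.or(stackedPostings);                    // add entries from the stacked layer
        return result;
    }
}
```

The mask is per field rather than per term in this sketch, so replacing a field on one document hides all of that document's base-layer terms for the field at once.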

I also think that we should introduce this feature in steps:
# Support only fields that omit TFAP (i.e. DOCS_ONLY). This is very valuable 
for fields like ACL, TAGS, CATEGORIES etc.
** Ideally, the app would just need to say "add/remove ACL:SHAI to/from 
document X", rather than passing the entire list of ACLs on every update 
operation.
** This I believe is also the most common use case for incremental field updates
# Support stored fields, whether as part of (1) or as a follow-on; adding 
TAG:LUCENE to the postings but not to the stored fields is limiting ...
# Support terms with positions, but no norms. What I'm thinking about are terms 
that store stuff in the payload, but don't care about the positions themselves. 
An example is the category dimensions of the facet module, which stores 
category ordinals in the payload.
#* Positions are tricky, and we'll need to do this carefully, I know. But I 
don't rule it out at this point.
# Then, support fields with norms. I get your concern, Robert, and I agree it's 
a challenge, which is why I leave it for last. The scenario I have in mind is: a 
search engine that lets you comment on a result or tag it, and the comment/tag 
should be added to the document's 'catchall' field for later searches. I think 
it's a valuable scenario, and this is something I'd like to support. If we 
cannot find a way to deal with it and the norms, then I see two options:
## Document a limitation to updating a field with norms, at your own risk.
## Enforce REPLACE_FIELD OP on fields with norms.

* Since norms are under DocValues now, maybe that's solvable, I don't know. At 
the moment I think that we have a lot to do before we worry about norms ...

* I also think that we should start with the simpler ADD_FIELD operation, and 
not support REMOVE_FIELD ... really to keep things simple at the start.

I suggest we do this work in a dedicated branch of course. Ideally, we can port 
everything to 4.x at some point, as I think most of the changes are internal 
details ...
                
> Incremental Field Updates through Stacked Segments
> --------------------------------------------------
>
>                 Key: LUCENE-4258
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4258
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Sivan Yogev
>   Original Estimate: 2,520h
>  Remaining Estimate: 2,520h
>
> Shai and I would like to start working on the proposal to Incremental Field 
> Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

