[
https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423139#comment-13423139
]
Shai Erera commented on LUCENE-4258:
------------------------------------
There is more to it than just the referenced email. I've had a couple of
discussions in the past about this with various people (and it is my fault that
I didn't write them down and share them with the rest of you) -- I'll try to
summarize a more detailed proposal below:
*API*
Add an updateFields method which takes a Constraint and OP (eventually, it
might replace today's updateDocument):
* Constraint defines 'which documents' should be updated, and follows today's
deleteDocument API (takes Term, Query and arrays of each)
* OP defines the actual update to do on those documents:
** It has a TYPE, with 3 values (at least for now):
**# REPLACE_DOC -- replaces an entire document (essentially what updateDocument
is today)
**# UPDATE_FIELD -- incrementally update a field
**# REPLACE_FIELD -- replaces a field entirely
** In addition, it takes a Field[] (or Iterable) to remove/add.
** In light of the recent changes to IndexableField and friends, perhaps what
it should take is a concrete UpdateField with a boolean specifying whether to
add/remove its content. Suggestions are welcome !
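
To make the shape of the proposed call concrete, here is a minimal in-memory
sketch. Everything in it is hypothetical: the class UpdateFieldsSketch, the
OpType enum, the UpdateField class and the single-term constraint are
illustration only, not existing Lucene API, and documents are modelled as a
plain map from field name to a set of values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory model of updateFields(Constraint, OP); not Lucene API. */
final class UpdateFieldsSketch {
    /** The three OP types named in the proposal. */
    enum OpType { REPLACE_DOC, UPDATE_FIELD, REPLACE_FIELD }

    /** One field value to add or remove (the "UpdateField with a boolean" idea). */
    static final class UpdateField {
        final String name;
        final String value;
        final boolean add;

        UpdateField(String name, String value, boolean add) {
            this.name = name;
            this.value = value;
            this.add = add;
        }
    }

    /** Each document is modelled as field name -> sorted set of values. */
    private final List<Map<String, Set<String>>> docs = new ArrayList<>();

    /** Adds a document built from the given field values and returns its index. */
    int addDocument(UpdateField... fields) {
        Map<String, Set<String>> doc = new HashMap<>();
        for (UpdateField f : fields) {
            doc.computeIfAbsent(f.name, k -> new TreeSet<>()).add(f.value);
        }
        docs.add(doc);
        return docs.size() - 1;
    }

    Set<String> values(int doc, String field) {
        return docs.get(doc).getOrDefault(field, new TreeSet<>());
    }

    /**
     * The constraint is reduced to a single term (termField:termValue); the
     * proposal also allows queries and arrays. Returns the number of documents updated.
     */
    int updateFields(String termField, String termValue, OpType type, UpdateField... fields) {
        int updated = 0;
        for (Map<String, Set<String>> doc : docs) {
            Set<String> matchValues = doc.get(termField);
            if (matchValues == null || !matchValues.contains(termValue)) {
                continue; // constraint does not select this document
            }
            if (type == OpType.REPLACE_DOC) {
                doc.clear(); // drop the whole document, then re-add below
            } else if (type == OpType.REPLACE_FIELD) {
                for (UpdateField f : fields) {
                    doc.remove(f.name); // drop the old content of each named field
                }
            }
            for (UpdateField f : fields) {
                if (f.add) {
                    doc.computeIfAbsent(f.name, k -> new TreeSet<>()).add(f.value);
                } else if (doc.containsKey(f.name)) {
                    doc.get(f.name).remove(f.value);
                }
            }
            updated++;
        }
        return updated;
    }
}
```

With this model, "add ACL:ROBERT and remove ACL:SHAI on the document with id:1"
is a single UPDATE_FIELD call carrying two UpdateField entries.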
*Implementation*
The idea is to create StackedSegments, which, well, stack on top of current
segments. The inspiration came from deletes, which can be viewed as a segment
stacked on an existing segment, that marks which documents are deleted.
Following that semantics, a segment could be comprised of these files:
* Layer 1: _0.prx, _0.fnm, _0.fdt ...
* Layer 2: _0_1.prx, _0_1.fdt (no updates to .fnm) -- override/merge info from
layer 1
* Layer 3: _0_2.prx -- override/merge info from layer 2
* Layer 4: _0_1.del -- deletes are *always* the last layer, regardless of
their 'layer id' -- _0_1.del overrides everything, even _0_100.prx.
** And they can be stacked on themselves as today, e.g. _0_2.del etc.
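
The layering rule above can be sketched as a small ordering helper. This is
only an illustration under assumptions: the class StackedLayerOrder and its
methods are hypothetical, the layer id is read as a decimal number after the
second underscore, and file names are assumed to look like the examples in the
list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: orders a stacked segment's files into apply order. */
final class StackedLayerOrder {
    /** Layer id parsed from a name like "_0_2.prx"; the base layer "_0.prx" is 0. */
    static int layerId(String fileName) {
        String stem = fileName.substring(0, fileName.lastIndexOf('.'));
        int sep = stem.indexOf('_', 1); // second underscore, if any
        // Assumption: ids are decimal, as in the examples above.
        return sep < 0 ? 0 : Integer.parseInt(stem.substring(sep + 1));
    }

    static boolean isDeletes(String fileName) {
        return fileName.endsWith(".del");
    }

    /** Update layers by ascending id first, then all delete layers by ascending id. */
    static List<String> applyOrder(List<String> files) {
        List<String> ordered = new ArrayList<>(files);
        ordered.sort((a, b) -> {
            if (isDeletes(a) != isDeletes(b)) {
                return isDeletes(a) ? 1 : -1; // deletes always come last
            }
            int byLayer = Integer.compare(layerId(a), layerId(b));
            return byLayer != 0 ? byLayer : a.compareTo(b);
        });
        return ordered;
    }
}
```

Sorting deletes after every update layer is what lets _0_1.del override
_0_100.prx even though its layer id is smaller.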
I believe that we'll need an UpdateCodec or something ... this is the part of
the internal API that we still need to understand better. Help from folks like
you, Robert, will be greatly appreciated!
Two options to encode the posting lists:
* A single signed list: field:value --> +1, -5, +8, +12, -17 ... (simple, but
cannot be encoded efficiently)
* Two separate terms:
*# +field:value --> 1, 8, 12
*# -field:value --> 5, 17
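
As a rough illustration, both encodings resolve to the same set of documents
once applied on top of a lower layer. The sketch below is hypothetical
(StackedPostingsSketch and its methods are not Lucene API); it models the
signed list as parallel arrays of doc ids and add/remove flags, and it ignores
on-disk encoding entirely.

```java
import java.util.TreeSet;

/** Hypothetical model of applying one update layer to a base posting list. */
final class StackedPostingsSketch {
    private static TreeSet<Integer> toSet(int[] docs) {
        TreeSet<Integer> set = new TreeSet<>();
        for (int doc : docs) {
            set.add(doc);
        }
        return set;
    }

    /** Single list: entry i adds docs[i] when isAdd[i] is true, else removes it. */
    static TreeSet<Integer> applySigned(int[] base, int[] docs, boolean[] isAdd) {
        TreeSet<Integer> result = toSet(base);
        for (int i = 0; i < docs.length; i++) {
            if (isAdd[i]) {
                result.add(docs[i]);
            } else {
                result.remove(docs[i]);
            }
        }
        return result;
    }

    /** Two terms: one list of added docs, one list of removed docs. */
    static TreeSet<Integer> applySplit(int[] base, int[] plus, int[] minus) {
        TreeSet<Integer> result = toSet(base);
        result.addAll(toSet(plus));
        result.removeAll(toSet(minus));
        return result;
    }
}
```

The sketch assumes a document never appears in both the plus and the minus
list of the same layer; if it could, the order of entries in the signed list
would matter and the two encodings would no longer be interchangeable.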
Ideally, the way incremental updates will be applied will follow how deletes
are applied today:
* An update always applies to *all* documents that are flushed
* And to all documents currently in the RAM buffer
* But never to documents that are indexed later
Again, this is an internal detail, and I'd appreciate it if someone could point
us to where that happens in the code today (now with concurrent flushing).
I remember PackedDeletes existed at some point; has that changed?
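
One possible way to get the three visibility rules above is to record, with
each buffered update, how many documents were in the RAM buffer when it
arrived, and to apply it only up to that point at flush time. The toy writer
below is a sketch under that assumption; UpdateVisibilitySketch and all of its
members are hypothetical, it is single-threaded, and it is not a description of
how Lucene applies deletes internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy writer; models only the ordering rule, not real flushing. */
final class UpdateVisibilitySketch {
    /** A buffered "add this tag" update and the buffer size when it arrived. */
    private static final class PendingUpdate {
        final String tag;
        final int docUpto;

        PendingUpdate(String tag, int docUpto) {
            this.tag = tag;
            this.docUpto = docUpto;
        }
    }

    private final List<Set<String>> flushedDocs = new ArrayList<>();
    private final List<Set<String>> bufferedDocs = new ArrayList<>();
    private final List<PendingUpdate> pending = new ArrayList<>();

    /** Buffers a new, empty document and returns its tag set for inspection. */
    Set<String> addDocument() {
        Set<String> tags = new TreeSet<>();
        bufferedDocs.add(tags);
        return tags;
    }

    void addTagToAll(String tag) {
        for (Set<String> doc : flushedDocs) {
            doc.add(tag); // applies to all flushed documents
        }
        // Remember how many buffered documents existed at this moment.
        pending.add(new PendingUpdate(tag, bufferedDocs.size()));
    }

    void flush() {
        for (PendingUpdate u : pending) {
            for (int i = 0; i < u.docUpto; i++) {
                bufferedDocs.get(i).add(u.tag); // never reaches later documents
            }
        }
        flushedDocs.addAll(bufferedDocs);
        bufferedDocs.clear();
        pending.clear();
    }
}
```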
If it's a new Codec, then SegmentReader may not even need to change ...
The REPLACE_FIELD OP is tricky ... perhaps it's like how deletes are
materialized on disk -- as a sparse bit vector that marks the documents that
are no longer associated with it ...
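
A minimal sketch of that masking idea follows, assuming one posting list per
term and one mask per replaced field. The class ReplaceFieldMaskSketch is
hypothetical, and it uses a dense java.util.BitSet for brevity rather than a
sparse structure.

```java
import java.util.BitSet;

/** Hypothetical model of REPLACE_FIELD as a mask over a lower layer's postings. */
final class ReplaceFieldMaskSketch {
    static BitSet bits(int... docs) {
        BitSet set = new BitSet();
        for (int doc : docs) {
            set.set(doc);
        }
        return set;
    }

    /**
     * Docs matching a term after a REPLACE_FIELD layer: the base postings with
     * the replaced docs masked out, plus the postings written by the new layer.
     */
    static BitSet resolve(BitSet basePostings, BitSet replacedDocs, BitSet layerPostings) {
        BitSet result = (BitSet) basePostings.clone();
        result.andNot(replacedDocs); // old field content no longer counts
        result.or(layerPostings);    // new field content does
        return result;
    }
}
```

For example, if docs 1 and 2 had the field replaced and only doc 2 kept the
term in its new content, a base posting list of docs 0, 1, 2 resolves to docs
0 and 2.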
I also think that we should introduce this feature in steps:
# Support only fields that omit TFAP (i.e. DOCS_ONLY). This is very valuable
for fields like ACL, TAGS, CATEGORIES etc.
** Ideally, the app would just need to say "add/remove ACL:SHAI to/from
document X", rather than passing the entire list of ACLs on every update
operation.
** This I believe is also the most common use case for incremental field updates
# Support stored fields, whether as part of (1) or as a follow-on. Adding
TAG:LUCENE to the postings, but not to the stored fields, is limiting ...
# Support terms with positions, but no norms. What I'm thinking about are terms
that store stuff in the payload, but don't care about the positions themselves.
An example is the category dimensions of the facet module, which stores
category ordinals in the payload.
#* Positions are tricky, and we'll need to do this carefully, I know. But I
don't rule it out at this point.
# Then, support fields with norms. I get your concern, Robert, and I agree it's
a challenge, which is why I leave it for last. The scenario I have in mind is: a
search engine that lets you comment on a result or tag it, and the comment/tag
should be added to the document's 'catchall' field for later searches. I think
it's a valuable scenario, and this is something I'd like to support. If we
cannot find a way to deal with it and the norms, then I see two options:
## Document a limitation to updating a field with norms, at your own risk.
## Enforce REPLACE_FIELD OP on fields with norms.
* Since norms are under DocValues now, maybe that's solvable, I don't know. At
the moment I think that we have a lot to do before we worry about norms ...
* I also think that we should start with the simpler ADD_FIELD operation, and
not support REMOVE_FIELD ... really to keep things simple at the start.
I suggest we do this work in a dedicated branch of course. Ideally, we can port
everything to 4.x at some point, as I think most of the changes are internal
details ...
> Incremental Field Updates through Stacked Segments
> --------------------------------------------------
>
> Key: LUCENE-4258
> URL: https://issues.apache.org/jira/browse/LUCENE-4258
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Sivan Yogev
> Original Estimate: 2,520h
> Remaining Estimate: 2,520h
>
> Shai and I would like to start working on the proposal to Incremental Field
> Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).