Explore write-once attr bindings in the analysis chain
------------------------------------------------------

                 Key: LUCENE-2450
                 URL: https://issues.apache.org/jira/browse/LUCENE-2450
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Michael McCandless


I'd like to propose a new means of tracking attrs through the analysis
chain, whereby a given stage in the pipeline cannot overwrite attrs
from stages before it (write once).  It can only write to new attrs
(possibly w/ the same name) that future stages can see; it can never
alter the attrs or bindings from the prior stages.

I coded up a prototype chain in python (I'll attach), showing the
equivalent of WhitespaceTokenizer -> StopFilter -> SynonymFilter ->
Indexer.

Each stage "sees" a frozen namespace of attr bindings as its input;
these attrs are all read-only from its standpoint.  It then writes to
an "output namespace", which is read/write: eg it can add new attrs,
drop attrs it inherited from its input, or bind a new attr (with a new
value) under an existing name.  Any attr the stage doesn't alter
"passes through" unchanged.
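
To make the input/output namespace split concrete, here is a minimal
Python sketch.  It is not the attached prototype; Attr, Stage and
LowerCaseStage are hypothetical names invented for illustration, and
only the bindings are frozen here (treating the upstream attr values as
read-only is left to convention).

```python
from types import MappingProxyType


class Attr:
    """Mutable value holder; hypothetical stand-in for a Lucene attribute."""

    def __init__(self, value=None):
        self.value = value


class Stage:
    """One pipeline stage: frozen input bindings, writable output bindings."""

    def __init__(self, input_bindings):
        # Read-only view: this stage cannot rebind or remove upstream names.
        self.inputs = MappingProxyType(dict(input_bindings))
        # Output starts as a copy of the input bindings, so anything the
        # stage leaves alone passes through unchanged.
        self.outputs = dict(self.inputs)


class LowerCaseStage(Stage):
    """Binds a new attr under the existing name "term" in its output."""

    def __init__(self, input_bindings):
        super().__init__(input_bindings)
        self.term_in = self.inputs["term"]    # read from upstream
        self.term_out = Attr()                # owned by this stage
        self.outputs["term"] = self.term_out  # same name, new attr

    def process(self):
        self.term_out.value = self.term_in.value.lower()


upstream = {"term": Attr("The"), "posIncr": Attr(1)}
stage = LowerCaseStage(upstream)
stage.process()

# Rebinding a name in the frozen input namespace is rejected.
try:
    stage.inputs["term"] = Attr("oops")
    rebind_blocked = False
except TypeError:
    rebind_blocked = True
```

After process() the upstream term attr still holds its original value,
the stage's output namespace holds the lowercased one under the same
name, and posIncr is the very same object in both namespaces.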

This would be an enormous change to how attrs are managed... so this
is very very exploratory at this point.  Once we decouple indexer from
analysis, creating such an alternate chain should be possible -- it'd
at least be a good test that we've decoupled enough :)

I think the idea offers some compelling improvements over the "global
read/write namespace" (AttrFactory) approach we have today:

  * Injection filters can be more efficient -- they need not
    capture/restoreState at all

  * No more need for the initial tokenizer to "clear all attrs" --
    each stage becomes responsible for clearing the attrs it "owns"

  * You can truly stack stages (vs having to make a custom
    AttrFactory) -- eg you could make a Bocu1 stage which can stack
    onto any other stage.  It'd look up the CharTermAttr, remove it
    from its output namespace, and add a BytesRefTermAttr.

  * Indexer should be more efficient, in that it doesn't need to
    re-get the attrs on each next() -- it gets them up front, and
    re-uses them.
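
As a rough illustration of the first and last points above, the sketch
below injects synonyms without capturing or restoring any state, and
has the indexer look up its attrs once.  All class names, the next()
protocol and the posIncr=0 convention for injected tokens are
assumptions made for this example; none of it is existing Lucene API
or the attached prototype.

```python
class Attr:
    """Mutable value holder; hypothetical stand-in for a Lucene attribute."""

    def __init__(self, value=None):
        self.value = value


class WhitespaceSource:
    """First stage: the only writer of the attrs it owns."""

    def __init__(self, text):
        self.words = iter(text.split())
        self.term = Attr()
        self.pos_incr = Attr()
        self.outputs = {"term": self.term, "posIncr": self.pos_incr}

    def next(self):
        word = next(self.words, None)
        if word is None:
            return False
        self.term.value = word
        self.pos_incr.value = 1
        return True


class SynonymStage:
    """Injects synonyms by writing only to attrs it owns."""

    def __init__(self, prev, synonyms):
        self.prev = prev
        self.synonyms = synonyms
        # Input attrs: looked up once, treated as read-only.
        self.term_in = prev.outputs["term"]
        self.pos_in = prev.outputs["posIncr"]
        # New attrs bound under the same names for downstream stages.
        self.term_out = Attr()
        self.pos_out = Attr()
        self.outputs = dict(prev.outputs, term=self.term_out, posIncr=self.pos_out)
        self.pending = []

    def next(self):
        if self.pending:
            # Emitting a synonym leaves the upstream attrs untouched,
            # so there is no state to capture or restore.
            self.term_out.value = self.pending.pop(0)
            self.pos_out.value = 0
            return True
        if not self.prev.next():
            return False
        self.term_out.value = self.term_in.value
        self.pos_out.value = self.pos_in.value
        self.pending = list(self.synonyms.get(self.term_in.value, ()))
        return True


class ListIndexer:
    """Final stage: gets its attrs up front and reuses them per token."""

    def __init__(self, prev):
        self.prev = prev
        self.term = prev.outputs["term"]
        self.pos_incr = prev.outputs["posIncr"]
        self.postings = []

    def run(self):
        position = -1
        while self.prev.next():
            position += self.pos_incr.value
            self.postings.append((self.term.value, position))
        return self.postings


source = WhitespaceSource("fast car")
chain = ListIndexer(SynonymStage(source, {"fast": ["quick", "rapid"]}))
postings = chain.run()
```

Here postings ends up with "fast", "quick" and "rapid" all at position
0 and "car" at position 1.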

Note that in this model, the indexer itself is just another stage in
the pipeline, so you could do some wild things like use 2 indexer
stages (writing to different indexes, or maybe to the same index after
some further processing).

Also, in this approach, the analysis chain knows more about what each
stage is allowed to change, up front, once the chain is
created.  EG (say) we will know that only 2 stages write to the term
attr, and that only 1 writes posIncr/offset attrs, etc.  Not sure
if/how this helps us... but it's more strongly typed than what we have
today.
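
One way this up-front knowledge might look, as a hedged Python sketch:
each stage declares the attr names it writes when the chain is built,
and the chain can then report the writers of each attr before any
token flows.  The Stage class, the writes parameter and the particular
write sets listed are all illustrative assumptions, not a description
of what the real filters do.

```python
class Stage:
    """Hypothetical stage that declares up front which attr names it writes."""

    def __init__(self, name, writes=()):
        self.name = name
        self.writes = frozenset(writes)


def writers_by_attr(stages):
    """Map each attr name to the stages that write it, in chain order."""
    writers = {}
    for stage in stages:
        for attr in sorted(stage.writes):
            writers.setdefault(attr, []).append(stage.name)
    return writers


# Illustrative write sets only.
chain = [
    Stage("WhitespaceTokenizer", writes={"term", "posIncr", "offset"}),
    Stage("StopFilter", writes={"posIncr"}),
    Stage("SynonymFilter", writes={"term", "posIncr"}),
    Stage("Indexer"),
]
report = writers_by_attr(chain)
```

With these made-up declarations, report shows two writers for term and
a single writer for offset, all known as soon as the chain is created.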

I think we could use a similar chain for processing a document at the
field level, ie, different stages could add/remove/change different
fields in the doc....



