[ https://issues.apache.org/jira/browse/LUCENE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865896#action_12865896 ]
Michael McCandless commented on LUCENE-2450:
--------------------------------------------

Another benefit of the stage model: you can just stack stages onto the
end of an existing pipeline to change things up.

Ie, you don't need to "own" the AttrFactory of the whole chain, eg to
make sure certain specific impls are used for certain attrs. If you
want/need a different attr impl, the stage just removes the last one
and binds its own impl -- every stage has full freedom to alter the
attr bindings visible to stages after it.

> Explore write-once attr bindings in the analysis chain
> ------------------------------------------------------
>
>                 Key: LUCENE-2450
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2450
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael McCandless
>         Attachments: LUCENE-2450.patch, pipeline.py
>
>
> I'd like to propose a new means of tracking attrs through the analysis
> chain, whereby a given stage in the pipeline cannot overwrite attrs
> from stages before it (write once). It can only write to new attrs
> (possibly w/ the same name) that future stages can see; it can never
> alter the attrs or bindings from the prior stages.
>
> I coded up a prototype chain in python (I'll attach), showing the
> equivalent of WhitespaceTokenizer -> StopFilter -> SynonymFilter ->
> Indexer.
>
> Each stage "sees" a frozen namespace of attr bindings as its input;
> these attrs are all read-only from its standpoint. Then, it writes to
> an "output namespace", which is read/write, eg it can add new attrs,
> remove attrs from its input, change the values of attrs. If that
> stage doesn't alter a given attr it "passes through", unchanged.
>
> This would be an enormous change to how attrs are managed... so this
> is very very exploratory at this point.
> Once we decouple indexer from analysis, creating such an alternate
> chain should be possible -- it'd at least be a good test that we've
> decoupled enough :)
>
> I think the idea offers some compelling improvements over the "global
> read/write namespace" (AttrFactory) approach we have today:
>
>   * Injection filters can be more efficient -- they need not
>     capture/restoreState at all
>
>   * No more need for the initial tokenizer to "clear all attrs" --
>     each stage becomes responsible for clearing the attrs it "owns"
>
>   * You can truly stack stages (vs having to make a custom
>     AttrFactory) -- eg you could make a Bocu1 stage which can stack
>     onto any other stage. It'd look up the CharTermAttr, remove it
>     from its output namespace, and add a BytesRefTermAttr.
>
>   * Indexer should be more efficient, in that it doesn't need to
>     re-get the attrs on each next() -- it gets them up front, and
>     re-uses them.
>
> Note that in this model, the indexer itself is just another stage in
> the pipeline, so you could do some wild things like use 2 indexer
> stages (writing to different indexes, or maybe the same index but
> somehow with further processing or something).
>
> Also, in this approach, the analysis chain is more informed about
> what each stage is allowed to change, up front after the chain is
> created. EG (say) we will know that only 2 stages write to the term
> attr, and that only 1 writes posIncr/offset attrs, etc. Not sure
> if/how this helps us... but it's more strongly typed than what we have
> today.
>
> I think we could use a similar chain for processing a document at the
> field level, ie, different stages could add/remove/change different
> fields in the doc....
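
For anyone reading without the attachment, here is a rough Python sketch of the write-once binding idea described above. It is not the attached pipeline.py, and none of it is Lucene API: Attr, Stage, Utf8TermStage, Collector, and the inputs/outputs names are all made up for illustration. Utf8TermStage stands in for the Bocu1 example (it swaps the "term" binding for a "bytes_term" binding), and Collector stands in for the indexer. Note that the sketch only freezes the bindings; it does not stop a stage from mutating an upstream Attr's value.

```python
from types import MappingProxyType


class Attr:
    """Hypothetical mutable holder for one attr value, owned by its creating stage."""

    def __init__(self, value=None):
        self.value = value


class Stage:
    """Hypothetical base stage: frozen input bindings, editable output bindings."""

    def __init__(self, prev=None):
        self.prev = prev
        upstream = prev.outputs if prev is not None else {}
        # Read-only snapshot of what the previous stage published.
        self.inputs = MappingProxyType(dict(upstream))
        # Starts as a copy, so untouched attrs pass through unchanged.
        self.outputs = dict(upstream)

    def next(self):
        raise NotImplementedError


class WhitespaceTokenizer(Stage):
    def __init__(self, text):
        super().__init__()
        self.tokens = iter(text.split())
        self.term = self.outputs["term"] = Attr()

    def next(self):
        token = next(self.tokens, None)  # built-in next(), not this method
        if token is None:
            return False
        self.term.value = token
        return True


class StopFilter(Stage):
    """Writes nothing: every input binding passes through as-is."""

    def __init__(self, prev, stopwords):
        super().__init__(prev)
        self.stopwords = frozenset(stopwords)
        self.term = self.inputs["term"]  # looked up once, reused per token

    def next(self):
        while self.prev.next():
            if self.term.value not in self.stopwords:
                return True
        return False


class Utf8TermStage(Stage):
    """Stackable stage: hides "term" from later stages, publishes "bytes_term"."""

    def __init__(self, prev):
        super().__init__(prev)
        self.term = self.inputs["term"]
        del self.outputs["term"]
        self.bytes_term = self.outputs["bytes_term"] = Attr()

    def next(self):
        if not self.prev.next():
            return False
        self.bytes_term.value = self.term.value.encode("utf-8")
        return True


class Collector(Stage):
    """Stand-in for the indexer: just another stage at the end of the chain."""

    def __init__(self, prev):
        super().__init__(prev)
        self.bytes_term = self.inputs["bytes_term"]  # fetched up front
        self.seen = []

    def run(self):
        while self.prev.next():
            self.seen.append(self.bytes_term.value)
        return self.seen


tokenizer = WhitespaceTokenizer("the quick brown fox")
chain = Collector(Utf8TermStage(StopFilter(tokenizer, {"the"})))
print(chain.run())  # [b'quick', b'brown', b'fox']
```

Because each stage's inputs is a mappingproxy, an attempt such as chain.inputs["term"] = Attr() raises TypeError, while the stage stays free to edit its own outputs. Since the bindings are fixed once the chain is built, you can also inspect up front which stage publishes which attr.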