[jira] Updated: (LUCENE-2450) Explore write-once attr bindings in the analysis chain
[ https://issues.apache.org/jira/browse/LUCENE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2450:

    Labels: gsoc2011 lucene-gsoc-11 mentor  (was: gsoc2011, lucene-gsoc-11 mentor,)

> Explore write-once attr bindings in the analysis chain
> ------------------------------------------------------
>
>          Key: LUCENE-2450
>          URL: https://issues.apache.org/jira/browse/LUCENE-2450
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Analysis
>     Reporter: Michael McCandless
>       Labels: gsoc2011, lucene-gsoc-11, mentor
>  Attachments: LUCENE-2450.patch, LUCENE-2450.patch, pipeline.py
>
>
> I'd like to propose a new means of tracking attrs through the analysis
> chain, whereby a given stage in the pipeline cannot overwrite attrs
> from stages before it (write once). It can only write to new attrs
> (possibly w/ the same name) that future stages can see; it can never
> alter the attrs or bindings from the prior stages.
> I coded up a prototype chain in python (I'll attach), showing the
> equivalent of WhitespaceTokenizer -> StopFilter -> SynonymFilter ->
> Indexer.
> Each stage "sees" a frozen namespace of attr bindings as its input;
> these attrs are all read-only from its standpoint. Then, it writes to
> an "output namespace", which is read/write, eg it can add new attrs,
> remove attrs from its input, change the values of attrs. If that
> stage doesn't alter a given attr it "passes through", unchanged.
> This would be an enormous change to how attrs are managed... so this
> is very very exploratory at this point. Once we decouple indexer from
> analysis, creating such an alternate chain should be possible -- it'd
> at least be a good test that we've decoupled enough :)
> I think the idea offers some compelling improvements over the "global
> read/write namespace" (AttrFactory) approach we have today:
> * Injection filters can be more efficient -- they need not
>   capture/restoreState at all
> * No more need for the initial tokenizer to "clear all attrs" --
>   each stage becomes responsible for clearing the attrs it "owns"
> * You can truly stack stages (vs having to make a custom
>   AttrFactory) -- eg you could make a Bocu1 stage which can stack
>   onto any other stage. It'd look up the CharTermAttr, remove it
>   from its output namespace, and add a BytesRefTermAttr.
> * Indexer should be more efficient, in that it doesn't need to
>   re-get the attrs on each next() -- it gets them up front, and
>   re-uses them.
> Note that in this model, the indexer itself is just another stage in
> the pipeline, so you could do some wild things like use 2 indexer
> stages (writing to different indexes, or maybe the same index but
> somehow with further processing or something).
> Also, in this approach, the analysis chain is more informed about
> what each stage is allowed to change, up front after the chain is
> created. EG (say) we will know that only 2 stages write to the term
> attr, and that only 1 writes posIncr/offset attrs, etc. Not sure
> if/how this helps us... but it's more strongly typed than what we have
> today.
> I think we could use a similar chain for processing a document at the
> field level, ie, different stages could add/remove/change different
> fields in the doc

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
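The frozen-input / writable-output idea in the issue description can be sketched roughly as follows. This is only an illustration, not the attached pipeline.py or the Java patch: the class names `Attr`, `ReadOnlyAttr` and `Stage`, and the methods `bind`, `unbind` and `outputs`, are all hypothetical, and the read-only guarantee here is by convention (Python cannot fully enforce it).

```python
from types import MappingProxyType


class Attr:
    """Writable attr cell; only the stage that created it holds this handle."""

    def __init__(self, value=None):
        self.value = value


class ReadOnlyAttr:
    """Read-only view of an Attr, handed to downstream stages (hypothetical)."""

    def __init__(self, attr):
        self._attr = attr

    @property
    def value(self):  # no setter, so assigning to .value raises AttributeError
        return self._attr.value


class Stage:
    """One pipeline stage: frozen input namespace, private output namespace."""

    def __init__(self, prev=None):
        # Frozen view of whatever the previous stage publishes.
        self.inputs = prev.outputs() if prev is not None else MappingProxyType({})
        # Attrs this stage does not touch pass through unchanged.
        self._outputs = dict(self.inputs)

    def bind(self, name):
        """Create a new attr under `name`, shadowing any upstream binding."""
        attr = Attr()
        self._outputs[name] = ReadOnlyAttr(attr)
        return attr

    def unbind(self, name):
        """Hide an upstream attr from later stages without altering it."""
        del self._outputs[name]

    def outputs(self):
        return MappingProxyType(self._outputs)


# Build the chain up front, then push one token through it.
tokenizer = Stage()
term = tokenizer.bind("term")
pos_incr = tokenizer.bind("posIncr")

lowercase = Stage(tokenizer)
lower_term = lowercase.bind("term")  # new binding that reuses the name

term.value, pos_incr.value = "Emergency", 1
lower_term.value = lowercase.inputs["term"].value.lower()

final = lowercase.outputs()
print(final["term"].value, final["posIncr"].value)  # emergency 1
print(tokenizer.outputs()["term"].value)            # Emergency
```

Because each stage looks up its attrs once, when the chain is built, a consumer at the end of the chain can hold on to the views in `final` and reuse them for every token, which is the "gets them up front" point from the list above.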
[jira] Updated: (LUCENE-2450) Explore write-once attr bindings in the analysis chain
[ https://issues.apache.org/jira/browse/LUCENE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2450:

    Labels: gsoc2011, lucene-gsoc-11 mentor,  (was: mentor)

> Explore write-once attr bindings in the analysis chain
> ------------------------------------------------------
>
>          Key: LUCENE-2450
>          URL: https://issues.apache.org/jira/browse/LUCENE-2450
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Analysis
>     Reporter: Michael McCandless
>       Labels: gsoc2011,, lucene-gsoc-11, mentor,
>  Attachments: LUCENE-2450.patch, LUCENE-2450.patch, pipeline.py
[jira] Updated: (LUCENE-2450) Explore write-once attr bindings in the analysis chain
[ https://issues.apache.org/jira/browse/LUCENE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2450:

    Labels: mentor  (was: gsoc2011 lucene-gsoc-11)

> Explore write-once attr bindings in the analysis chain
> ------------------------------------------------------
>
>          Key: LUCENE-2450
>          URL: https://issues.apache.org/jira/browse/LUCENE-2450
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Analysis
>     Reporter: Michael McCandless
>       Labels: mentor
>  Attachments: LUCENE-2450.patch, LUCENE-2450.patch, pipeline.py
[jira] Updated: (LUCENE-2450) Explore write-once attr bindings in the analysis chain
[ https://issues.apache.org/jira/browse/LUCENE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2450:
---------------------------------------

    Labels: gsoc2011 lucene-gsoc-11  (was: )

> Explore write-once attr bindings in the analysis chain
> ------------------------------------------------------
>
>          Key: LUCENE-2450
>          URL: https://issues.apache.org/jira/browse/LUCENE-2450
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Analysis
>     Reporter: Michael McCandless
>       Labels: gsoc2011, lucene-gsoc-11
>  Attachments: LUCENE-2450.patch, LUCENE-2450.patch, pipeline.py
[jira] Updated: (LUCENE-2450) Explore write-once attr bindings in the analysis chain
[ https://issues.apache.org/jira/browse/LUCENE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2450:
---------------------------------------

    Attachment: LUCENE-2450.patch

New patch attached.

This patch adds a new pipeline stage called AppendingStage. You provide it multiple things to analyze (currently as a String[], but we can generalize that), and it will step through them one at a time, logically appending their tokens. You also give it posIncrGap and offsetGap, which it adds in on switching to the next field.

I think this is a compelling way to handle fields with multiple values, and it can make our "decouple indexing from analysis" effort even stronger. Ie, today indexer is hardwired to call analyzer's getPositionIncrementGap/getOffsetGap. But with this AppendingStage approach, how multi-valued fields are appended is purely an analysis detail, hidden from the indexer. EG you could make a stage that inserts some kind of marker token on each field transition, instead.

And since it's a fully pluggable stage, you're free to move it anywhere (beginning, middle, end) in your pipeline.

> Explore write-once attr bindings in the analysis chain
> ------------------------------------------------------
>
>          Key: LUCENE-2450
>          URL: https://issues.apache.org/jira/browse/LUCENE-2450
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Analysis
>     Reporter: Michael McCandless
>  Attachments: LUCENE-2450.patch, LUCENE-2450.patch, pipeline.py
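A rough sketch of what an appending stage like the AppendingStage described in this message might do, written in Python rather than as the Java stage from the patch. The function names, the whitespace tokenizer, the token tuple layout, and the exact rule for advancing the offset base (length of the previous value plus offsetGap) are assumptions made for illustration; the patch may handle these differently.

```python
import re


def whitespace_tokens(text):
    """Yield (term, pos_incr, start_offset, end_offset) per whitespace-separated token."""
    for match in re.finditer(r"\S+", text):
        yield match.group(), 1, match.start(), match.end()


def appending_stage(values, pos_incr_gap, offset_gap):
    """Logically append the tokens of several values, adding gaps between values."""
    offset_base = 0   # shifts the offsets of the current value
    pending_gap = 0   # position gap owed to the next emitted token
    for index, text in enumerate(values):
        if index > 0:
            pending_gap += pos_incr_gap
        for term, pos_incr, start, end in whitespace_tokens(text):
            yield term, pos_incr + pending_gap, offset_base + start, offset_base + end
            pending_gap = 0
        # Assumed rule: the next value starts after this one plus the offset gap.
        offset_base += len(text) + offset_gap


for token in appending_stage(["a b", "c"], pos_incr_gap=100, offset_gap=1):
    print(token)
# ('a', 1, 0, 1)
# ('b', 1, 2, 3)
# ('c', 101, 4, 5)
```

A consumer of these tokens never needs to know that there were two values or what the gaps were, which is the sense in which the appending logic becomes an analysis detail.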
[jira] Updated: (LUCENE-2450) Explore write-once attr bindings in the analysis chain
[ https://issues.apache.org/jira/browse/LUCENE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2450:
---------------------------------------

    Attachment: LUCENE-2450.patch

OK I ported it (roughly) to Java -- gonna need some serious Uwe help to get the generics right :)

I also made a simplistic example (TestStages)... it just analyzes a canned sentence (same as above) using WhitespaceTokenizer, LowercaseFilter, StopFilter, SynonymFilter (borrowed from LIA2), all converted from TokenStream to Stage, the class that impls a single stage of the pipeline.

That test also does a simplistic perf test -- analyzing that canned text many times with that pipeline -- and the write-once attr bindings get a ~9% speedup.

> Explore write-once attr bindings in the analysis chain
> ------------------------------------------------------
>
>          Key: LUCENE-2450
>          URL: https://issues.apache.org/jira/browse/LUCENE-2450
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Analysis
>     Reporter: Michael McCandless
>  Attachments: LUCENE-2450.patch, pipeline.py
[jira] Updated: (LUCENE-2450) Explore write-once attr bindings in the analysis chain
[ https://issues.apache.org/jira/browse/LUCENE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2450:
---------------------------------------

    Attachment: pipeline.py

Attached patch. Run it like this: {{python pipeline.py}}, and it analyzes the silly sentence "this is a test of the emergency broadcast system", producing this output:

{noformat}
TERM=test pos=3
TERM=emergency pos=6
TERM=911 pos=6
TERM=broadcast pos=7
TERM=television pos=7
TERM=tv pos=7
TERM=system pos=8
done!
{noformat}

It's very much just a prototype -- I cheat in certain places (eg I don't have strongly typed attrs, just a single Anything class) -- but the gist of the idea is visible.

> Explore write-once attr bindings in the analysis chain
> ------------------------------------------------------
>
>          Key: LUCENE-2450
>          URL: https://issues.apache.org/jira/browse/LUCENE-2450
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Analysis
>     Reporter: Michael McCandless
>  Attachments: pipeline.py
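For readers without the attachment, the output quoted in this message can be reproduced with a small tokenizer -> stop filter -> synonym filter chain. This is not pipeline.py: it uses plain generators instead of write-once attr namespaces, and the stop-word set and synonym map below are guesses inferred from the printed terms and positions.

```python
# Assumed: inferred from which terms are missing in the quoted output.
STOP_WORDS = {"this", "is", "a", "of", "the"}
# Assumed: inferred from the extra terms sharing a position in the quoted output.
SYNONYMS = {"emergency": ["911"], "broadcast": ["television", "tv"]}


def tokenize(text):
    """Whitespace tokenizer: every token advances the position by one."""
    for term in text.split():
        yield term, 1


def stop_filter(tokens):
    """Drop stop words, folding their position increments into the next kept token."""
    skipped = 0
    for term, pos_incr in tokens:
        if term in STOP_WORDS:
            skipped += pos_incr
        else:
            yield term, pos_incr + skipped
            skipped = 0


def synonym_filter(tokens):
    """Inject synonyms at the same position as the original term (pos_incr 0)."""
    for term, pos_incr in tokens:
        yield term, pos_incr
        for synonym in SYNONYMS.get(term, ()):
            yield synonym, 0


def analyze(text):
    """Run the chain and format each token the way the quoted output does."""
    position = -1
    lines = []
    for term, pos_incr in synonym_filter(stop_filter(tokenize(text))):
        position += pos_incr
        lines.append(f"TERM={term} pos={position}")
    return lines


for line in analyze("this is a test of the emergency broadcast system"):
    print(line)
print("done!")
```

The injected synonyms carry a position increment of 0, which is why "911" shares position 6 with "emergency" and "television" and "tv" share position 7 with "broadcast".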