[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735124#action_12735124 ]

Robert Muir edited comment on LUCENE-1693 at 7/24/09 12:08 PM:
---
{quote}
I completely agree. We should discourage users from referencing Token and rather have them use the Attribute interfaces. That's the whole beauty and flexibility of this new API.
{quote}
Has there been any thought about reconsidering the new API's experimental status, then? I don't think the WARNING encourages users to use these interfaces!

Or maybe a compromise: modify the javadocs to be a little less scary. Does this text have to be FF (red)?

was (Author: rcmuir):
{quote}
I completely agree. We should discourage users from referencing Token and rather have them use the Attribute interfaces. That's the whole beauty and flexibility of this new API.
{quote}
Has there been any thought about reconsidering the new API's experimental status, then? I don't think the WARNING encourages users to use these interfaces!

AttributeSource/TokenStream API improvements

Key: LUCENE-1693
URL: https://issues.apache.org/jira/browse/LUCENE-1693
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 2.9
Attachments: lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, PerfTest3.java, TestAPIBackwardsCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java

This patch makes the following improvements to AttributeSource and TokenStream/Filter:

- introduces interfaces for all Attributes.
The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.
- new method added to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection, it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface-instance mappings to the attribute map for each of the found interfaces.
- removes the set/getUseNewAPI() methods (including the standard ones). Instead it is now enough to implement only the new API; if an old TokenStream still implements the old API (next()/next(Token)), it is wrapped automatically. The delegation path is determined via reflection (the patch determines which of the three methods was overridden).
- Token is no longer deprecated; instead it implements all 6 standard token interfaces (see above). The wrapper for next() and next(Token) uses this to automatically map all attribute interfaces to one TokenWrapper instance (implementing all 6 interfaces) that contains a Token instance. next() and next(Token) exchange the inner Token instance as needed. For the new incrementToken(), only one TokenWrapper instance is visible, delegating to the current reusable Token. This approach also preserves custom Token subclasses that may be created by very special token streams (see example in the backwards-compatibility test).
- AttributeImpl now has a default implementation of toString() that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.
- Cloning is now done much more efficiently in captureState.
The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a singly linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attributes that the source stream (e.g. SinkTokenizer) uses.
- Tee- and SinkTokenizer were deprecated, because they use Token instances for caching. This is not compatible with the new API using AttributeSource.State objects. You can still use the old deprecated ones, but new features provided by new Attribute types may get lost in the chain. A replacement is the new TeeSinkTokenFilter, which has a factory to create new Sink instances that have compatible attributes.

The cloning performance can be greatly improved if multiple AttributeImpl instances are not used in one TokenStream. A user can e.g. simply add a Token instance to the stream instead of the individual attributes, or implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed.
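The mechanics described above (the reflection-based interface walk in addAttributeImpl and the captureState/restoreState linked-list cloning) can be sketched with a minimal, self-contained stand-in. All class names here mirror the description but are hypothetical simplifications, not the real Lucene classes:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Marker interface standing in for org.apache.lucene.util.Attribute.
interface Attribute {}

// Stand-in for AttributeImpl: cloneable and copyable.
abstract class AttributeImpl implements Cloneable {
    public abstract void copyTo(AttributeImpl target);

    @Override
    public AttributeImpl clone() {
        try { return (AttributeImpl) super.clone(); }
        catch (CloneNotSupportedException e) { throw new AssertionError(e); }
    }
}

interface TermAttribute extends Attribute {
    String term();
    void setTerm(String t);
}

class TermAttributeImpl extends AttributeImpl implements TermAttribute {
    private String term = "";
    public String term() { return term; }
    public void setTerm(String t) { term = t; }
    public void copyTo(AttributeImpl target) { ((TermAttributeImpl) target).setTerm(term); }
}

// Minimal stand-in for the AttributeSource mechanics described above.
class MiniAttributeSource {
    private final Map<Class<? extends Attribute>, AttributeImpl> attributes = new LinkedHashMap<>();

    // Walks up the class hierarchy and registers the impl under every
    // interface that extends Attribute (the addAttributeImpl behavior).
    public void addAttributeImpl(AttributeImpl impl) {
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    attributes.put(iface.asSubclass(Attribute.class), impl);
                }
            }
        }
    }

    public boolean hasAttribute(Class<? extends Attribute> iface) {
        return attributes.containsKey(iface);
    }

    // A singly linked list of deep-cloned attribute impls (the State idea).
    static final class State {
        AttributeImpl impl;
        State next;
    }

    // Clones each *unique* impl once (one impl may back several interfaces).
    public State captureState() {
        Set<AttributeImpl> unique = new LinkedHashSet<>(attributes.values());
        State head = null, tail = null;
        for (AttributeImpl impl : unique) {
            State s = new State();
            s.impl = impl.clone();
            if (head == null) head = s; else tail.next = s;
            tail = s;
        }
        return head;
    }

    // Walks the list and copies each saved value back into the live impls.
    public void restoreState(State state) {
        for (State s = state; s != null; s = s.next) {
            for (AttributeImpl target : new LinkedHashSet<>(attributes.values())) {
                if (target.getClass() == s.impl.getClass()) s.impl.copyTo(target);
            }
        }
    }
}
```

Because captureState clones only the unique impl instances, a Token-style impl backing all interfaces is cloned exactly once per snapshot, which is the performance point the description makes.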
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735210#action_12735210 ]

Mark Miller edited comment on LUCENE-1693 at 7/24/09 4:59 PM:
--
Not sure what issue it stems from, but Token has a bunch of constructors that are deprecated but that don't point you to something new.

edit: must have come from the setBuffer stuff

was (Author: markrmil...@gmail.com):
Not sure what issue it stems from, but Token has a bunch of constructors that are deprecated but that don't point you to something new.
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733087#action_12733087 ]

Uwe Schindler edited comment on LUCENE-1693 at 7/19/09 4:05 PM:

Forgot CHANGES.txt entries with the backwards-break note and other changes.

When starting to transform the contrib streams, we should not forget to do the following (maybe add this note to the JIRA issues):
- rewrite and replace next(Token)/next() implementations by the new API
- if the class is final, no next(Token)/next() methods are needed *(they must be removed!!!)*
- if the class is non-final, add the following methods to the class:
{code}
/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 *  not be overridden. Delegates to the backwards compatibility layer. */
public final Token next(final Token reusableToken) throws java.io.IOException {
  return super.next(reusableToken);
}

/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 *  not be overridden. Delegates to the backwards compatibility layer. */
public final Token next() throws java.io.IOException {
  return super.next();
}
{code}
Also the incrementToken() method must be final in this case.

was (Author: thetaphi):
Forgot CHANGES.txt entries with the backwards-break note and other changes.

When starting to transform the contrib streams, we should not forget to do the following (maybe add this note to the JIRA issues):
- rewrite and replace next(Token)/next() implementations by the new API
- if the class is final, no next(Token)/next() methods are needed *(they must be removed!!!)*
- if the class is non-final, add the following methods to the class:
{code}
/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 *  not be overridden. Delegates to the backwards compatibility layer. */
public final Token next(final Token reusableToken) throws java.io.IOException {
  return super.next(reusableToken);
}

/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 *  not be overridden. Delegates to the backwards compatibility layer. */
public final Token next() throws java.io.IOException {
  return super.next();
}
{code}
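The recipe above (new-style streams implement only incrementToken(), while the deprecated next() entry points are pinned down as final so they can only delegate to the base class) can be sketched in a self-contained way. These classes are hypothetical stand-ins, not the real Lucene hierarchy, and the real methods additionally declare IOException:

```java
// Hypothetical mini TokenStream hierarchy sketching the backwards-compatibility
// pattern: the base class emulates the old next() on top of incrementToken().
abstract class CompatTokenStream {
    String term; // stands in for the attributes of the stream

    /** New API: advance to the next token, return false at end of stream. */
    public abstract boolean incrementToken();

    /** Old API, emulated on top of the new one by the base class. */
    public String next() {
        return incrementToken() ? term : null;
    }
}

// A non-final stream following the recipe: it implements only incrementToken()
// and declares next() final, so subclasses cannot break the delegation path.
class DigitStream extends CompatTokenStream {
    private int i = 0;

    @Override
    public boolean incrementToken() {
        if (i >= 3) return false;
        term = "token" + i++;
        return true;
    }

    /** @deprecated delegates to the backwards compatibility layer. */
    @Deprecated
    public final String next() { return super.next(); }
}
```

Declaring the override final is the whole point of the recipe: old callers still compile against next(), but no subclass can override it and silently fall out of the delegation chain.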
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732878#action_12732878 ]

Uwe Schindler edited comment on LUCENE-1693 at 7/18/09 4:06 AM:

In this case we should rename TeeSink to something like SplitTokenStream (which does not extend TokenStream). One could then get any number of sinks from it:
{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}
In this case the caching would be done directly in the splitter, and the sinks are only consumers. The first sink that asks for the attribute states forces the splitter to harvest and cache the input stream (exactly like CachingTokenFilter does it). In principle it would be the same as a CachingTokenFilter.

But on the other hand: you can always create a CachingTokenFilter and reuse the same instance for different fields. Because the indexer always calls reset() before consuming, you could re-read it easily. Any additional filters could then be plugged in front of it for each field; in this case the order is not important:
{code}
TokenStream stream = new CachingTokenFilter(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}
This would not work if the indexer can consume the different fields in parallel. But in the current state it would not even work with Tee/Sink (not multithread-compatible).

was (Author: thetaphi):
In this case we should rename TeeSink to something like SplitTokenStream (which does not extend TokenStream). One could then get any number of sinks from it:
{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}
In this case the caching would be done directly in the splitter, and the sinks are only consumers. The first sink that asks for the attribute states forces the splitter to harvest and cache the input stream (exactly like CachingTokenStream does it). In principle it would be the same as a CachingTokenStream. But on the other hand: you can always create a CachingTokenStream and reuse the same instance for different fields. Because the indexer always calls reset() before consuming, you could re-read it easily. Any additional filters could then be plugged in front of it for each field; in this case the order is not important:
{code}
TokenStream stream = new CachingTokenStream(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}
This would not work if the indexer can consume the different fields in parallel. But in the current state it would not even work with Tee/Sink (not multithread-compatible).
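The splitter idea proposed above (cache the input once on first consumption, then hand out any number of replaying sinks) can be sketched with plain iterators standing in for TokenStreams. SplitTokenStream and newSinkTokenStream are the names from the comment, used here as a hypothetical API, not something that exists in Lucene:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the proposed SplitTokenStream: the splitter caches the
// input stream once and hands out any number of sinks that replay the cache.
class SplitTokenStream {
    private final Iterator<String> input;
    private final List<String> cache = new ArrayList<>();
    private boolean filled = false;

    SplitTokenStream(Iterator<String> input) { this.input = input; }

    // The first sink that is consumed forces the splitter to harvest the input
    // (exactly the lazy-caching behavior described in the comment).
    private void fill() {
        if (filled) return;
        while (input.hasNext()) cache.add(input.next());
        filled = true;
    }

    /** Creates a new sink that replays the cached token stream from the start. */
    Iterator<String> newSinkTokenStream() {
        return new Iterator<String>() {
            private int pos = 0;
            public boolean hasNext() { fill(); return pos < cache.size(); }
            public String next() { fill(); return cache.get(pos++); }
        };
    }
}
```

Each sink keeps only its own read position, so the sinks are pure consumers and the single cached copy lives in the splitter, which is the design point being argued for.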
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732407#action_12732407 ]

Michael Busch edited comment on LUCENE-1693 at 7/17/09 1:37 AM:

{quote}
As CachingTokenFilter has no API to access the cached attributes/tokens directly,
{quote}
Oh true :) well, scratch it from the TODO list... We knew it'd work conceptually the same for AttributeSource.State; unlike Tee/Sink, which wouldn't even be safe to use with the new API if it didn't have the getTokens() method, for the reasons I explained above.

{quote}
Another idea would be to clone all attribute impls and then add them to the sink - the factory would then not be used?
{quote}
Yes, I thought about this for a while. It would be nice to have this (i.e. cloning an AttributeSource) in general: you could reduce the costs of initializing the TokenStreams with onlyUseNewAPI=false. We just need to keep a static AttributeSource around that contains the wrapper and the mappings for the 6 default interfaces. Then, instead of constructing it every time, we just clone that AttributeSource for new TokenStreams. The query parser could do the same to keep initialization costs of the TokenStreams minimal, because it always needs the same attributes. I think it should be easy: we just need to implement clone() for AttributeSource.

was (Author: michaelbusch):
{quote}
As CachingTokenFilter has no API to access the cached attributes/tokens directly,
{quote}
Oh true :) well, scratch it from the TODO list... We knew it'd work conceptually the same for AttributeSource.State; unlike Tee/Sink, which wouldn't even be safe to use with the new API if it didn't have the getTokens() method, for the reasons I explained above.

{quote}
Another idea would be to clone all attribute impls and then add them to the sink - the factory would then not be used?
{quote}
Yes, I thought about this for a while. It would be nice to have this (i.e. cloning an AttributeSource) in general: you could reduce the costs of initializing the TokenStreams with onlyUseNewAPI=false. We just need to keep a static AttributeSource around that contains the wrapper and the mappings for the 6 default interfaces. Then, instead of constructing it every time, we just clone that AttributeSource for new TokenStreams. The query parser could do the same to keep initialization costs of the TokenStreams minimal, because it always needs the same attributes. I think it should be easy: we just need to implement clone() for AttributeSource.
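The prototype idea discussed above (keep one fully initialized template source and deep-clone it instead of re-running the reflection setup for every new stream) can be sketched as follows. CloneableAttributeSource is a hypothetical simplification; a plain map of mutable values stands in for the attribute impls:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the "clone a template AttributeSource" idea: deep-clone a fully
// initialized template so each new stream gets independent attribute state
// without paying the setup cost again.
class CloneableAttributeSource implements Cloneable {
    // attribute interface name -> mutable value (stands in for an AttributeImpl)
    Map<String, String[]> attributes = new LinkedHashMap<>();

    void addAttribute(String name) { attributes.put(name, new String[] { "" }); }

    @Override
    public CloneableAttributeSource clone() {
        try {
            CloneableAttributeSource copy = (CloneableAttributeSource) super.clone();
            copy.attributes = new LinkedHashMap<>();
            // Deep-clone each attribute so the clone's state is independent of
            // the template (a shallow copy would share the mutable arrays).
            for (Map.Entry<String, String[]> e : attributes.entrySet()) {
                copy.attributes.put(e.getKey(), e.getValue().clone());
            }
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}
```

The deep copy is the subtle part the comment is getting at: a default Object.clone() would share the attribute instances between the template and the new stream, so clone() has to be implemented explicitly.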
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732762#action_12732762 ]

Uwe Schindler edited comment on LUCENE-1693 at 7/17/09 3:31 PM:

In my opinion, this test that I was not able to reimplement using the new TeeSink is very theoretical. Someone who consumes two streams could always implement this very simply using one of two approaches:
- consume the first stream completely, then the other
- consume one token from the first, then one token from the second, and so on

Who invented the whole Tee/Sink originally, and for what? Maybe we should simply ask on java-user who has ever used it. I have never heard of Tee/Sink before Michael confronted me with his strange POSToken tests :-)

was (Author: thetaphi):
In my opinion, this test that I was not able to reimplement using the new TeeSink is very theoretical. Someone who consumes two streams could always implement this very simply using one of two approaches:
- consume the first stream completely, then the other
- consume one token from the first, then one token from the second, and so on

Who invented the whole Tee/Sink originally, and for what? Maybe we should simply ask on java-user who has ever used it. I have never heard of Tee/Sink before Michael confronted me with his strange POSToken tests :-)
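The two consumption strategies listed above can be sketched with plain iterators standing in for TokenStreams (a hypothetical simplification, just to make the two orderings concrete):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of the two ways a consumer of two token streams can proceed.
class TwoStreamConsumer {
    /** Consume the first stream completely, then the other. */
    static List<String> sequential(Iterator<String> a, Iterator<String> b) {
        List<String> out = new ArrayList<>();
        while (a.hasNext()) out.add(a.next());
        while (b.hasNext()) out.add(b.next());
        return out;
    }

    /** Consume one token from the first, then one from the second, and so on. */
    static List<String> interleaved(Iterator<String> a, Iterator<String> b) {
        List<String> out = new ArrayList<>();
        while (a.hasNext() || b.hasNext()) {
            if (a.hasNext()) out.add(a.next());
            if (b.hasNext()) out.add(b.next());
        }
        return out;
    }
}
```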
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731295#action_12731295 ] Uwe Schindler edited comment on LUCENE-1693 at 7/14/09 11:12 PM:

bq. However, if people are using next() in a chain with core streams/filters, then every single token will now be cloned, possibly multiple times, right?

There was a similar problem in 2.4: whenever you call next() to consume a stream, every token is a new instance. Previously the cloning was not needed, but it is now needed for backwards compatibility (a full private copy). In my opinion this is negligible, as indexing speed is what matters, and the indexer always uses incrementToken(). If somebody has written their own query parser that uses next() to consume the stream, speed is not really important there. Besides, next() is deprecated, and the docs (even in 2.4) state that it "will be slower than calling next(Token)".

bq. So if people are using streams/filters that implement next(Token) I think the performance should be comparable - even though there's also a (hopefully small) performance hit to expect because of more method calls and if checks.

I found no performance hit; it is about the same speed. The variance between test runs is bigger than any measurable performance impact. The other sensitive part (TokenWrapper): the wrapping using TokenWrapper was in the original indexing code, too (the private BackwardsCompatibilityStream class).

bq. It's failing right now, because it's testing a lot of different combinations. I need to check if all of those different tests are actually valid, because we're saying you can't use Tee/Sink with the new API anymore.

Have you seen my backwards-compatibility test, too? It is a copy of yours (with some variation). The Lucene24* classes were removed because Tee/SinkTokenizer were reverted in the patch to their original 2.4 status (they only implement the old API). As far as I can see (I have not yet tried it out), you are testing new-style-API streams with the old, deprecated Tee/Sink tokenizers. That was not possible before 2.9 (there was no new API), so the backwards-compatibility problem does not arise there. If you rewrite your streams with the new API, you should use TeeSinkTokenizer, too.

AttributeSource/TokenStream API improvements Key: LUCENE-1693 URL: https://issues.apache.org/jira/browse/LUCENE-1693 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.9 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestAPIBackwardsCompatibility.java,
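The contrast discussed above, between the deprecated next() contract (a private copy per token) and incrementToken() (one shared attribute instance overwritten in place), can be sketched with a toy model. The classes and names below are illustrative stand-ins, not the real Lucene classes:

```java
import java.util.ArrayList;
import java.util.List;

// Toy stream: one shared mutable "attribute" that incrementToken()
// overwrites in place, mimicking the new-API reuse pattern.
class ToyStream {
    private final String[] words = {"foo", "bar", "baz"};
    private int pos = 0;
    final StringBuilder termAtt = new StringBuilder(); // shared instance

    // New API style: advance and overwrite the shared attribute;
    // no per-token allocation.
    boolean incrementToken() {
        if (pos >= words.length) return false;
        termAtt.setLength(0);
        termAtt.append(words[pos++]);
        return true;
    }

    // Old API style: the caller may keep the returned value, so a
    // private copy (a fresh String) must be handed out per token.
    String next() {
        return incrementToken() ? termAtt.toString() : null;
    }
}

public class ConsumeDemo {
    // Consumer loop in the new style: copy only when keeping the value.
    static List<String> consume(ToyStream s) {
        List<String> terms = new ArrayList<>();
        while (s.incrementToken()) {
            terms.add(s.termAtt.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new ToyStream())); // [foo, bar, baz]
    }
}
```

The point of the sketch is that the allocation/cloning cost sits entirely in the old next() path, while incrementToken() consumers pay nothing unless they explicitly copy.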
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723727#action_12723727 ] Uwe Schindler edited comment on LUCENE-1693 at 6/24/09 1:46 PM:

In your example, you mean SinkTokenizer is from the core and only implements incrementToken(), while the other methods are handled by the defaults in TokenStream (using this reflection-based redirect)?

I tested this by overriding LowerCaseFilter. In my latest patch, LowerCaseFilter only implements incrementToken(); next() and next(Token) are handled by the forwarder in TokenStream. If you override next(Token) and consume the stream using incrementToken(), then your overridden next(Token) is never called. This exact same backwards-compatibility problem already existed when we changed from next() to next(Token): if a core filter only implemented next(Token), and a subclass (from an older code base) overrode only next() and called super() there, the same thing happened: the tokenizer code called next(Token), but next() was never called. You have the same problem with our recent deprecations in DocIdSetIterator. I would not see this as a problem; it can be found everywhere in Lucene where we deprecated (originally abstract) methods and replaced them with newer ones that call the old ones by default.

AttributeSource/TokenStream API improvements Key: LUCENE-1693 URL: https://issues.apache.org/jira/browse/LUCENE-1693 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.9 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java
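The override-bypass problem described in this comment can be shown in miniature. The method names below model Lucene's old next() / new incrementToken() pair, but they return a plain String instead of real tokens to keep the sketch self-contained:

```java
// Base models a core class that implements only the new API, with the
// deprecated old method forwarding to it by default.
class Base {
    // New API; core classes implement only this.
    public String incrementToken() {
        return "lowercased";
    }

    // Old, deprecated API; the default forwards to the new method.
    @Deprecated
    public String next() {
        return incrementToken();
    }
}

// A user subclass from an older code base overrides only the old method.
class LegacySubclass extends Base {
    @Override
    public String next() {
        return super.next() + "+custom";
    }
}

public class BypassDemo {
    public static void main(String[] args) {
        Base b = new LegacySubclass();
        // A consumer on the new API never routes through the subclass's
        // next(), so the "+custom" logic is silently skipped:
        System.out.println(b.incrementToken()); // lowercased
        // Only old-API consumers see the customization:
        System.out.println(b.next()); // lowercased+custom
    }
}
```

This is exactly the shape of the problem Uwe points out: the new-API consumer dispatches to incrementToken(), which the legacy subclass never overrode.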
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723759#action_12723759 ] Michael Busch edited comment on LUCENE-1693 at 6/24/09 2:50 PM:

{quote} By the way: I want to have my beer!!! {quote}

I decided a while ago that you'll definitely get it, because of your admirable tenacity. :)

AttributeSource/TokenStream API improvements Key: LUCENE-1693 URL: https://issues.apache.org/jira/browse/LUCENE-1693 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.9 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java

This patch makes the following improvements to AttributeSource and TokenStream/Filter:

- removes the set/getUseNewAPI() methods (including the standard ones). Instead, by default incrementToken() throws a subclass of UnsupportedOperationException. The indexer initially calls incrementToken() once to see if the exception is thrown; if so, it falls back to the old API.
- introduces interfaces for all Attributes. The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.
- adds a new method to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection, it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface-instance mappings to the attribute map for each of the found interfaces.
- AttributeImpl now has a default implementation of toString() that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.
- Cloning is now done much more efficiently in captureState(). The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a single linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState() recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attributes that the target stream (e.g. SinkTokenizer) uses.

The cloning performance can be greatly improved if multiple AttributeImpl instances are not used in one TokenStream. A user can, e.g., simply add a Token instance to the stream instead of the individual attributes. Or the user could implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed. I think this (addAttributeImpl) should be considered an expert API, as this manual optimization is only needed if cloning performance is crucial. I ran some quick performance tests using Tee/Sink tokenizers (which do cloning) and the performance was roughly 20% faster with the new API. I'll run some more performance tests and post more numbers then.

Note also that when we add serialization to the Attributes, e.g. for supporting storing serialized TokenStreams in the index, the serialization should benefit even more significantly from the new API than cloning does. Also, the TokenStream API does not change, except for the removal of the set/getUseNewAPI methods, so the patches in LUCENE-1460 should still work. All core tests pass; however, I need to update all the documentation and also add some unit tests for the new AttributeSource functionality. So this patch is not ready to commit yet, but I wanted to post it already for some feedback.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. -
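The 'Impl'-postfix factory convention from the patch description can be sketched in a few lines. The TermAttribute interface below is a stripped-down stand-in for illustration, not the real Lucene interface, and the lookup assumes the Impl class is resolvable by name on the classpath:

```java
// Minimal sketch of the described factory convention: for an Attribute
// interface, load the implementation class named <Interface>Impl via
// reflection.
interface TermAttribute {
    void setTerm(String term);
    String term();
}

// Found by the factory purely through the 'Impl' naming convention.
class TermAttributeImpl implements TermAttribute {
    private String term = "";
    public void setTerm(String term) { this.term = term; }
    public String term() { return term; }
}

public class FactoryDemo {
    // Resolve interface -> implementation by appending "Impl" to the name.
    static Object create(Class<?> attInterface) {
        try {
            Class<?> implClass = Class.forName(attInterface.getName() + "Impl");
            return implClass.getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new RuntimeException(
                    "no Impl class for " + attInterface.getName(), e);
        }
    }

    public static void main(String[] args) {
        TermAttribute att = (TermAttribute) create(TermAttribute.class);
        att.setTerm("lucene");
        System.out.println(att.term()); // lucene
    }
}
```

The appeal of the convention is that callers only ever name the interface; the concrete class is an implementation detail that a custom factory could swap out.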
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720534#action_12720534 ] Uwe Schindler edited comment on LUCENE-1693 at 6/17/09 12:39 AM:

Hi Michael, in principle your test is invalid. It has other token filters in the chain, over which the user has no control. With the two mentioned filters it may work, because they do not change the reusableToken instance. But the API clearly states that the reusableToken need not be used and another instance can be returned, so this is really unsupported behaviour. If you remove the filters in between, it would work correctly. And this could even fail with 2.4 if you put other token filters into your chain. In my opinion, the advantages of token reuse clearly outweigh the small problems with (unsupported) usage. The API does exactly what is mentioned in the API docs for 2.4.1. The main advantage is that you can mix old and new filter instances and you lose nothing...

AttributeSource/TokenStream API improvements Key: LUCENE-1693 URL: https://issues.apache.org/jira/browse/LUCENE-1693 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.9 Attachments: LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720052#action_12720052 ] Uwe Schindler edited comment on LUCENE-1693 at 6/16/09 3:43 AM:

{quote}
bq. What was your conclusion about my idea yesterday, to pass the Token around even with the old API and copy it on demand

I don't think the indexer should know about Token at all? It should only use the interfaces so that we can maintain the full flexibility. Also, I don't really like the fact that some users might get a performance hit. I had the idea to throw the exception in incrementToken() to automatically be able to fall back to the old API. I think this is nice and gets rid of the explicit useNewAPI methods. The only drawback is still the fact that we have to implement both old and new APIs in Lucene's tokenizers and filters until we remove the old API. Grant won't like this, but I think this is better than the possible performance hit? Also, we don't add new filters THAT often to Lucene, and implementing both methods is often mostly copy-paste.
{quote}

I did not mean to use Token in the indexer; I would like to remove the whole old API from the indexer (even the UOE). My idea would be the following:

- Let Token implement the default interfaces; the default factory creates it for all requests for the default attributes.
- If somebody implements the new API, the indexer can use it without problems. If he doesn't, the default implementation in TokenStream would call next(Token), using the Token instance from the AttributeSource. If the method returns another Token instance (because it did not reuse, which is seldom, I think), copy the returned token into the per-instance AttributeSource Token.
- The other way round: if one uses a TokenStream through the old API (own indexer, query parser, ...) and the TokenStream only implements the new API, the deprecated old method would also have a default implementation, ignoring the token supplied to next() and always returning the instance token after calling incrementToken().

Because of this, the indexer would always use the new API; the old API is wrapped. Core analyzers only need to implement the new methods (or could even stay with the old ones). There are two problems:

- If somebody implements neither method, the call to incrementToken() will loop endlessly; there should be some early break with a UOE in this case. I do not know how to implement this without inspecting the stack trace in the default methods.
- Filters are a bit tricky: both methods could, by default, pass through to the delegate. The problem: if one implements only one of the methods, the other passes down to the delegate and does nothing. But this could be fixed by always delegating in the deprecated old method to the new one (or vice versa).

Hope this was clear; maybe I should create a patch.
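The mutual-delegation scheme sketched in this comment, along with the endless-loop hazard it raises, can be modeled in a few lines. The names are illustrative, and the re-entrancy flag used as the "early break" is my own simplification (the comment itself leaves open how Lucene would detect the loop, e.g. via stack inspection):

```java
// Each API's default wraps the other, so a stream may implement either
// method. A guard flag turns the otherwise endless mutual recursion into
// an UnsupportedOperationException.
class Stream {
    private boolean delegating = false;

    // New API default: wrap the old method.
    public String incrementToken() {
        if (delegating) throw new UnsupportedOperationException(
                "neither incrementToken() nor next() is implemented");
        delegating = true;
        try {
            return next();
        } finally {
            delegating = false;
        }
    }

    // Old API default: wrap the new method.
    public String next() {
        if (delegating) throw new UnsupportedOperationException(
                "neither incrementToken() nor next() is implemented");
        delegating = true;
        try {
            return incrementToken();
        } finally {
            delegating = false;
        }
    }
}

// Implements only the new API; old-API callers are served via the wrapper.
class NewStyle extends Stream {
    @Override
    public String incrementToken() {
        return "token";
    }
}

public class WrapDemo {
    public static void main(String[] args) {
        System.out.println(new NewStyle().incrementToken()); // token
        System.out.println(new NewStyle().next());           // token (wrapped)
        try {
            new Stream().incrementToken(); // neither method implemented
        } catch (UnsupportedOperationException e) {
            System.out.println("early break instead of endless loop");
        }
    }
}
```

With this shape, an old-API consumer calling next() on a new-style stream transparently lands in incrementToken(), which is exactly the "old API is wrapped" behaviour the comment proposes.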
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720237#action_12720237 ] Uwe Schindler edited comment on LUCENE-1693 at 6/16/09 9:57 AM:

Grant: But with the new TokenStream API, you must always clone the attributes before caching, because the TokenFilter only has *one* final working instance per Attribute, whose contents change all the time. In this case it would not be better than wrapping next(Token) with the new API, feeding in the current instance token, and copying it back into the instance one if next(Token) returns a different instance.

AttributeSource/TokenStream API improvements Key: LUCENE-1693 URL: https://issues.apache.org/jira/browse/LUCENE-1693 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.9 Attachments: lucene-1693.patch
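The point about cloning before caching, which is also what captureState()/restoreState() in the patch description address, comes down to one shared attribute instance being overwritten on every step. A toy model (illustrative names, not the Lucene classes):

```java
import java.util.ArrayList;
import java.util.List;

// One working "attribute" instance whose contents are overwritten on
// every token; a caching consumer must deep-copy it, as captureState()
// does in the patch.
class Att implements Cloneable {
    String term;

    @Override
    public Att clone() {
        try {
            return (Att) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: we implement Cloneable
        }
    }
}

public class CaptureDemo {
    static List<String> cacheTerms(String[] words) {
        Att shared = new Att();            // the single working instance
        List<Att> cache = new ArrayList<>();
        for (String w : words) {
            shared.term = w;
            cache.add(shared.clone());     // "captureState": copy before caching
        }
        shared.term = "overwritten";       // later mutation must not touch the cache
        List<String> out = new ArrayList<>();
        for (Att a : cache) {
            out.add(a.term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(cacheTerms(new String[] {"a", "b"})); // [a, b]
    }
}
```

Caching `shared` itself instead of `shared.clone()` would leave the cache holding N references to one object, all reading "overwritten", which is why the clone per cached token is unavoidable under the shared-instance contract.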