[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729819#action_12729819 ]
Michael McCandless commented on LUCENE-1693: -------------------------------------------- I'm still trying to wrap my brain around this issue :) So to sum up where things [seem] to stand here: * We are switching to separate interface + impl for token attributes. This is nice because a single class (eg Token.java) can provide multiple attrs, it makes cloning more efficient, etc. * We are keeping good old Token.java around; it simply implements all the attrs, and is very convenient to use. I think this is a great improvement to the new tokenStream API. It's also a sizable perf gain to clone only one impl that has only the attrs you care about. Uwe then extended the original patch: * Use reflection on init'ing a TokenStream to determine which of the 3 APIs (old, newer, newest) it implements * Indexer (and in general any consumer) now only has to consume the new API. Any parts of the chain that aren't new are automatically wrapped. * Core tokenizers/filters (and contrib, when we get to them) only implement the new API; old API is "wrapped on demand" if needed * One can mix & match old, new tokenizers, though at some perf cost. But that's actually OK since the original patch it's "all or none", ie your chain must be entirely new or old * I think (?) a chain of all-old-tokenizer/filters and all-new wouldn't see a perf hit? Wrapping only happens when there's a mismatch b/w two filters in the chain? I'm tentatively optimistic about the extended patch... (though these back-compat, corner cases, etc., are real hard to think about). I agree it's doing alot of "magic" under the hood to figure out how to best wrap things, but the appeal of only implementing the new api in core/contrib tokenizers, only consuming new, being free to mix&match, etc, is strong. One concern is: TokenStream.initialize looks spookily heavy weight; eg I don't know the "typical" cost of reflection. I think there are likely many apps out there that make a new TokenStream for ever field that's analyzed (ie implement Analyzer.tokenStream not reusableTokenStream) and this (introspection every time) could be a sizable perf hit. Uwe was this included in your perf tests? > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, > lucene-1693.patch, TestCompatibility.java, TestCompatibility.java, > TestCompatibility.java, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org