[
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729819#action_12729819
]
Michael McCandless commented on LUCENE-1693:
--------------------------------------------
I'm still trying to wrap my brain around this issue :)
So to sum up where things [seem] to stand here:
* We are switching to separate interface + impl for token
attributes. This is nice because a single class (eg Token.java)
can provide multiple attrs, it makes cloning more efficient, etc.
* We are keeping good old Token.java around; it simply implements all
the attrs, and is very convenient to use.
I think this is a great improvement to the new tokenStream API. It's
also a sizable perf gain to clone only one impl that has only the
attrs you care about.
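To make the interface/impl split concrete, here is a minimal,
self-contained sketch (stand-in names, not the exact Lucene 2.9
signatures): consumers depend only on the attribute interfaces, while
one impl class can back several of them at once, so a single clone()
covers everything.

  // Stand-in marker interface, playing the role of
  // org.apache.lucene.util.Attribute.
  interface Attribute {}

  // What consumers (eg the indexer) depend on.
  interface TermAttribute extends Attribute {
    String term();
    void setTerm(String term);
  }

  interface OffsetAttribute extends Attribute {
    int startOffset();
    int endOffset();
    void setOffset(int start, int end);
  }

  // One impl class providing several attributes at once, in the
  // spirit of Token.java implementing all the attr interfaces.
  class TokenLikeImpl implements TermAttribute, OffsetAttribute, Cloneable {
    private String term = "";
    private int start, end;

    public String term() { return term; }
    public void setTerm(String term) { this.term = term; }
    public int startOffset() { return start; }
    public int endOffset() { return end; }
    public void setOffset(int start, int end) { this.start = start; this.end = end; }

    // One clone covers every attribute this impl provides, which is
    // where the cloning win comes from.
    public TokenLikeImpl clone() {
      try {
        return (TokenLikeImpl) super.clone();
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e);
      }
    }
  }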
Uwe then extended the original patch:
* Use reflection on init'ing a TokenStream to determine which of the
3 APIs (old, newer, newest) it implements
* Indexer (and in general any consumer) now only has to consume the
new API. Any parts of the chain that aren't new are automatically
wrapped.
* Core tokenizers/filters (and contrib, when we get to them) only
implement the new API; old API is "wrapped on demand" if needed
* One can mix & match old and new tokenizers, though at some perf
cost. But that's actually OK, since with the original patch it's "all
or none", ie your chain must be entirely new or entirely old
* I think (?) a chain of all-old tokenizers/filters, or of all-new
ones, wouldn't see a perf hit? Wrapping only happens when there's a
mismatch between two filters in the chain? (A sketch of such a
wrapper follows below.)
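To make the wrapping idea concrete, a rough, self-contained sketch
using stand-in types (not the actual classes from Uwe's patch): a
consumer written only against incrementToken() can still drive an old
next()-style stream through a small adapter, and that adapter is only
needed at the points where old and new meet.

  import java.io.IOException;

  // Stand-in for the old API: next() returns a Token or null at end.
  interface OldStream {
    Token next() throws IOException;
  }

  // Stand-in for the new API: incrementToken() advances in place and
  // the consumer reads the current values from attributes.
  interface NewStream {
    boolean incrementToken() throws IOException;
    String term();   // stands in for reading a TermAttribute
  }

  class Token {
    final String term;
    Token(String term) { this.term = term; }
  }

  // Bridge so a new-API-only consumer can drive an old-API stream.
  class OldToNewAdapter implements NewStream {
    private final OldStream delegate;
    private Token current;

    OldToNewAdapter(OldStream delegate) { this.delegate = delegate; }

    public boolean incrementToken() throws IOException {
      current = delegate.next();
      return current != null;
    }

    public String term() { return current.term; }
  }

  // The consumer (eg the indexer) only ever sees the new API:
  //   while (stream.incrementToken()) { index(stream.term()); }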
I'm tentatively optimistic about the extended patch... (though these
back-compat corner cases, etc., are really hard to think about). I
agree it's doing a lot of "magic" under the hood to figure out how to
best wrap things, but the appeal of only implementing the new API in
core/contrib tokenizers, only consuming the new API, and being free
to mix & match, etc., is strong.
One concern is: TokenStream.initialize looks spookily heavyweight; eg
I don't know the "typical" cost of reflection. I think there are
likely many apps out there that make a new TokenStream for every field
that's analyzed (ie they implement Analyzer.tokenStream, not
reusableTokenStream), and this (introspection every time) could be a
sizable perf hit. Uwe, was this included in your perf tests?
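For reference, a sketch of what the reuse path looks like on the
Analyzer side (modeled on how the core analyzers implement
reusableTokenStream around this time; the exact signatures should be
double-checked): any per-stream setup cost, including reflection in
initialize, is then paid once per thread rather than once per field.

  import java.io.IOException;
  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.WhitespaceTokenizer;

  public class ReusingWhitespaceAnalyzer extends Analyzer {

    // Non-reusing path: a brand new stream (and its initialization
    // cost) for every field that gets analyzed.
    public TokenStream tokenStream(String fieldName, Reader reader) {
      return new WhitespaceTokenizer(reader);
    }

    // Reusing path: cache one tokenizer per thread and just point it
    // at the next Reader.
    public TokenStream reusableTokenStream(String fieldName, Reader reader)
        throws IOException {
      Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
      if (tokenizer == null) {
        tokenizer = new WhitespaceTokenizer(reader);
        setPreviousTokenStream(tokenizer);
      } else {
        tokenizer.reset(reader);
      }
      return tokenizer;
    }
  }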
> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
> Key: LUCENE-1693
> URL: https://issues.apache.org/jira/browse/LUCENE-1693
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch,
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch,
> lucene-1693.patch, TestCompatibility.java, TestCompatibility.java,
> TestCompatibility.java, TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - removes the set/getUseNewAPI() methods (including the standard
> ones). Instead by default incrementToken() throws a subclass of
> UnsupportedOperationException. The indexer tries to call
> incrementToken() initially once to see if the exception is thrown;
> if so, it falls back to the old API.
> - introduces interfaces for all Attributes. The corresponding
> implementations have the postfix 'Impl', e.g. TermAttribute and
> TermAttributeImpl. AttributeSource now has a factory for creating
> the Attribute instances; the default implementation looks for
> implementing classes with the postfix 'Impl'. Token now implements
> all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
> addAttributeImpl(AttributeImpl). Using reflection it walks up in the
> class hierarchy of the passed in object and finds all interfaces
> that the class or superclasses implement and that extend the
> Attribute interface. It then adds the interface->instance mappings
> to the attribute map for each of the found interfaces.
> - AttributeImpl now has a default implementation of toString that uses
> reflection to print out the values of the attributes in a default
> formatting. This makes it a bit easier to implement AttributeImpl,
> because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
> captureState. The method figures out which unique AttributeImpl
> instances are contained as values in the attributes map, because
> those are the ones that need to be cloned. It creates a single
> linked list that supports deep cloning (in the inner class
> AttributeSource.State). AttributeSource keeps track of when this
> state changes, i.e. whenever new attributes are added to the
> AttributeSource. Only in that case will captureState recompute the
> state, otherwise it will simply clone the precomputed state and
> return the clone. restoreState(AttributeSource.State) walks the
> linked list and uses the copyTo() method of AttributeImpl to copy
> all values over into the attribute that the source stream
> (e.g. SinkTokenizer) uses.
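> To illustrate, the captured state is conceptually just a small
> linked list over the unique AttributeImpl instances that supports
> deep cloning, roughly along these lines (simplified, not the exact
> patch code):
>
>   import org.apache.lucene.util.AttributeImpl;
>
>   // Simplified stand-in for the nested class AttributeSource.State.
>   final class State implements Cloneable {
>     AttributeImpl attribute;  // one unique impl instance of the source
>     State next;               // next unique impl, or null at the end
>
>     public Object clone() {
>       State clone = new State();
>       clone.attribute = (AttributeImpl) attribute.clone();
>       if (next != null) {
>         clone.next = (State) next.clone();
>       }
>       return clone;
>     }
>   }
>
> captureState then only has to clone this list, and restoreState
> walks it, calling copyTo() on each element as described above.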
> Cloning performance can be greatly improved if multiple AttributeImpl
> instances are not used in one TokenStream. For example, a user can
> simply add a Token instance to the stream instead of the individual
> attributes. Or the user could implement a subclass of AttributeImpl that
> implements exactly the Attribute interfaces needed. I think this
> should be considered an expert API (addAttributeImpl), as this manual
> optimization is only needed if cloning performance is crucial. I ran
> some quick performance tests using Tee/Sink tokenizers (which do
> cloning) and the performance was roughly 20% faster with the new
> API. I'll run some more performance tests and post more numbers then.
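> As a sketch of that expert usage (method names as in this patch,
> details simplified), a TokenStream could register a single Token
> instance as the impl behind all of its attributes:
>
>   import org.apache.lucene.analysis.Token;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>   import org.apache.lucene.analysis.tokenattributes.TermAttribute;
>
>   public final class SingleImplStream extends TokenStream {
>     private final Token token = new Token();  // one impl for all attrs
>     private final TermAttribute termAtt;
>     private final OffsetAttribute offsetAtt;
>     private boolean done = false;
>
>     public SingleImplStream() {
>       // Expert API: map every Attribute interface Token implements
>       // to this single instance.
>       addAttributeImpl(token);
>       // These lookups now resolve to the Token registered above.
>       termAtt = (TermAttribute) addAttribute(TermAttribute.class);
>       offsetAtt = (OffsetAttribute) addAttribute(OffsetAttribute.class);
>     }
>
>     public boolean incrementToken() {
>       if (done) return false;   // emit one hard-coded token, then stop
>       done = true;
>       termAtt.setTermBuffer("example");
>       offsetAtt.setOffset(0, 7);
>       return true;
>     }
>   }
>
> With this setup captureState only has one impl instance to clone.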
> Note also that when we add serialization to the Attributes, e.g. to
> support storing serialized TokenStreams in the index, serialization
> should benefit from the new API even more significantly than cloning
> does.
> Also, the TokenStream API does not change, except for the removal
> of the set/getUseNewAPI methods. So the patches in LUCENE-1460
> should still work.
> All core tests pass, however, I need to update all the documentation
> and also add some unit tests for the new AttributeSource
> functionality. So this patch is not ready to commit yet, but I wanted
> to post it already for some feedback.