[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

Michael Busch (JIRA) Fri, 10 Jul 2009 00:51:44 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729565#action_12729565
 ]


Michael Busch commented on LUCENE-1693:
---------------------------------------

Thanks for all your hard work here, Uwe.

I think this patch is as good as it can be for achieving the goal of being able 
to combine the old and the new API.
And I agree that my patch that I posted here has the same potential 
backwards-compatibility problems regarding inheritance. 

I think in the majority of use cases we're fine here. Only the corner cases 
make me a bit nervous. I think the case I feel most uncomfortable with is when 
people use Lucene + some external analyzer package + their own subclasses. If 
they use Lucene 2.9, the external package is not upgraded to the new API yet, 
but they did upgrade their own classes already to the new API, then they might 
run into undefined problems. However, I don't even know how many such 
"external analyzer packages" exist (well, I think Grant mentioned he was 
working on one...)

And I still just have this not-going-away slightly bad feeling in my gut that 
there are still other corner case problems we haven't thought about yet. What 
makes this feeling worse is the fact that those problems might not result in 
exceptions, but in unexpected and hard-to-find search problems, because the 
wrong tokens were indexed.

The current patch uses reflection extensively to figure out which of the three 
APIs the user has implemented. The comments above mention the possible 
problems. The solution is cool, but also a bit hack-ish (no offense Uwe, you 
called it that yourself ;) )
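
To make the reflection idea concrete, here is a minimal sketch of how one could detect which API a subclass implements. It is not the code from the patch: SketchStream, OldOnlyStream, NewOnlyStream and the overrides() helper are hypothetical names for illustration, and the sketch models only two APIs, not three.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a stream base class that offers an old and a new API.
// Both defaults throw, so a subclass has to override at least one of them.
abstract class SketchStream {
    // Old-style API: returns the next token text, or null at end of stream.
    public String next() {
        throw new UnsupportedOperationException("old API not implemented");
    }

    // New-style API: advances to the next token, returns false at end of stream.
    public boolean incrementToken() {
        throw new UnsupportedOperationException("new API not implemented");
    }

    // Returns true if the given class (or one of its superclasses below
    // SketchStream) declares its own version of the named no-arg method.
    static boolean overrides(Class<? extends SketchStream> clazz, String methodName) {
        try {
            Method m = clazz.getMethod(methodName);
            return m.getDeclaringClass() != SketchStream.class;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}

// Implements only the old API.
class OldOnlyStream extends SketchStream {
    @Override
    public String next() {
        return null;
    }
}

// Implements only the new API.
class NewOnlyStream extends SketchStream {
    @Override
    public boolean incrementToken() {
        return false;
    }
}

class ApiDetectionSketch {
    public static void main(String[] args) {
        System.out.println(SketchStream.overrides(OldOnlyStream.class, "next"));
        System.out.println(SketchStream.overrides(NewOnlyStream.class, "incrementToken"));
    }
}
```

A check like this only reports where a method is declared. It cannot tell whether an override in the middle of a class hierarchy delegates properly, and that is the kind of inheritance corner case I am worried about.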

So, having said all this, I'd like other people to chime in here and give their 
opinion. I'm okay with committing this solution if everyone else is too.

I think the only solution to not break compatibility at all is to not touch the 
old API at all and provide APIs that switch on/off using the new API. That's 
what the code in trunk currently does. It has the major disadvantage that it 
doesn't allow combining the old and new API in the same chain, and that we have 
to implement both APIs in core Lucene until the old API is fully removed.

So Mike, Grant, Mark, or others, could you please comment here?

PS: Uwe, in any case, your solution is cool and I like how cleverly you solved 
the problems!!


> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
>                 Key: LUCENE-1693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1693
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> lucene-1693.patch, TestCompatibility.java, TestCompatibility.java, 
> TestCompatibility.java, TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead, by default incrementToken() throws a subclass of
>   UnsupportedOperationException. The indexer tries to call
>   incrementToken() initially once to see if the exception is thrown;
>   if so, it falls back to the old API.
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to be cloned. It creates a singly
>   linked list that supports deep cloning (in the inner class
>   AttributeSource.State). AttributeSource keeps track of when this
>   state changes, i.e. whenever new attributes are added to the
>   AttributeSource. Only in that case will captureState recompute the
>   state, otherwise it will simply clone the precomputed state and
>   return the clone. restoreState(AttributeSource.State) walks the
>   linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses. 
> The cloning performance can be greatly improved if a single
> AttributeImpl instance is used in one TokenStream instead of several. A user can
> e.g. simply add a Token instance to the stream instead of the individual
> attributes. Or the user could implement a subclass of AttributeImpl that
> implements exactly the Attribute interfaces needed. I think this
> should be considered an expert API (addAttributeImpl), as this manual
> optimization is only needed if cloning performance is crucial. I ran
> some quick performance tests using Tee/Sink tokenizers (which do
> cloning) and the performance was roughly 20% faster with the new
> API. I'll run some more performance tests and post more numbers then.
> Note also that when we add serialization to the Attributes, e.g. to
> support storing serialized TokenStreams in the index, then the
> serialization should benefit from the new API even more
> than cloning does. 
> Also, the TokenStream API does not change, except for the removal 
> of the set/getUseNewAPI methods. So the patches in LUCENE-1460
> should still work.
> All core tests pass, however, I need to update all the documentation
> and also add some unit tests for the new AttributeSource
> functionality. So this patch is not ready to commit yet, but I wanted
> to post it already for some feedback. 
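
The addAttributeImpl(AttributeImpl) behavior described in the quoted issue text could be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: SketchAttribute, SketchTermAttribute, SketchOffsetAttribute, TermAndOffsetImpl and AttributeMapSketch are hypothetical names, and the sketch inspects only the interfaces each class in the hierarchy declares directly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical marker interface standing in for the Attribute interface.
interface SketchAttribute {}

// Two hypothetical attribute interfaces.
interface SketchTermAttribute extends SketchAttribute {
    String term();
}

interface SketchOffsetAttribute extends SketchAttribute {
    int startOffset();
}

// One implementation object that serves both interfaces.
class TermAndOffsetImpl implements SketchTermAttribute, SketchOffsetAttribute {
    @Override
    public String term() {
        return "lucene";
    }

    @Override
    public int startOffset() {
        return 0;
    }
}

class AttributeMapSketch {
    // Maps each attribute interface to the instance that implements it.
    private final Map<Class<? extends SketchAttribute>, Object> attributes =
            new LinkedHashMap<>();

    // Walks up the class hierarchy of impl and registers it under every
    // directly declared interface that extends SketchAttribute.
    void addAttributeImpl(Object impl) {
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                boolean isAttributeInterface = iface != SketchAttribute.class
                        && SketchAttribute.class.isAssignableFrom(iface);
                if (isAttributeInterface) {
                    // Keep an existing mapping if one was registered earlier.
                    attributes.putIfAbsent(iface.asSubclass(SketchAttribute.class), impl);
                }
            }
        }
    }

    // Returns the instance registered for the given interface, or null.
    <A extends SketchAttribute> A getAttribute(Class<A> type) {
        return type.cast(attributes.get(type));
    }

    int size() {
        return attributes.size();
    }
}
```

Because both interfaces map to the same instance, a captured state would hold one object instead of two, which is the cloning optimization the description mentions.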

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
