[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721426#action_12721426 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

I only tested performance with the Lucene benchmarker on the various standard 
analyzers. After the patch, tokenizer.alg produces the same results as before 
in almost the same time (the run-to-run variation is larger than the 
difference). With an unmodified benchmarker this is expected: the benchmarker's 
tokenizer task still calls the deprecated next(Token), and since all core 
analyzers still implement it directly, no wrapping occurs. I then modified the 
core token streams and filters that were used, removing next(Token) and 
leaving only incrementToken() available; in this case the speed difference was 
also not measurable in my configuration (Thinkpad T60, Core Duo, Win32). I 
also changed some of the filters to implement only next(Token) and others only 
incrementToken(), to get a completely mixed old/new API chain, and still got 
the same results (and the same tokenization output, as seen in indexes 
generated from Wikipedia). I also changed the benchmarker to use 
incrementToken(), which worked fine as well.
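To make the mixed-chain scenario concrete, here is a self-contained sketch of the idea. The names (TermAttribute, incrementToken, next) mirror the API discussed in this issue, but these are simplified stand-ins, not the real Lucene classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for the Lucene analysis classes, only to show the
// shape of a chain that mixes the old next(Token)-style API with the new
// incrementToken() API via a backwards-compatibility wrapper.
public class MixedChainSketch {

  // The single shared attribute instance all new-style stages mutate
  // (stands in for Lucene's TermAttribute).
  static class TermAttribute {
    String term = "";
  }

  // New-style stage: only incrementToken().
  interface NewStream {
    boolean incrementToken();
  }

  // Old-style stage: only a deprecated next()-like method
  // (returns null when exhausted, like next(Token) returning null).
  interface OldStream {
    String next();
  }

  // Backwards-compatibility wrapper: exposes an old-style stream
  // through the new single-method API by copying into the attribute.
  static NewStream wrap(final OldStream old, final TermAttribute att) {
    return new NewStream() {
      public boolean incrementToken() {
        String t = old.next();
        if (t == null) {
          return false;
        }
        att.term = t;
        return true;
      }
    };
  }

  // A new-style filter implementing only incrementToken(): upper-cases
  // whatever the previous stage left in the shared attribute.
  static NewStream upperCaseFilter(final NewStream input, final TermAttribute att) {
    return new NewStream() {
      public boolean incrementToken() {
        if (!input.incrementToken()) {
          return false;
        }
        att.term = att.term.toUpperCase();
        return true;
      }
    };
  }

  // The consumer drives the whole mixed chain through incrementToken() only.
  static List<String> tokenize(String text) {
    final Iterator<String> words = Arrays.asList(text.split(" ")).iterator();
    final TermAttribute att = new TermAttribute();
    OldStream oldTokenizer = new OldStream() {
      public String next() {
        return words.hasNext() ? words.next() : null;
      }
    };
    NewStream chain = upperCaseFilter(wrap(oldTokenizer, att), att);
    List<String> out = new ArrayList<String>();
    while (chain.incrementToken()) {
      out.add(att.term);
    }
    return out;
  }
}
```

The point is that the consumer drives everything through incrementToken(); an old-style stage participates only through the wrapper, which is exactly the indirection that goes away when the whole chain implements the new API.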

To check for a small speed increase (though I was not able to measure one), I 
changed all tokenizers to use only incrementToken() for the whole chain and 
changed the benchmarker to use this method as well. In this case I was able to 
call TokenStream.setOnlyUseNewAPI(true), which removes the 
backwards-compatibility wrapper and the Token instance, so the chain only uses 
the unwrapped simple attributes. My impression was that tokenization was a 
little faster, faster than without the patch using next(Token). When the old 
API is completely removed, this will be the default behaviour.

So I would suggest reviewing this patch, adding some tests for heterogeneous 
tokenizer chains, removing all next(...) implementations from all streams and 
filters, and implementing only incrementToken(). Contrib analyzers should then 
be rewritten to use only the new API.

The mentioned bugs with Tee/Sink are not related to this issue, but they are 
more serious now, because the tokenizer chain is no longer fixed to one 
specific API variant (both can be mixed together).


> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
>                 Key: LUCENE-1693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1693
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, 
> TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead by default incrementToken() throws a subclass of
>   UnsupportedOperationException. The indexer tries to call
>   incrementToken() initially once to see if the exception is thrown;
>   if so, it falls back to the old API.
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to be cloned. It creates a single
>   linked list that supports deep cloning (in the inner class
>   AttributeSource.State). AttributeSource keeps track of when this
>   state changes, i.e. whenever new attributes are added to the
>   AttributeSource. Only in that case will captureState recompute the
>   state, otherwise it will simply clone the precomputed state and
>   return the clone. restoreState(AttributeSource.State) walks the
>   linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses. 
> The cloning performance can be greatly improved if multiple
> AttributeImpl instances are not used in one TokenStream. A user can,
> e.g., simply add a Token instance to the stream instead of the individual
> attributes, or implement a subclass of AttributeImpl that
> implements exactly the Attribute interfaces needed. I think this
> should be considered an expert API (addAttributeImpl), as this manual
> optimization is only needed if cloning performance is crucial. I ran
> some quick performance tests using Tee/Sink tokenizers (which do
> cloning) and the performance was roughly 20% faster with the new
> API. I'll run some more performance tests and post more numbers then.
> Note also that when we add serialization to the Attributes, e.g. for
> supporting storing serialized TokenStreams in the index, then the
> serialization should benefit even significantly more from the new API
> than cloning. 
> Also, the TokenStream API does not change, except for the removal 
> of the set/getUseNewAPI methods. So the patches in LUCENE-1460
> should still work.
> All core tests pass, however, I need to update all the documentation
> and also add some unit tests for the new AttributeSource
> functionality. So this patch is not ready to commit yet, but I wanted
> to post it already for some feedback. 
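The captureState()/restoreState() design described in the quoted issue can be sketched as follows. The classes below are deliberately minimal stand-ins (a list replaces the real interface->instance map, and each attribute instance is assumed unique), not the actual Lucene implementations:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Minimal sketch of the deep-cloning linked-list state capture described
// in LUCENE-1693; simplified stand-ins, not real Lucene code.
public class StateSketch {

  // Base class for attribute implementations; the real AttributeImpl also
  // provides the reflective default toString() mentioned in the issue.
  abstract static class AttributeImpl implements Cloneable {
    @Override
    public AttributeImpl clone() {
      try {
        return (AttributeImpl) super.clone();
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e);
      }
    }
    abstract void copyTo(AttributeImpl target);
  }

  static class TermAttributeImpl extends AttributeImpl {
    String term = "";
    @Override
    void copyTo(AttributeImpl target) {
      ((TermAttributeImpl) target).term = term;
    }
  }

  // One node per unique AttributeImpl instance; cloning the head node
  // deep-clones the entire captured state, like AttributeSource.State.
  static class State implements Cloneable {
    AttributeImpl attribute;
    State next;
    @Override
    public State clone() {
      State copy = new State();
      copy.attribute = attribute.clone();
      if (next != null) {
        copy.next = next.clone();
      }
      return copy;
    }
  }

  // Very reduced AttributeSource: a list stands in for the real
  // interface->instance map, so only the unique instances are cloned.
  static class AttributeSourceSketch {
    final List<AttributeImpl> attributes = new ArrayList<AttributeImpl>();

    // Snapshot the current attribute values into a deep-cloned list.
    State captureState() {
      State head = null;
      for (int i = attributes.size() - 1; i >= 0; i--) {
        State node = new State();
        node.attribute = attributes.get(i).clone();
        node.next = head;
        head = node;
      }
      return head;
    }

    // Walk the list and copy the captured values back into the live
    // attribute instances via copyTo(), as restoreState does.
    void restoreState(State state) {
      Iterator<AttributeImpl> live = attributes.iterator();
      for (State node = state; node != null; node = node.next) {
        node.attribute.copyTo(live.next());
      }
    }
  }
}
```

This is why fewer unique AttributeImpl instances mean cheaper cloning: the list has one node per instance, and restoring is a single walk with copyTo() per node.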

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
