[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

Uwe Schindler (JIRA) Sun, 19 Jul 2009 14:24:38 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Uwe Schindler updated LUCENE-1693:
----------------------------------

    Attachment: PerfTest3.java
                LUCENE-1693.patch

New and final patch. I will be on holiday from tomorrow and will have limited 
time for Lucene. I will still respond to comments and to reports of major 
faults with the new API.

The new patch has some improvements:
- TeeSinkTokenizer is now also able to feed multiple tees into one sink. You 
first create a TeeSinkTokenizer and retrieve a TeeTokenStream from it 
(newSinkTokenStream()). After that you can add this TeeTokenStream to another 
tee (addSinkTokenStream()). The test (now similar to the old test) and the 
javadocs demonstrate this.
- Reflection performance was greatly improved by using caches. Most of the time 
was spent in AttributeSource.addAttributeImpl() because it iterates through all 
interfaces of the supplied instance. It now caches the interfaces it finds in an 
IdentityHashMap<Class<AttributeImpl>,LinkedList<Class<Attribute>>> keyed by the 
implementation class. The default AttributeFactory also uses a cache 
(IdentityHashMap) for the mapping <Class<Attribute>,Class<AttributeImpl>>, so 
the number of Class.forName() calls is drastically reduced.
- Also fixed a bug in addAttributeImpl after refactoring for the cache.
- TokenStream now has a separate AttributeFactory available that creates a 
TokenWrapper for the 6 default attributes. This makes for a cleaner 
implementation. The extra checks in the default next() implementations were 
removed because of this. Filters now also reuse the tokenWrapper instance 
already resolved by the input stream.
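
To illustrate the interface cache described in the list above, here is a minimal sketch. It is not the code from the patch: Attribute, AttributeImpl, TermAttribute, OffsetAttribute and TokenLike are simplified stand-ins declared locally, and AttributeInterfaceCache, interfacesOf() and reflectionWalks are hypothetical names. The idea shown is only that the class hierarchy is walked once per implementation class and the result is kept in an IdentityHashMap keyed by that class.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the Lucene types (assumption: not the real classes).
interface Attribute {}
interface TermAttribute extends Attribute {}
interface OffsetAttribute extends Attribute {}
abstract class AttributeImpl {}
class TokenLike extends AttributeImpl implements TermAttribute, OffsetAttribute {}

final class AttributeInterfaceCache {
    // Class objects can be compared by identity, so IdentityHashMap avoids
    // calling equals()/hashCode() on the keys.
    private static final Map<Class<? extends AttributeImpl>, List<Class<? extends Attribute>>> CACHE =
        new IdentityHashMap<>();

    // Counts how often the reflective walk actually ran (for demonstration only).
    static int reflectionWalks = 0;

    static synchronized List<Class<? extends Attribute>> interfacesOf(
            Class<? extends AttributeImpl> implClass) {
        List<Class<? extends Attribute>> found = CACHE.get(implClass);
        if (found == null) {
            reflectionWalks++;
            LinkedList<Class<? extends Attribute>> list = new LinkedList<>();
            // Walk up the class hierarchy and collect every directly declared
            // interface that extends Attribute.
            for (Class<?> c = implClass; c != null; c = c.getSuperclass()) {
                for (Class<?> iface : c.getInterfaces()) {
                    if (iface != Attribute.class && Attribute.class.isAssignableFrom(iface)) {
                        Class<? extends Attribute> attIface = iface.asSubclass(Attribute.class);
                        if (!list.contains(attIface)) {
                            list.add(attIface);
                        }
                    }
                }
            }
            found = Collections.unmodifiableList(list);
            CACHE.put(implClass, found);
        }
        return found;
    }
}
```

Calling AttributeInterfaceCache.interfacesOf(TokenLike.class) twice returns the same list instance, and the reflective walk runs only on the first call.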

I did some performance tests with the final implementation, analyzing the lorem 
ipsum text 100000 times, once with new instances for each run and once with 
reused instances, using the old and new API on trunk with the latest patch, on 
current trunk, and on lucene-2.4 (old API only).

The results (these tests are not very representative, given a variance of 
+/- 4 sec per run):

{code}
Testing trunk w/ newest API...
Time for 100000 runs with new instances (old API): 27.344s
Time for 100000 runs with reused stream (old API): 21.828s
Time for 100000 runs with new instances (new API only): 27.297s
Time for 100000 runs with reused stream (new API only): 24.484s
Testing trunk w/o newest API...
Time for 100000 runs with new instances (old API): 22.485s
Time for 100000 runs with reused stream (old API): 19.047s
Time for 100000 runs with new instances (new API only): 26.89s
Time for 100000 runs with reused stream (new API only): 23.719s
Testing 2.4...
Time for 100000 runs with new instances (old API): 18.984s
Time for 100000 runs with reused stream (old API): 18.75s
{code}
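
For readers who want to reproduce this kind of comparison, the two measurement modes ("new instances" versus "reused stream") can be sketched as follows. This is not PerfTest3.java: SimpleTokenizer is a hypothetical whitespace tokenizer standing in for a real analysis chain, and TokenizerBench with its methods is an illustrative name only.

```java
import java.util.Locale;

// Hypothetical whitespace tokenizer, standing in for a real TokenStream.
final class SimpleTokenizer {
    private String text = "";
    private int pos = 0;

    void reset(String newText) {
        text = newText;
        pos = 0;
    }

    /** Returns the next token, or null when the input is exhausted. */
    String next() {
        int n = text.length();
        while (pos < n && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos >= n) return null;
        int start = pos;
        while (pos < n && !Character.isWhitespace(text.charAt(pos))) pos++;
        return text.substring(start, pos);
    }
}

final class TokenizerBench {
    /** Creates a fresh tokenizer for every run. */
    static long countWithNewInstances(String text, int runs) {
        long tokens = 0;
        for (int i = 0; i < runs; i++) {
            SimpleTokenizer t = new SimpleTokenizer();
            t.reset(text);
            while (t.next() != null) tokens++;
        }
        return tokens;
    }

    /** Creates one tokenizer and resets it for every run. */
    static long countWithReusedInstance(String text, int runs) {
        long tokens = 0;
        SimpleTokenizer t = new SimpleTokenizer();
        for (int i = 0; i < runs; i++) {
            t.reset(text);
            while (t.next() != null) tokens++;
        }
        return tokens;
    }

    /** Times both modes once and returns a two-line report. */
    static String report(String text, int runs) {
        long t0 = System.nanoTime();
        long fresh = countWithNewInstances(text, runs);
        long t1 = System.nanoTime();
        long reused = countWithReusedInstance(text, runs);
        long t2 = System.nanoTime();
        return String.format(Locale.ROOT,
            "new instances:   %.3fs (%d tokens)%nreused instance: %.3fs (%d tokens)",
            (t1 - t0) / 1e9, fresh, (t2 - t1) / 1e9, reused);
    }
}
```

A single timed pass like this is subject to JIT warm-up and GC noise, which may be one source of the run-to-run variance mentioned above; repeating the measurement several times and discarding the first runs would give steadier numbers.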

The cost of creating 100000 new instances on my 32-bit Thinkpad T60 is about 5 
sec (no difference between the new API and the old API). The cost is not caused 
by reflection; it is caused by building the LinkedHashMaps for the attributes 
on creation. The current trunk was a little faster because it uses only one 
LinkedHashMap.

One interesting thing: using *only the new API* is a little slower during 
tokenization, apparently because it is faster to use only *one* instance 
(Token) than 6 instances.

The cost of creating new instances is smallest with Lucene 2.4, because no 
attributes are used (in 2.9 it always creates the LinkedHashMaps, even if only 
the old API was used in current trunk).

> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
>                 Key: LUCENE-1693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1693
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, 
> LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, lucene-1693.patch, PerfTest3.java, 
> TestAPIBackwardsCompatibility.java, TestCompatibility.java, 
> TestCompatibility.java, TestCompatibility.java, TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead by default incrementToken() throws a subclass of
>   UnsupportedOperationException. The indexer tries to call
>   incrementToken() initially once to see if the exception is thrown;
>   if so, it falls back to the old API.
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to be cloned. It creates a single
>   linked list that supports deep cloning (in the inner class
>   AttributeSource.State). AttributeSource keeps track of when this
>   state changes, i.e. whenever new attributes are added to the
>   AttributeSource. Only in that case will captureState recompute the
>   state, otherwise it will simply clone the precomputed state and
>   return the clone. restoreState(AttributeSource.State) walks the
>   linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses. 
> The cloning performance can be greatly improved if multiple
> AttributeImpl instances are not used in one TokenStream. A user can
> e.g. simply add a Token instance to the stream instead of the individual
> attributes. Or the user could implement a subclass of AttributeImpl that
> implements exactly the Attribute interfaces needed. I think this
> should be considered an expert API (addAttributeImpl), as this manual
> optimization is only needed if cloning performance is crucial. I ran
> some quick performance tests using Tee/Sink tokenizers (which do
> cloning) and the performance was roughly 20% faster with the new
> API. I'll run some more performance tests and post more numbers then.
> Note also that when we add serialization to the Attributes, e.g. for
> supporting storing serialized TokenStreams in the index, then the
> serialization should benefit significantly more from the new API
> than cloning does. 
> Also, the TokenStream API does not change, except for the removal 
> of the set/getUseNewAPI methods. So the patches in LUCENE-1460
> should still work.
> All core tests pass, however, I need to update all the documentation
> and also add some unit tests for the new AttributeSource
> functionality. So this patch is not ready to commit yet, but I wanted
> to post it already for some feedback. 
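
The captureState/restoreState scheme described in the quoted issue text can be sketched roughly as below. This is a simplified illustration, not the code from the patch: Attr, TermAttr and MiniAttributeSource are hypothetical stand-ins, and restoreState here assumes the saved state came from the same source with an unchanged attribute set, whereas the real method has to match attributes across streams.

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for AttributeImpl.
abstract class Attr implements Cloneable {
    abstract void copyTo(Attr target);

    @Override
    public Attr clone() {
        try {
            return (Attr) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}

final class TermAttr extends Attr {
    String term = "";

    @Override
    void copyTo(Attr target) {
        ((TermAttr) target).term = term;
    }
}

final class MiniAttributeSource {
    /** Node of a singly linked list holding each unique attribute instance once. */
    static final class State {
        Attr attribute;
        State next;

        State deepClone() {
            State copy = new State();
            copy.attribute = attribute.clone();
            if (next != null) copy.next = next.deepClone();
            return copy;
        }
    }

    private final Map<Class<?>, Attr> attributes = new LinkedHashMap<>();
    private State current;  // precomputed list; null means "needs recomputing"
    int rebuilds = 0;       // for demonstration only

    void add(Class<?> key, Attr impl) {
        attributes.put(key, impl);
        current = null;     // the attribute set changed, so invalidate the list
    }

    private State currentState() {
        if (current == null && !attributes.isEmpty()) {
            rebuilds++;
            Map<Attr, Boolean> seen = new IdentityHashMap<>();
            State head = null;
            State tail = null;
            for (Attr a : attributes.values()) {
                // One instance may be registered under several interfaces.
                if (seen.put(a, Boolean.TRUE) != null) continue;
                State node = new State();
                node.attribute = a;
                if (head == null) head = node; else tail.next = node;
                tail = node;
            }
            current = head;
        }
        return current;
    }

    /** Clones the precomputed list; recomputes it only after add(). */
    State captureState() {
        State s = currentState();
        return s == null ? null : s.deepClone();
    }

    /** Copies saved values back into the live attributes, node by node. */
    void restoreState(State saved) {
        State live = currentState();
        for (State s = saved; s != null; s = s.next, live = live.next) {
            s.attribute.copyTo(live.attribute);
        }
    }
}
```

If a single instance is registered under several keys, the list contains it only once, so each capture clones one object instead of several. That is the effect the issue text attributes to adding one Token instance rather than individual attributes.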

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
