[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1693:
----------------------------------

    Attachment: LUCENE-1693.patch

New patch with some more work. First, the fantastic news:

As CachingTokenFilter has no API to access the cached attributes/tokens 
directly, it does not need to be deprecated; it just switched its internal, 
hidden implementation to incrementToken() and attributes. I also added a test 
to the backwards-compatibility test case that checks whether caching also 
works for your strange POSTokens. And it works! You can even mix the 
consumers, e.g. first cache tokens using the new API and then replay them 
using the old API. Really cool. The reason POSToken was not preserved before 
was a bug in TokenWrapper.copyTo(): the method created a new Token and copied 
the contents into it using reinit(). Now it simply creates a clone and lets 
delegate point to it (this is how the caching worked before).
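To illustrate the fix, here is a minimal self-contained sketch (hypothetical class shapes, not the actual Lucene sources): copying fields into a freshly constructed Token via reinit() discards the concrete subclass, while cloning preserves it:

```java
// Minimal stand-in for Token: cloneable, carries a term.
class Token implements Cloneable {
    String term;
    Token(String term) { this.term = term; }
    @Override public Token clone() {
        try { return (Token) super.clone(); }
        catch (CloneNotSupportedException e) { throw new AssertionError(e); }
    }
}

// Hypothetical subclass with an extra field, like the POSToken in the test.
class POSToken extends Token {
    String partOfSpeech;
    POSToken(String term, String pos) { super(term); this.partOfSpeech = pos; }
}

class TokenWrapper {
    Token delegate = new Token("");

    // Old (buggy) behavior: copy fields into a fresh Token --
    // the concrete subclass and its extra attributes are lost.
    void copyToOld(Token source) { delegate = new Token(source.term); }

    // Fixed behavior: clone the source and let delegate point to the clone,
    // so the runtime class (e.g. POSToken) survives.
    void copyToFixed(Token source) { delegate = source.clone(); }
}
```

With copyToFixed, a cached POSToken replays as a POSToken; with copyToOld it degrades to a plain Token.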

In principle, Tee/SinkTokenizer could also work like this; the only problem 
with this class is that it has a public API that exposes the Token instances 
to the outside. Because of that, there is no way around deprecating it.

Your new TeeSinkTokenFilter looks good; it had only one problem: it used 
addAttributeImpl to add the tee's attribute instances to the newly created 
sink. Because of this, the sink got the same instances as the parent. With 
useOnlyNewAPI this has no effect for the standard attributes, as the ctor 
already created a Token instance as implementation and added it to the 
stream, so addAttributeImpl was a no-op.
I changed this to use getAttributeClassesIterator and to add a new attribute 
instance for each attribute class to the sink via addAttribute. As the 
factory is the same, the attributes are generated in the same way. 
TeeSinkTokenFilter would only *not* work correctly if somebody adds a custom 
instance using addAttributeImpl in the ctor of another filter in the chain; 
in that case the factory would create a different impl and restoreState 
would throw an IAE. In backwards-compatibility mode (the default), the newly 
created sink and also the tee always have the default TokenWrapper 
implementation, so state restoring works there as well. You only get a 
problem if you change useOnlyNewAPI in between (which would always create 
corrupt chains).
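The difference between sharing impl instances and creating fresh ones per sink can be sketched with a toy attribute source (hypothetical classes, not Lucene's real API): as long as tee and sink share the same factory, iterating the tee's attribute classes and calling addAttribute on the sink yields distinct instances of the same impl classes, which is what restoreState needs:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Stand-in attribute implementation.
class TermAttr { String term = ""; }

class MiniAttributeSource {
    final Map<Class<?>, Object> attrs = new LinkedHashMap<>();
    final Map<Class<?>, Supplier<?>> factory;  // shared between tee and sink

    MiniAttributeSource(Map<Class<?>, Supplier<?>> factory) { this.factory = factory; }

    // addAttributeImpl-style: registers the caller's instance directly (shared!).
    void addAttributeImpl(Class<?> iface, Object impl) { attrs.put(iface, impl); }

    // addAttribute-style: the shared factory creates a fresh instance per source.
    @SuppressWarnings("unchecked")
    <T> T addAttribute(Class<T> iface) {
        return (T) attrs.computeIfAbsent(iface, c -> factory.get(c).get());
    }

    Iterable<Class<?>> getAttributeClasses() { return attrs.keySet(); }
}
```

Cloning all attribute impls instead (the alternative mentioned below) would bypass the factory but also produce distinct instances.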

Another idea would be to clone all attribute impls and add the clones to the 
sink; the factory would then not be used, though?

I started to create a test for the new TeeSinkTokenFilter, but one thing is 
missing: the original test created a subclass of SinkTokenizer, overriding 
add() to filter the tokens added to the sink. This functionality is missing 
from the new API. The correct workaround would be to plug a filter around 
the sink and filter the tokens there? The problem then is that the cache 
always also contains tokens that are not needed (the old impl would not 
store them in the sink).

Maybe we add the filter to TeeSinkTokenFilter itself (it would get a State, 
which would not work, as the contents of the State class are pkg-private)? 
Some other way? Or leave it as it is and let the user plug the filter on 
top of the sink (I prefer this)?
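A rough sketch of that preferred workaround (names are illustrative, not Lucene's API): instead of overriding add() on the sink, a filter wrapping the sink simply skips tokens a predicate rejects:

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Toy token stream: next() returns the next token, or null when exhausted.
interface MiniStream { String next(); }

class ListStream implements MiniStream {
    final Iterator<String> it;
    ListStream(List<String> tokens) { this.it = tokens.iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
}

// Filter plugged on top of the sink: passes through only wanted tokens.
class PredicateFilter implements MiniStream {
    final MiniStream in;
    final Predicate<String> keep;
    PredicateFilter(MiniStream in, Predicate<String> keep) {
        this.in = in; this.keep = keep;
    }
    public String next() {
        for (String t = in.next(); t != null; t = in.next())
            if (keep.test(t)) return t;  // skip rejected tokens silently
        return null;
    }
}
```

The trade-off noted above still holds: the sink's cache retains the unwanted tokens, and they are only dropped on replay.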

> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
>                 Key: LUCENE-1693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1693
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
> LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, 
> TestAPIBackwardsCompatibility.java, TestCompatibility.java, 
> TestCompatibility.java, TestCompatibility.java, TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead by default incrementToken() throws a subclass of
>   UnsupportedOperationException. The indexer tries to call
>   incrementToken() initially once to see if the exception is thrown;
>   if so, it falls back to the old API.
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to be cloned. It creates a single
>   linked list that supports deep cloning (in the inner class
>   AttributeSource.State). AttributeSource keeps track of when this
>   state changes, i.e. whenever new attributes are added to the
>   AttributeSource. Only in that case will captureState recompute the
>   state, otherwise it will simply clone the precomputed state and
>   return the clone. restoreState(AttributeSource.State) walks the
>   linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses. 
> Cloning performance can be greatly improved if multiple AttributeImpl
> instances are avoided in one TokenStream. A user can e.g. simply add a
> Token instance to the stream instead of the individual attributes, or
> implement a subclass of AttributeImpl that implements exactly the
> Attribute interfaces needed. I think addAttributeImpl should be
> considered an expert API, as this manual optimization is only needed if
> cloning performance is crucial. I ran some quick performance tests
> using Tee/Sink tokenizers (which do cloning) and the performance was
> roughly 20% faster with the new API. I'll run some more performance
> tests and post more numbers then. Note also that when we add
> serialization to the Attributes, e.g. for supporting storing serialized
> TokenStreams in the index, serialization should benefit even more
> significantly from the new API than cloning does.
> Also, the TokenStream API does not change, except for the removal 
> of the set/getUseNewAPI methods. So the patches in LUCENE-1460
> should still work.
> All core tests pass, however, I need to update all the documentation
> and also add some unit tests for the new AttributeSource
> functionality. So this patch is not ready to commit yet, but I wanted
> to post it already for some feedback. 
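The deep-cloning State linked list described in the quoted summary might look roughly like this (a sketch with invented names and field shapes, not the actual patch): captureState clones the precomputed chain, and mutating the live attributes afterwards does not affect the captured copy:

```java
// Stand-in for an AttributeImpl: cloneable, with copyTo() for restoreState.
class Attr implements Cloneable {
    int value;
    Attr(int value) { this.value = value; }
    @Override public Attr clone() {
        try { return (Attr) super.clone(); }
        catch (CloneNotSupportedException e) { throw new AssertionError(e); }
    }
    void copyTo(Attr target) { target.value = value; }
}

// Singly linked list of unique attribute impls, supporting deep cloning.
class State implements Cloneable {
    Attr attribute;
    State next;
    @Override public State clone() {
        State s = new State();
        s.attribute = attribute.clone();          // deep-clone each attribute
        if (next != null) s.next = next.clone();  // ...and the rest of the chain
        return s;
    }
}
```

restoreState would then walk the captured chain and call copyTo() on each node to push the values back into the target stream's attributes.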

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
