[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735124#action_12735124
 ] 

Robert Muir edited comment on LUCENE-1693 at 7/24/09 12:08 PM:
---

{quote}
I completely agree. We should discourage users to reference Token and rather 
use the Attribute interfaces. That's the whole beauty and flexibility about 
this new API.
{quote}

Has there been any thought about reconsidering the new API's experimental 
status, then?
I don't think the WARNING: encourages users to use these interfaces!

Or maybe a compromise: modify the javadocs to be a little less scary. 
Does this text have to be FF (red)?


  was (Author: rcmuir):
{quote}
I completely agree. We should discourage users to reference Token and rather 
use the Attribute interfaces. That's the whole beauty and flexibility about 
this new API.
{quote}

Has there been any thought into reconsidering the new API's experimental 
status then?
I don't think the WARNING: encourages users to use these interfaces!

  
 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, lucene-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, 
 PerfTest3.java, TestAPIBackwardsCompatibility.java, TestCompatibility.java, 
 TestCompatibility.java, TestCompatibility.java, TestCompatibility.java


 This patch makes the following improvements to AttributeSource and
 TokenStream/Filter:
 - introduces interfaces for all Attributes. The corresponding
   implementations have the postfix 'Impl', e.g. TermAttribute and
   TermAttributeImpl. AttributeSource now has a factory for creating
   the Attribute instances; the default implementation looks for
   implementing classes with the postfix 'Impl'. Token now implements
   all 6 TokenAttribute interfaces.
 - new method added to AttributeSource:
   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
   class hierarchy of the passed in object and finds all interfaces
   that the class or superclasses implement and that extend the
   Attribute interface. It then adds the interface-instance mappings
   to the attribute map for each of the found interfaces.
 - removes the set/getUseNewAPI() methods (including the standard
   ones). Instead it is now enough to only implement the new API,
   if one old TokenStream implements still the old API (next()/next(Token)),
   it is wrapped automatically. The delegation path is determined via
   reflection (the patch determines, which of the three methods was
   overridden).
 - Token is no longer deprecated, instead it implements all 6 standard
   token interfaces (see above). The wrapper for next() and next(Token)
   uses this, to automatically map all attribute interfaces to one
   TokenWrapper instance (implementing all 6 interfaces), that contains
   a Token instance. next() and next(Token) exchange the inner Token
   instance as needed. For the new incrementToken(), only one
   TokenWrapper instance is visible, delegating to the currect reusable
   Token. This API also preserves custom Token subclasses, that maybe
   created by very special token streams (see example in Backwards-Test).
 - AttributeImpl now has a default implementation of toString that uses
   reflection to print out the values of the attributes in a default
   formatting. This makes it a bit easier to implement AttributeImpl,
   because toString() was declared abstract before.
 - Cloning is now done much more efficiently in
   captureState. The method figures out which unique AttributeImpl
   instances are contained as values in the attributes map, because
   those are the ones that need to be cloned. It creates a single
   linked list that supports deep cloning (in the inner class
   AttributeSource.State). AttributeSource keeps track of when this
   state changes, i.e. whenever new attributes are added to the
   AttributeSource. Only in that case will captureState recompute the
   state, otherwise it will simply clone the precomputed state and
   return the clone. restoreState(AttributeSource.State) walks the
   linked list and uses the copyTo() method of AttributeImpl to copy
   all values over into the attribute that the source stream
   (e.g. SinkTokenizer) uses. 
 - Tee- and SinkTokenizer were deprecated, because they use
   Token instances for caching. This is not compatible with the new API
   using AttributeSource.State objects. You can still use the old
   deprecated ones, but new features provided by new Attribute types
   may get lost in the chain. A replacement is a new TeeSinkTokenFilter,
   which has a factory to create new Sink instances that have compatible
   attributes.
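The addAttributeImpl reflection walk described above can be sketched in plain Java. This is a minimal, self-contained model under stated assumptions: the Attribute marker and the TermAttribute/OffsetAttribute names are stand-ins, not Lucene's real classes.

```java
import java.util.*;

public class AttributeWalkDemo {
    // Marker interface standing in for org.apache.lucene.util.Attribute.
    interface Attribute {}

    // Hypothetical attribute interfaces and one impl implementing both.
    interface TermAttribute extends Attribute {}
    interface OffsetAttribute extends Attribute {}
    static class TokenLikeImpl implements TermAttribute, OffsetAttribute {}

    /** Walks up the class hierarchy of the passed-in class and collects every
     *  interface that extends Attribute, as addAttributeImpl is described to do. */
    static Set<Class<?>> findAttributeInterfaces(Class<?> clazz) {
        Set<Class<?>> found = new LinkedHashSet<>();
        for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                collect(iface, found);
            }
        }
        return found;
    }

    // An interface may itself extend other Attribute interfaces; recurse.
    private static void collect(Class<?> iface, Set<Class<?>> found) {
        if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
            found.add(iface);
        }
        for (Class<?> superIface : iface.getInterfaces()) {
            collect(superIface, found);
        }
    }

    public static void main(String[] args) {
        Set<Class<?>> ifaces = findAttributeInterfaces(TokenLikeImpl.class);
        System.out.println(ifaces.size()); // 2: TermAttribute and OffsetAttribute
    }
}
```

In the real API, each found interface would then be mapped to the single passed-in instance in the attribute map.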

[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-24 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735210#action_12735210
 ] 

Mark Miller edited comment on LUCENE-1693 at 7/24/09 4:59 PM:
--

Not sure which issue it stems from, but Token has a bunch of constructors that 
are deprecated but don't point you to a replacement.

edit

It must have come from the setBuffer stuff.

  was (Author: markrmil...@gmail.com):
Not sure what issue it stems from, but Token has a bunch of constructors 
that are deprecated, but that don't point you to something new.
  

[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733087#action_12733087
 ] 

Uwe Schindler edited comment on LUCENE-1693 at 7/19/09 4:05 PM:


I forgot the CHANGES.txt entries with the backwards-break note and other changes.

When starting to transform the contrib streams, we should not forget to do the 
following (maybe add this note to the JIRA issues):
- rewrite and replace the next(Token)/next() implementations with the new API
- if the class is final, no next(Token)/next() methods are needed *(they must be 
removed!!!)*
- if the class is non-final, add the following methods to the class:
{code}
/** @deprecated Will be removed in Lucene 3.0. This method is final, as it
 * should not be overridden. Delegates to the backwards compatibility layer. */
public final Token next(final Token reusableToken) throws java.io.IOException {
  return super.next(reusableToken);
}

/** @deprecated Will be removed in Lucene 3.0. This method is final, as it
 * should not be overridden. Delegates to the backwards compatibility layer. */
public final Token next() throws java.io.IOException {
  return super.next();
}
{code}
Also the incrementToken() method must be final in this case.
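The checklist above can be illustrated with a self-contained sketch of the pattern. The tiny TokenStream base and the class names here are stand-ins for Lucene's real API, not the real thing: the new-API incrementToken() carries the logic and is final, while the old-API next() is final and only delegates to the superclass bridge.

```java
import java.util.*;

public class ConvertedFilterDemo {
    // Tiny stand-ins for Lucene's classes; names and the bridge are assumptions.
    static class Token {
        final String term;
        Token(String term) { this.term = term; }
    }

    static abstract class TokenStream {
        /** New API: advance to the next token, return false at the end. */
        public abstract boolean incrementToken();

        /** Old API: here the base class bridges it to incrementToken(),
         *  standing in for Lucene's backwards compatibility layer. */
        public Token next() {
            return incrementToken() ? new Token(currentTerm()) : null;
        }

        abstract String currentTerm();
    }

    /** A converted non-final stream following the checklist: the logic lives in
     *  a final incrementToken(), and the old next() is final and delegates. */
    static class UpperCaseStream extends TokenStream {
        private final Iterator<String> input;
        private String current;

        UpperCaseStream(List<String> words) { this.input = words.iterator(); }

        @Override
        public final boolean incrementToken() {
            if (!input.hasNext()) return false;
            current = input.next().toUpperCase(Locale.ROOT);
            return true;
        }

        /** @deprecated final so subclasses cannot override the old API;
         *  delegates to the compatibility bridge in the superclass. */
        @Override
        public final Token next() { return super.next(); }

        @Override
        String currentTerm() { return current; }
    }

    public static void main(String[] args) {
        UpperCaseStream s = new UpperCaseStream(Arrays.asList("foo", "bar"));
        for (Token t = s.next(); t != null; t = s.next()) {
            System.out.println(t.term); // FOO, then BAR
        }
    }
}
```

Because both old-API methods and incrementToken() are final, a subclass can neither bypass the compatibility layer nor half-override the chain.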

  was (Author: thetaphi):
Forgot CHANGES.txt entries with the backwards-break note and other changes.

When starting to transform the contrib streams, we should not forget to do the 
following (maybe add this note to the JIRA issues):
- rewrite and replace next(Token)/next() implementations by new API
- if the class is final, no next(Token)/next() methods needed *(must be 
removed!!!)*
- if the class is non-final add the following methods to the class:
{code}
/** @deprecated Will be removed in Lucene 3.0. This method is final, as it 
should
 * not be overridden. Delegates to the backwards compatibility layer. */
public final Token next(final Token reusableToken) throws java.io.IOException {
  return super.next(reusableToken);
}

/** @deprecated Will be removed in Lucene 3.0. This method is final, as it 
should
 * not be overridden. Delegates to the backwards compatibility layer. */
public final Token next() throws java.io.IOException {
  return super.next();
}
{code}
  

[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-18 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732878#action_12732878
 ] 

Uwe Schindler edited comment on LUCENE-1693 at 7/18/09 4:06 AM:


In this case we should rename TeeSink to something like SplitTokenStream (which 
does not extend TokenStream). One could then get any number of sinks from it:

{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}

In this case the caching would be done directly in the splitter, and the sinks 
are only consumers. The first sink that asks for the attribute states forces the 
splitter to harvest and cache the input stream (exactly as CachingTokenFilter 
does). In principle it would be the same as a CachingTokenFilter.

But on the other hand: you can always create a CachingTokenFilter and reuse the 
same instance for different fields. Because the indexer always calls reset() 
before consuming, you could easily re-read it. Any additional filters could then 
be plugged in front for each field. In this case the order is not important:

{code}
TokenStream stream = new CachingTokenFilter(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}

This would not work if the indexer consumed the different fields in parallel. 
But in its current state it would not even work with Tee/Sink (not 
multithread-compatible).
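The reuse idea can be modeled with a minimal, hypothetical stand-in for CachingTokenFilter in plain Java (not Lucene's real class): cache the input on the first pass, then replay the cache after each reset(), exactly the contract the comment relies on.

```java
import java.util.*;

public class CachingReuseDemo {
    // Hypothetical stand-in for CachingTokenFilter: harvests the input on the
    // first pass, and after reset() replays the cache for the next consumer.
    static class CachingFilter {
        private final Iterator<String> input;
        private final List<String> cache = new ArrayList<>();
        private int pos = 0;
        private boolean cached = false;

        CachingFilter(Iterator<String> input) { this.input = input; }

        /** The indexer calls reset() before consuming a field again. */
        void reset() { pos = 0; }

        String next() {
            if (!cached) {          // first pass: harvest the whole input stream
                while (input.hasNext()) cache.add(input.next());
                cached = true;
            }
            return pos < cache.size() ? cache.get(pos++) : null;
        }
    }

    public static void main(String[] args) {
        CachingFilter f = new CachingFilter(Arrays.asList("x", "y").iterator());
        List<String> field1 = new ArrayList<>(), field2 = new ArrayList<>();
        for (String t = f.next(); t != null; t = f.next()) field1.add(t);
        f.reset();                  // rewind, as the indexer would before field 2
        for (String t = f.next(); t != null; t = f.next()) field2.add(t);
        System.out.println(field1.equals(field2)); // true
    }
}
```

As the comment notes, this sequential replay is exactly what breaks down if fields were consumed in parallel: pos is shared, unsynchronized state.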

  was (Author: thetaphi):
In this case we should rename TeeSink to something like SplitTokenStream 
(which does not extend TokenStream). One could get then any number of sinks 
from it:

{code}
SplitTokenFilter splitter=new SplitTokenStream(stream); // does not extend 
TokenStream!!!
TokenStream stream1=splitter.newSinkTokenStream();
TokenStream stream2=splitter.newSinkTokenStream();
...
{code}

In this case the caching would be done directly in the splitter and the sinks 
are only consumers. The first sink that calls to get the attribute states 
forces the splitter to harvest and cache the input stream (exactly like 
CachingTokenStream does it). In principle it would be the same like a 
CachingTokenStream.

But on the other hand: You can always create a CachingTokenStream and reuse the 
same instance for different fields. Because the indexer always calls reset() 
before consuming, you could re-read it easily. Any additional filters could 
then plugged in front for each field. In this case the order is not important:

{code}
TokenStream stream=new CachingTokenStream(input);
doc.add(new Field(xyz, new DoSomethingTokenFilter(stream);
doc.add(new Field(abc, new DoSometingOtherTokenFilter(stream);
...
{code}

This would not work, if the indexer can consume the different fields in 
parallel. But with the current state it would even not work with Tee/Sink (not 
multithread compatible).
  

[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-17 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732407#action_12732407
 ] 

Michael Busch edited comment on LUCENE-1693 at 7/17/09 1:37 AM:


{quote}
As CachingTokenFilter has no API to access the cached attributes/tokens 
directly,
{quote}

Oh true :) well, scratch it from the TODO list...

We knew it'd work conceptually the same for AttributeSource.State; unlike 
Tee/Sink, which wouldn't even be safe to use with the new API if it didn't have 
the getTokens() method, for the reasons I explained above.

{quote}
Another idea would be to clone all attribute impls and then add them to the 
sink - the factory would then not be used?
{quote}

Yes, I thought about this for a while. It would be nice to have this (i.e. 
cloning an AttributeSource) in general: you could reduce the cost of 
initializing the TokenStreams with onlyUseNewAPI=false. We just need to keep a 
static AttributeSource around that contains the wrapper and the mappings for the 
6 default interfaces. Then, instead of constructing it every time, we just clone 
that AttributeSource for new TokenStreams.

The query parser could do the same to keep the initialization cost of the 
TokenStreams minimal, because it always needs the same attributes.

I think it should be easy; we just need to implement clone() for 
AttributeSource. 
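The prototype-cloning idea can be sketched in plain Java. AttrSource here is a hypothetical stand-in, not Lucene's AttributeSource: keep one preconfigured instance around and deep-clone it for each new stream instead of rebuilding the interface-to-implementation map every time.

```java
import java.util.*;

public class PrototypeCloneDemo {
    // Hypothetical attribute holder; the map mimics the interface->impl
    // mappings that would otherwise be reconstructed per TokenStream.
    static class AttrSource implements Cloneable {
        final Map<String, int[]> attrs = new LinkedHashMap<>();

        @Override
        public AttrSource clone() {
            AttrSource copy = new AttrSource();
            // Deep-copy each attribute value so streams do not share state.
            for (Map.Entry<String, int[]> e : attrs.entrySet()) {
                copy.attrs.put(e.getKey(), e.getValue().clone());
            }
            return copy;
        }
    }

    public static void main(String[] args) {
        // Built once (e.g. statically), then cloned per stream.
        AttrSource prototype = new AttrSource();
        prototype.attrs.put("offset", new int[] {0, 0});

        AttrSource perStream = prototype.clone();
        perStream.attrs.get("offset")[1] = 42;           // mutate the clone only
        System.out.println(prototype.attrs.get("offset")[1]); // still 0
    }
}
```

The deep copy is the important part: a shallow clone would let every stream created from the prototype share the same mutable attribute instances.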

  was (Author: michaelbusch):
{quote}
As CachingTokenFilter has no API to access the cached attributes/tokens 
directly,
{quote}

Oh true :) well, scratch it from the TODO list...

We knew it'd work conceptually the same for AttributeSource.State; unlike 
Tee/Sink, which wouldn't even be save to use with the new API if it hadn't the 
getTokens() method for the reasons I explained above.

{quote}
Another idea would be to clone all attribute impls and then add them to the 
sink - the factory would then not be used?
{quote}

Yes, I thought about this for a while. It would be nice to have this (i. e. 
cloning an AttributeSource) in general: you could reduce the costs for 
initializing the TokenStreams with onlyUseNewAPI=false. We just need to keep a 
static AttributeSource around, that contains the wrapper and the mappings from 
the 6 default interfaces. Then instead of constructing it every time we just 
clone that AttributeSource for new TokenStreams.

The query parser could do the same to keep initialization costs of the 
TokenStreams minimal, because it always needs the same attributes.

I think it should be easy? We just need to implement clone() for 
AttributeSource. 
  

[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732762#action_12732762
 ] 

Uwe Schindler edited comment on LUCENE-1693 at 7/17/09 3:31 PM:


In my opinion, this test that I was not able to reimplement using the new 
TeeSink is very theoretical. Someone who consumes two streams could always 
implement this very simply using one of two approaches:

- consume the first stream completely, then the other
- consume one token from the first, then one from the second, and so on

Who originally invented the whole Tee/Sink, and for what? Maybe we should simply 
ask on java-user who has ever used it. I had never heard of Tee/Sink before 
Michael confronted me with his strange POSToken tests :-)

  was (Author: thetaphi):
In my opinion, this test that I was not able to reimplement using the new 
TeeSink is very theoretical. In my opinion someone who consumes two streams 
could always implement this very simple using the two aproaches:
- consume the first stream completely, the  the other
- consume one token from the first, then one token from the next and so on
Who invented the whole Tee/Sink originally for what? Maybe we should simply ask 
on java-user who have ever used it. I have never heard of Tee/Sink before 
Michael confronted me with his strange POSToken tests :-)
  

[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731295#action_12731295
 ] 

Uwe Schindler edited comment on LUCENE-1693 at 7/14/09 11:12 PM:
-

bq. However, if people are using next() in a chain with core streams/filters, 
then every single token will now be cloned, possibly multiple times, right?

There was a similar problem in 2.4: whenever you call next() to consume a 
stream, every token is a new instance. Only the cloning WAS not needed then, but 
it is now needed for backwards compatibility (a full private copy).

In my opinion this is negligible, as indexing speed is what matters, and the 
indexer always uses incrementToken(). If somebody has written their own query 
parser that uses next() to consume the stream, speed is not really important. 
And next() is deprecated; the docs (even in 2.4) say: but will be slower than 
calling next(Token).

bq. So if people are using streams/filters that implement next(Token) I think 
the performance should be comparable - even though there's also a (hopefully 
small) performance hit to expect because of more method calls and if checks.

I found no performance hit; it is about the same speed. The variance between 
test runs is bigger than any measurable performance impact. The other sensitive 
thing (TokenWrapper): the wrapping using TokenWrapper was in the original 
indexing code, too (the private BackwardsCompatibilityStream class).

bq. It's failing right now, because it's testing a lot of different 
combinations. I need to check if all of those different tests are actually 
valid, because we're saying you can't use Tee/Sink with the new API anymore. 

Have you seen my backwards compatibility test, too? It is a copy of yours (with 
some variation). The Lucene24* classes were removed because Tee/SinkTokenizer 
were reverted to their original 2.4 status in the patch (they only implement the 
old API).
As far as I can see (not yet tried out), you try to test new-style-API streams 
with the old Tee/Sink tokenizer, which is deprecated. You were not able to do 
this before 2.9 (no new API), so the backwards compatibility problem does not 
exist there. If you rewrite your streams with the new API, you should use 
TeeSinkTokenFilter, too.

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, 
 TestAPIBackwardsCompatibility.java, 

[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723727#action_12723727
 ] 

Uwe Schindler edited comment on LUCENE-1693 at 6/24/09 1:46 PM:


In your example, you mean SinkTokenizer is from the core and only implements 
incrementToken(), and the others are handled by the defaults in TokenStream 
(using the reflection-based redirect)?

I tested this with overriding LowerCaseFilter. In my latest patch 
LowerCaseFilter only implements incrementToken(), next() and next(Token) are 
handled by the forwarder in TokenStream. If you override next(Token) and 
consume this stream using incrementToken(), then your overriden next(Token) is 
never called.

This exact same backwards-compatibility problem also existed when we changed 
from next() to next(Token). If some core filter only implemented next(Token) 
and a subclass overrode only next() (because of an older code base) and called 
super() there, the same thing happened: the tokenizer code called next(Token), 
but next() was never called. You have the same problem with our recent 
deprecations in DocIdSetIterator.

I would not see this as a problem; it can be found everywhere in Lucene where 
we deprecated (originally abstract) methods and replaced them by newer ones 
that call the old ones by default.
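A minimal self-contained illustration of this pitfall (not Lucene code; the class names are made up): as long as the base class forwards the new method to the old one, an override of the old method still works. But once a core class implements the new method directly, a user subclass that overrides only the old method is silently bypassed by new-API callers.

```java
class Base {
    // new API; the default forwards to the deprecated old API
    public String newMethod() { return oldMethod(); }
    // deprecated old API
    public String oldMethod() { return "base-old"; }
}

class CoreFilter extends Base {
    // the core class moved on: it only implements the new method
    @Override public String newMethod() { return "core-new"; }
}

class UserSubclass extends CoreFilter {
    // the user still overrides only the old method...
    @Override public String oldMethod() { return "user-old"; }
}

public class OverridePitfall {
    public static void main(String[] args) {
        // a new-API consumer never reaches the user's override:
        String got = new UserSubclass().newMethod();
        if (!got.equals("core-new")) throw new AssertionError(got);
        System.out.println("user override skipped, got: " + got);
    }
}
```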



[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-24 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723759#action_12723759
 ] 

Michael Busch edited comment on LUCENE-1693 at 6/24/09 2:50 PM:


{quote}
By the way: I want to have my beer!!!
{quote}
I decided a while ago that you'll definitely get it, because of your admirable 
tenaciousness. :)


 This patch makes the following improvements to AttributeSource and
 TokenStream/Filter:
 - removes the set/getUseNewAPI() methods (including the standard
   ones). Instead by default incrementToken() throws a subclass of
   UnsupportedOperationException. The indexer tries to call
   incrementToken() initially once to see if the exception is thrown;
   if so, it falls back to the old API.
 - introduces interfaces for all Attributes. The corresponding
   implementations have the postfix 'Impl', e.g. TermAttribute and
   TermAttributeImpl. AttributeSource now has a factory for creating
   the Attribute instances; the default implementation looks for
   implementing classes with the postfix 'Impl'. Token now implements
   all 6 TokenAttribute interfaces.
 - new method added to AttributeSource:
   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
   class hierarchy of the passed in object and finds all interfaces
   that the class or superclasses implement and that extend the
   Attribute interface. It then adds the interface-instance mappings
   to the attribute map for each of the found interfaces.
 - AttributeImpl now has a default implementation of toString that uses
   reflection to print out the values of the attributes in a default
   formatting. This makes it a bit easier to implement AttributeImpl,
   because toString() was declared abstract before.
 - Cloning is now done much more efficiently in
   captureState. The method figures out which unique AttributeImpl
   instances are contained as values in the attributes map, because
   those are the ones that need to be cloned. It creates a single
   linked list that supports deep cloning (in the inner class
   AttributeSource.State). AttributeSource keeps track of when this
   state changes, i.e. whenever new attributes are added to the
   AttributeSource. Only in that case will captureState recompute the
   state, otherwise it will simply clone the precomputed state and
   return the clone. restoreState(AttributeSource.State) walks the
   linked list and uses the copyTo() method of AttributeImpl to copy
   all values over into the attribute that the source stream
   (e.g. SinkTokenizer) uses. 
 The cloning performance can be greatly improved if multiple
 AttributeImpl instances are not used in one TokenStream. A user can
 e.g. simply add a Token instance to the stream instead of the individual
 attributes. Or the user could implement a subclass of AttributeImpl that
 implements exactly the Attribute interfaces needed. I think this
 should be considered an expert API (addAttributeImpl), as this manual
 optimization is only needed if cloning performance is crucial. I ran
 some quick performance tests using Tee/Sink tokenizers (which do
 cloning) and the performance was roughly 20% faster with the new
 API. I'll run some more performance tests and post more numbers then.
 Note also that when we add serialization to the Attributes, e.g. for
 supporting storing serialized TokenStreams in the index, then the
 serialization should benefit even significantly more from the new API
 than cloning. 
 Also, the TokenStream API does not change, except for the removal 
 of the set/getUseNewAPI methods. So the patches in LUCENE-1460
 should still work.
 All core tests pass, however, I need to update all the documentation
 and also add some unit tests for the new AttributeSource
 functionality. So this patch is not ready to commit yet, but I wanted
 to post it already for some feedback. 
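The captureState/restoreState mechanism described above can be sketched as follows. This is a simplified, self-contained model with made-up names (the real AttributeSource additionally caches the computed State and invalidates it when attributes are added):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for an attribute implementation.
abstract class Attr implements Cloneable {
    abstract void copyTo(Attr target);
    @Override public Attr clone() {
        try { return (Attr) super.clone(); }
        catch (CloneNotSupportedException e) { throw new AssertionError(e); }
    }
}

class TermAttr extends Attr {
    String term = "";
    @Override void copyTo(Attr target) { ((TermAttr) target).term = term; }
}

// Captured state: a singly linked list of cloned attribute impls.
class State {
    Attr attribute;
    State next;
}

class Source {
    final List<Attr> attrs = new ArrayList<>();

    State captureState() {           // clone each unique impl into a chain
        State head = null;
        for (int i = attrs.size() - 1; i >= 0; i--) {
            State s = new State();
            s.attribute = attrs.get(i).clone();
            s.next = head;
            head = s;
        }
        return head;
    }

    void restoreState(State state) { // copy captured values back into live impls
        int i = 0;
        for (State s = state; s != null; s = s.next)
            s.attribute.copyTo(attrs.get(i++));
    }
}

public class StateSketch {
    public static void main(String[] args) {
        Source src = new Source();
        TermAttr term = new TermAttr();
        src.attrs.add(term);

        term.term = "foo";
        State snapshot = src.captureState();
        term.term = "bar";               // the live instance keeps mutating
        src.restoreState(snapshot);      // the snapshot is unaffected
        if (!term.term.equals("foo")) throw new AssertionError(term.term);
        System.out.println("restored: " + term.term);
    }
}
```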

[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720534#action_12720534
 ] 

Uwe Schindler edited comment on LUCENE-1693 at 6/17/09 12:39 AM:
-

Hi Michael,
in principle your test is invalid. It has other token filters in the chain, 
over which the user has no control. With the two mentioned filters it may work, 
because they do not change the reusableToken instance. But the API clearly 
states that the reusableToken need not be used and that another instance can be 
returned.
So this is really unsupported behaviour. If you remove the filters in between, 
it would work correctly. And this could even fail with 2.4 if you put other 
token filters into your chain.

In my opinion, the advantages of token reuse clearly outweigh the small 
problems with (unsupported) usage. The API does exactly what is mentioned in 
the API docs for 2.4.1.

The main advantage is that you can mix old and new filter instances and you 
lose nothing...
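A self-contained sketch of the reuse contract being described (illustrative names, not Lucene's): a producer may ignore the passed-in reusable token and return a fresh instance, so a correct consumer must always read the returned token, never the one it passed in.

```java
// Hypothetical token and producer, illustrating the next(Token) contract.
class Tok {
    String term;
    Tok(String term) { this.term = term; }
}

class Producer {
    private final String[] terms = {"a", "b", "c"};
    private int pos = 0;

    // May reuse the supplied token OR return a fresh instance -- the
    // caller must not assume which.
    Tok next(Tok reusable) {
        if (pos >= terms.length) return null;
        String t = terms[pos++];
        if (pos % 2 == 0) return new Tok(t);  // sometimes a new instance
        reusable.term = t;
        return reusable;
    }
}

public class ReuseContract {
    public static void main(String[] args) {
        Producer p = new Producer();
        Tok reusable = new Tok("");
        StringBuilder seen = new StringBuilder();
        // always read the RETURNED token; reading `reusable` instead
        // would miss the tokens delivered in a fresh instance
        for (Tok t = p.next(reusable); t != null; t = p.next(reusable))
            seen.append(t.term);
        if (!seen.toString().equals("abc")) throw new AssertionError(seen);
        System.out.println("consumed: " + seen);
    }
}
```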


[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720052#action_12720052
 ] 

Uwe Schindler edited comment on LUCENE-1693 at 6/16/09 3:43 AM:


{quote}
bq. What was your conclusion about my idea yesterday, to pass the Token around 
even with the old API and copy it on demand

I don't think the indexer should know at all about Token? It should only use 
the interfaces so that we can maintain the full flexibility.
Also I don't really like the fact very much that some user might get a 
performance hit. I had the idea to throw the exception in incrementToken() to 
automatically being able to fallback to the old API. I think this is nice and 
gets rid of the explicit useNewAPI methods. The only drawback is still the fact 
that we have to implement both old and new APIs in Lucene's tokenizers and 
filters until we remove the old API. Grant won't like this  but I think this is 
better than the possible performance hit? Also we don't add new filters THAT 
often to Lucene and implementing both methods is often mostly copypaste.
{quote}

I did not mean to use Token in the indexer; I would like to remove the whole 
old API from the indexer (even the UOE). My idea would be the following:
- Let Token implement the default attribute interfaces; the default factory 
creates it for all requests for the default attributes.
- If somebody implements the new API, the indexer can use it without problems. 
If they don't, the default impl in TokenStream would call next(Token), using 
the Token instance from the AttributeSource. If the method returns a different 
Token instance (because it did not reuse, which I think is seldom), copy the 
returned token into the per-instance AttributeSource Token.
- The other way round: if one uses a TokenStream via the old API (own indexer, 
query parser, ...) and the TokenStream only implements the new API, the 
deprecated old method would also have a default impl, ignoring the token 
supplied to next() and always returning the instance token after calling 
incrementToken().

Because of this, the indexer would always use the new API; the old API is 
wrapped. Core analyzers would only need to implement the new methods (or could 
even stay with the old ones).

There are two problems:
- If somebody overrides neither method, the call to incrementToken() will loop 
endlessly; there should be some early break with a UOE in this case. I do not 
know how to implement this without inspecting the stack trace in the default 
methods.
- Filters are a bit tricky: they could delegate both methods to the wrapped 
stream by default. The problem: if one only implements one of the methods, the 
other passes down to the delegate and does nothing. But this could be fixed by 
always delegating the deprecated old method to the new one (or vice versa).

Hope this was clear; maybe I should create a patch.
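The two bridged defaults, and the endless-loop problem, can be sketched like this. The re-entrancy guard shown is just one hypothetical way to break the cycle without stack-trace inspection; the comment above explicitly leaves the detection mechanism open.

```java
// Self-contained model of the two bridged default methods.
class Bridged {
    private int depth = 0;

    // New API; default bridges to the deprecated old method.
    public boolean incrementToken() { return bridge(true); }

    // "Deprecated" old API; default bridges to the new method.
    public boolean next() { return bridge(false); }

    private boolean bridge(boolean toOld) {
        // Re-entering means neither method was overridden: break the cycle.
        if (depth > 0)
            throw new UnsupportedOperationException(
                "subclass must override incrementToken() or next()");
        depth++;
        try { return toOld ? next() : incrementToken(); }
        finally { depth--; }
    }
}

// Old-style subclass: overrides only next(); new-API callers still work.
class OldStyle extends Bridged {
    private int n = 0;
    @Override public boolean next() { return n++ < 3; }
}

// Overrides neither method: the guard turns the endless loop into a UOE.
class Broken extends Bridged { }

public class BridgeSketch {
    public static void main(String[] args) {
        OldStyle s = new OldStyle();
        int count = 0;
        while (s.incrementToken()) count++;   // bridged down to next()
        if (count != 3) throw new AssertionError(count);

        boolean threw = false;
        try { new Broken().incrementToken(); }
        catch (UnsupportedOperationException e) { threw = true; }
        if (!threw) throw new AssertionError("expected UOE");
        System.out.println("bridge works, loop detected");
    }
}
```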


[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720237#action_12720237
 ] 

Uwe Schindler edited comment on LUCENE-1693 at 6/16/09 9:57 AM:


Grant: but with the new TokenStream API, you must always clone the attributes 
before caching, because the TokenFilter has only *one* final working instance 
per Attribute, whose contents change all the time.
In that case it would be no better than wrapping next(Token) with the new API, 
feeding in the current instance token and copying it back into the instance one 
if next(Token) returns a different instance.
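A tiny self-contained demonstration of why a caching filter must clone before caching when there is only one mutating working instance (illustrative names only, not Lucene classes):

```java
import java.util.ArrayList;
import java.util.List;

class Attr implements Cloneable {
    String term = "";
    @Override public Attr clone() {
        try { return (Attr) super.clone(); }
        catch (CloneNotSupportedException e) { throw new AssertionError(e); }
    }
}

public class CacheSketch {
    public static void main(String[] args) {
        Attr working = new Attr();          // the single shared instance
        List<Attr> wrong = new ArrayList<>();
        List<Attr> right = new ArrayList<>();

        for (String t : new String[] {"a", "b", "c"}) {
            working.term = t;               // the stream mutates in place
            wrong.add(working);             // caches 3 refs to ONE object
            right.add(working.clone());     // caches a snapshot per token
        }
        // every "wrong" entry now shows only the last value:
        if (!wrong.get(0).term.equals("c")) throw new AssertionError();
        if (!right.get(0).term.equals("a")) throw new AssertionError();
        System.out.println("cloning before caching preserves each token");
    }
}
```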
