[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735124#action_12735124 ]

Robert Muir edited comment on LUCENE-1693 at 7/24/09 12:08 PM:
---
{quote}
I completely agree. We should discourage users from referencing Token and rather have them use the Attribute interfaces. That's the whole beauty and flexibility of this new API.
{quote}
Has there been any thought about reconsidering the new API's experimental status, then? I don't think the WARNING encourages users to use these interfaces!

Or maybe a compromise: modify the javadocs to be a little less scary. Does this text have to be FF (red)?

was (Author: rcmuir):
{quote}
I completely agree. We should discourage users from referencing Token and rather have them use the Attribute interfaces. That's the whole beauty and flexibility of this new API.
{quote}
Has there been any thought about reconsidering the new API's experimental status, then? I don't think the WARNING encourages users to use these interfaces!

AttributeSource/TokenStream API improvements

Key: LUCENE-1693
URL: https://issues.apache.org/jira/browse/LUCENE-1693
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 2.9
Attachments: lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, PerfTest3.java, TestAPIBackwardsCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java

This patch makes the following improvements to AttributeSource and TokenStream/Filter:

- introduces interfaces for all Attributes.
The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.
- new method added to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection, it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface-instance mappings to the attribute map for each of the found interfaces.
- removes the set/getUseNewAPI() methods (including the standard ones). Instead it is now enough to implement only the new API; if an old TokenStream still implements the old API (next()/next(Token)), it is wrapped automatically. The delegation path is determined via reflection (the patch determines which of the three methods was overridden).
- Token is no longer deprecated; instead it implements all 6 standard token interfaces (see above). The wrapper for next() and next(Token) uses this to automatically map all attribute interfaces to one TokenWrapper instance (implementing all 6 interfaces) that contains a Token instance. next() and next(Token) exchange the inner Token instance as needed. For the new incrementToken(), only one TokenWrapper instance is visible, delegating to the current reusable Token. This approach also preserves custom Token subclasses that may be created by very special token streams (see example in the backwards-compatibility test).
- AttributeImpl now has a default implementation of toString() that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.
- Cloning is now done much more efficiently in captureState.
The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a singly linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attributes that the source stream (e.g. SinkTokenizer) uses.
- Tee- and SinkTokenizer were deprecated, because they use Token instances for caching. This is not compatible with the new API using AttributeSource.State objects. You can still use the old deprecated ones, but new features provided by new Attribute types may get lost in the chain. A replacement is the new TeeSinkTokenFilter, which has a factory to create new Sink instances that have compatible attributes.

The cloning performance can be greatly improved if multiple AttributeImpl instances are not used in one TokenStream. A user can e.g. simply add a Token instance to the stream instead of the individual attributes, or implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed.
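The mechanics described above (the reflection-based interface walk in addAttributeImpl and the captureState/restoreState linked-list cloning) can be sketched with a minimal, self-contained stand-in. All class names here mirror the description but are hypothetical simplifications, not the real Lucene classes:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Marker interface standing in for org.apache.lucene.util.Attribute.
interface Attribute {}

// Stand-in for AttributeImpl: cloneable and copyable.
abstract class AttributeImpl implements Cloneable {
    public abstract void copyTo(AttributeImpl target);

    @Override
    public AttributeImpl clone() {
        try { return (AttributeImpl) super.clone(); }
        catch (CloneNotSupportedException e) { throw new AssertionError(e); }
    }
}

interface TermAttribute extends Attribute {
    String term();
    void setTerm(String t);
}

class TermAttributeImpl extends AttributeImpl implements TermAttribute {
    private String term = "";
    public String term() { return term; }
    public void setTerm(String t) { term = t; }
    public void copyTo(AttributeImpl target) { ((TermAttributeImpl) target).setTerm(term); }
}

// Minimal stand-in for the AttributeSource mechanics described above.
class MiniAttributeSource {
    private final Map<Class<? extends Attribute>, AttributeImpl> attributes = new LinkedHashMap<>();

    // Walks up the class hierarchy and registers the impl under every
    // interface that extends Attribute (the addAttributeImpl behavior).
    public void addAttributeImpl(AttributeImpl impl) {
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    attributes.put(iface.asSubclass(Attribute.class), impl);
                }
            }
        }
    }

    public boolean hasAttribute(Class<? extends Attribute> iface) {
        return attributes.containsKey(iface);
    }

    // A singly linked list of deep-cloned attribute impls (the State idea).
    static final class State {
        AttributeImpl impl;
        State next;
    }

    // Clones each *unique* impl once (one impl may back several interfaces).
    public State captureState() {
        Set<AttributeImpl> unique = new LinkedHashSet<>(attributes.values());
        State head = null, tail = null;
        for (AttributeImpl impl : unique) {
            State s = new State();
            s.impl = impl.clone();
            if (head == null) head = s; else tail.next = s;
            tail = s;
        }
        return head;
    }

    // Walks the list and copies each saved value back into the live impls.
    public void restoreState(State state) {
        for (State s = state; s != null; s = s.next) {
            for (AttributeImpl target : new LinkedHashSet<>(attributes.values())) {
                if (target.getClass() == s.impl.getClass()) s.impl.copyTo(target);
            }
        }
    }
}
```

Because captureState clones only the unique impl instances, a Token-style impl backing all interfaces is cloned exactly once per snapshot, which is the performance point the description makes.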
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735210#action_12735210 ]

Mark Miller edited comment on LUCENE-1693 at 7/24/09 4:59 PM:
--
Not sure what issue it stems from, but Token has a bunch of constructors that are deprecated but that don't point you to something new.

edit: must have come from the setBuffer stuff

was (Author: markrmil...@gmail.com):
Not sure what issue it stems from, but Token has a bunch of constructors that are deprecated but that don't point you to something new.
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733087#action_12733087 ]

Uwe Schindler edited comment on LUCENE-1693 at 7/19/09 4:05 PM:

Forgot CHANGES.txt entries with the backwards-break note and other changes.

When starting to transform the contrib streams, we should not forget to do the following (maybe add this note to the JIRA issues):
- rewrite and replace next(Token)/next() implementations by the new API
- if the class is final, no next(Token)/next() methods are needed *(they must be removed!!!)*
- if the class is non-final, add the following methods to the class:
{code}
/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 *  not be overridden. Delegates to the backwards compatibility layer. */
public final Token next(final Token reusableToken) throws java.io.IOException {
  return super.next(reusableToken);
}

/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 *  not be overridden. Delegates to the backwards compatibility layer. */
public final Token next() throws java.io.IOException {
  return super.next();
}
{code}
Also the incrementToken() method must be final in this case.

was (Author: thetaphi):
Forgot CHANGES.txt entries with the backwards-break note and other changes.

When starting to transform the contrib streams, we should not forget to do the following (maybe add this note to the JIRA issues):
- rewrite and replace next(Token)/next() implementations by the new API
- if the class is final, no next(Token)/next() methods are needed *(they must be removed!!!)*
- if the class is non-final, add the following methods to the class:
{code}
/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 *  not be overridden. Delegates to the backwards compatibility layer. */
public final Token next(final Token reusableToken) throws java.io.IOException {
  return super.next(reusableToken);
}

/** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
 *  not be overridden. Delegates to the backwards compatibility layer. */
public final Token next() throws java.io.IOException {
  return super.next();
}
{code}
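The recipe above (new-style streams implement only incrementToken(), while the deprecated next() entry points are pinned down as final so they can only delegate to the base class) can be sketched in a self-contained way. These classes are hypothetical stand-ins, not the real Lucene hierarchy, and the real methods additionally declare IOException:

```java
// Hypothetical mini TokenStream hierarchy sketching the backwards-compatibility
// pattern: the base class emulates the old next() on top of incrementToken().
abstract class CompatTokenStream {
    String term; // stands in for the attributes of the stream

    /** New API: advance to the next token, return false at end of stream. */
    public abstract boolean incrementToken();

    /** Old API, emulated on top of the new one by the base class. */
    public String next() {
        return incrementToken() ? term : null;
    }
}

// A non-final stream following the recipe: it implements only incrementToken()
// and declares next() final, so subclasses cannot break the delegation path.
class DigitStream extends CompatTokenStream {
    private int i = 0;

    @Override
    public boolean incrementToken() {
        if (i >= 3) return false;
        term = "token" + i++;
        return true;
    }

    /** @deprecated delegates to the backwards compatibility layer. */
    @Deprecated
    public final String next() { return super.next(); }
}
```

Declaring the override final is the whole point of the recipe: old callers still compile against next(), but no subclass can override it and silently fall out of the delegation chain.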
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732878#action_12732878 ]

Uwe Schindler edited comment on LUCENE-1693 at 7/18/09 4:06 AM:

In this case we should rename TeeSink to something like SplitTokenStream (which does not extend TokenStream). One could then get any number of sinks from it:
{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}
In this case the caching would be done directly in the splitter, and the sinks are only consumers. The first sink that asks for the attribute states forces the splitter to harvest and cache the input stream (exactly like CachingTokenFilter does it). In principle it would be the same as a CachingTokenFilter.

But on the other hand: you can always create a CachingTokenFilter and reuse the same instance for different fields. Because the indexer always calls reset() before consuming, you could re-read it easily. Any additional filters could then be plugged in front of it for each field; in this case the order is not important:
{code}
TokenStream stream = new CachingTokenFilter(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}
This would not work if the indexer can consume the different fields in parallel. But in the current state it would not even work with Tee/Sink (not multithread-compatible).

was (Author: thetaphi):
In this case we should rename TeeSink to something like SplitTokenStream (which does not extend TokenStream). One could then get any number of sinks from it:
{code}
SplitTokenStream splitter = new SplitTokenStream(stream); // does not extend TokenStream!!!
TokenStream stream1 = splitter.newSinkTokenStream();
TokenStream stream2 = splitter.newSinkTokenStream();
...
{code}
In this case the caching would be done directly in the splitter, and the sinks are only consumers. The first sink that asks for the attribute states forces the splitter to harvest and cache the input stream (exactly like CachingTokenStream does it). In principle it would be the same as a CachingTokenStream. But on the other hand: you can always create a CachingTokenStream and reuse the same instance for different fields. Because the indexer always calls reset() before consuming, you could re-read it easily. Any additional filters could then be plugged in front of it for each field; in this case the order is not important:
{code}
TokenStream stream = new CachingTokenStream(input);
doc.add(new Field("xyz", new DoSomethingTokenFilter(stream)));
doc.add(new Field("abc", new DoSomethingOtherTokenFilter(stream)));
...
{code}
This would not work if the indexer can consume the different fields in parallel. But in the current state it would not even work with Tee/Sink (not multithread-compatible).
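The splitter idea proposed above (cache the input once on first consumption, then hand out any number of replaying sinks) can be sketched with plain iterators standing in for TokenStreams. SplitTokenStream and newSinkTokenStream are the names from the comment, used here as a hypothetical API, not something that exists in Lucene:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the proposed SplitTokenStream: the splitter caches the
// input stream once and hands out any number of sinks that replay the cache.
class SplitTokenStream {
    private final Iterator<String> input;
    private final List<String> cache = new ArrayList<>();
    private boolean filled = false;

    SplitTokenStream(Iterator<String> input) { this.input = input; }

    // The first sink that is consumed forces the splitter to harvest the input
    // (exactly the lazy-caching behavior described in the comment).
    private void fill() {
        if (filled) return;
        while (input.hasNext()) cache.add(input.next());
        filled = true;
    }

    /** Creates a new sink that replays the cached token stream from the start. */
    Iterator<String> newSinkTokenStream() {
        return new Iterator<String>() {
            private int pos = 0;
            public boolean hasNext() { fill(); return pos < cache.size(); }
            public String next() { fill(); return cache.get(pos++); }
        };
    }
}
```

Each sink keeps only its own read position, so the sinks are pure consumers and the single cached copy lives in the splitter, which is the design point being argued for.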
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732407#action_12732407 ]

Michael Busch edited comment on LUCENE-1693 at 7/17/09 1:37 AM:

{quote}
As CachingTokenFilter has no API to access the cached attributes/tokens directly,
{quote}
Oh true :) well, scratch it from the TODO list... We knew it'd work conceptually the same for AttributeSource.State; unlike Tee/Sink, which wouldn't even be safe to use with the new API if it didn't have the getTokens() method, for the reasons I explained above.

{quote}
Another idea would be to clone all attribute impls and then add them to the sink - the factory would then not be used?
{quote}
Yes, I thought about this for a while. It would be nice to have this (i.e. cloning an AttributeSource) in general: you could reduce the costs of initializing the TokenStreams with onlyUseNewAPI=false. We just need to keep a static AttributeSource around that contains the wrapper and the mappings for the 6 default interfaces. Then, instead of constructing it every time, we just clone that AttributeSource for new TokenStreams. The query parser could do the same to keep initialization costs of the TokenStreams minimal, because it always needs the same attributes. I think it should be easy: we just need to implement clone() for AttributeSource.

was (Author: michaelbusch):
{quote}
As CachingTokenFilter has no API to access the cached attributes/tokens directly,
{quote}
Oh true :) well, scratch it from the TODO list... We knew it'd work conceptually the same for AttributeSource.State; unlike Tee/Sink, which wouldn't even be safe to use with the new API if it didn't have the getTokens() method, for the reasons I explained above.

{quote}
Another idea would be to clone all attribute impls and then add them to the sink - the factory would then not be used?
{quote}
Yes, I thought about this for a while. It would be nice to have this (i.e. cloning an AttributeSource) in general: you could reduce the costs of initializing the TokenStreams with onlyUseNewAPI=false. We just need to keep a static AttributeSource around that contains the wrapper and the mappings for the 6 default interfaces. Then, instead of constructing it every time, we just clone that AttributeSource for new TokenStreams. The query parser could do the same to keep initialization costs of the TokenStreams minimal, because it always needs the same attributes. I think it should be easy: we just need to implement clone() for AttributeSource.
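The prototype idea discussed above (keep one fully initialized template source and deep-clone it instead of re-running the reflection setup for every new stream) can be sketched as follows. CloneableAttributeSource is a hypothetical simplification; a plain map of mutable values stands in for the attribute impls:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the "clone a template AttributeSource" idea: deep-clone a fully
// initialized template so each new stream gets independent attribute state
// without paying the setup cost again.
class CloneableAttributeSource implements Cloneable {
    // attribute interface name -> mutable value (stands in for an AttributeImpl)
    Map<String, String[]> attributes = new LinkedHashMap<>();

    void addAttribute(String name) { attributes.put(name, new String[] { "" }); }

    @Override
    public CloneableAttributeSource clone() {
        try {
            CloneableAttributeSource copy = (CloneableAttributeSource) super.clone();
            copy.attributes = new LinkedHashMap<>();
            // Deep-clone each attribute so the clone's state is independent of
            // the template (a shallow copy would share the mutable arrays).
            for (Map.Entry<String, String[]> e : attributes.entrySet()) {
                copy.attributes.put(e.getKey(), e.getValue().clone());
            }
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}
```

The deep copy is the subtle part the comment is getting at: a default Object.clone() would share the attribute instances between the template and the new stream, so clone() has to be implemented explicitly.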
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732762#action_12732762 ]

Uwe Schindler edited comment on LUCENE-1693 at 7/17/09 3:31 PM:

In my opinion, this test that I was not able to reimplement using the new TeeSink is very theoretical. Someone who consumes two streams could always implement this very simply using one of two approaches:
- consume the first stream completely, then the other
- consume one token from the first, then one token from the second, and so on

Who invented the whole Tee/Sink originally, and for what? Maybe we should simply ask on java-user who has ever used it. I have never heard of Tee/Sink before Michael confronted me with his strange POSToken tests :-)

was (Author: thetaphi):
In my opinion, this test that I was not able to reimplement using the new TeeSink is very theoretical. Someone who consumes two streams could always implement this very simply using one of two approaches:
- consume the first stream completely, then the other
- consume one token from the first, then one token from the second, and so on

Who invented the whole Tee/Sink originally, and for what? Maybe we should simply ask on java-user who has ever used it. I have never heard of Tee/Sink before Michael confronted me with his strange POSToken tests :-)
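The two consumption strategies listed above can be sketched with plain iterators standing in for TokenStreams (a hypothetical simplification, just to make the two orderings concrete):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of the two ways a consumer of two token streams can proceed.
class TwoStreamConsumer {
    /** Consume the first stream completely, then the other. */
    static List<String> sequential(Iterator<String> a, Iterator<String> b) {
        List<String> out = new ArrayList<>();
        while (a.hasNext()) out.add(a.next());
        while (b.hasNext()) out.add(b.next());
        return out;
    }

    /** Consume one token from the first, then one from the second, and so on. */
    static List<String> interleaved(Iterator<String> a, Iterator<String> b) {
        List<String> out = new ArrayList<>();
        while (a.hasNext() || b.hasNext()) {
            if (a.hasNext()) out.add(a.next());
            if (b.hasNext()) out.add(b.next());
        }
        return out;
    }
}
```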
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731295#action_12731295 ] Uwe Schindler edited comment on LUCENE-1693 at 7/14/09 11:12 PM:

bq. However, if people are using next() in a chain with core streams/filters, then every single token will now be cloned, possibly multiple times, right?

There was a similar problem in 2.4: whenever you call next() to consume a stream, every token is a new instance. Previously the cloning was not needed, but it is now needed for backwards compatibility (a full private copy). In my opinion this is negligible, as indexing speed is what matters, and the indexer always uses incrementToken(). If somebody has written their own query parser that uses next() to consume the stream, speed is not really important there. Besides, next() is deprecated, and the docs (even in 2.4) state that it "will be slower than calling next(Token)".

bq. So if people are using streams/filters that implement next(Token) I think the performance should be comparable - even though there's also a (hopefully small) performance hit to expect because of more method calls and if checks.

I found no performance hit; it is about the same speed. The variance between test runs is bigger than any measurable performance impact. The other sensitive part (TokenWrapper): the wrapping using TokenWrapper was in the original indexing code, too (the private BackwardsCompatibilityStream class).

bq. It's failing right now, because it's testing a lot of different combinations. I need to check if all of those different tests are actually valid, because we're saying you can't use Tee/Sink with the new API anymore.

Have you seen my backwards-compatibility test, too? It is a copy of yours (with some variation). The Lucene24* classes were removed because Tee/SinkTokenizer were reverted in the patch to their original 2.4 status (they only implement the old API). As far as I can see (I have not yet tried it out), you are testing new-style-API streams with the old, deprecated Tee/Sink tokenizers. That was not possible before 2.9 (there was no new API), so the backwards-compatibility problem does not arise there. If you rewrite your streams with the new API, you should use TeeSinkTokenizer, too.

AttributeSource/TokenStream API improvements Key: LUCENE-1693 URL: https://issues.apache.org/jira/browse/LUCENE-1693 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.9 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestAPIBackwardsCompatibility.java,
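The contrast discussed above, between the deprecated next() contract (a private copy per token) and incrementToken() (one shared attribute instance overwritten in place), can be sketched with a toy model. The classes and names below are illustrative stand-ins, not the real Lucene classes:

```java
import java.util.ArrayList;
import java.util.List;

// Toy stream: one shared mutable "attribute" that incrementToken()
// overwrites in place, mimicking the new-API reuse pattern.
class ToyStream {
    private final String[] words = {"foo", "bar", "baz"};
    private int pos = 0;
    final StringBuilder termAtt = new StringBuilder(); // shared instance

    // New API style: advance and overwrite the shared attribute;
    // no per-token allocation.
    boolean incrementToken() {
        if (pos >= words.length) return false;
        termAtt.setLength(0);
        termAtt.append(words[pos++]);
        return true;
    }

    // Old API style: the caller may keep the returned value, so a
    // private copy (a fresh String) must be handed out per token.
    String next() {
        return incrementToken() ? termAtt.toString() : null;
    }
}

public class ConsumeDemo {
    // Consumer loop in the new style: copy only when keeping the value.
    static List<String> consume(ToyStream s) {
        List<String> terms = new ArrayList<>();
        while (s.incrementToken()) {
            terms.add(s.termAtt.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new ToyStream())); // [foo, bar, baz]
    }
}
```

The point of the sketch is that the allocation/cloning cost sits entirely in the old next() path, while incrementToken() consumers pay nothing unless they explicitly copy.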
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723727#action_12723727 ] Uwe Schindler edited comment on LUCENE-1693 at 6/24/09 1:46 PM:

In your example, you mean SinkTokenizer is from the core and only implements incrementToken(), while the other methods are handled by the defaults in TokenStream (using this reflection-based redirect)?

I tested this by overriding LowerCaseFilter. In my latest patch, LowerCaseFilter only implements incrementToken(); next() and next(Token) are handled by the forwarder in TokenStream. If you override next(Token) and consume the stream using incrementToken(), then your overridden next(Token) is never called. This exact same backwards-compatibility problem already existed when we changed from next() to next(Token): if a core filter only implemented next(Token), and a subclass (from an older code base) overrode only next() and called super() there, the same thing happened: the tokenizer code called next(Token), but next() was never called. You have the same problem with our recent deprecations in DocIdSetIterator. I would not see this as a problem; it can be found everywhere in Lucene where we deprecated (originally abstract) methods and replaced them with newer ones that call the old ones by default.

AttributeSource/TokenStream API improvements Key: LUCENE-1693 URL: https://issues.apache.org/jira/browse/LUCENE-1693 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.9 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java
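The override-bypass problem described in this comment can be shown in miniature. The method names below model Lucene's old next() / new incrementToken() pair, but they return a plain String instead of real tokens to keep the sketch self-contained:

```java
// Base models a core class that implements only the new API, with the
// deprecated old method forwarding to it by default.
class Base {
    // New API; core classes implement only this.
    public String incrementToken() {
        return "lowercased";
    }

    // Old, deprecated API; the default forwards to the new method.
    @Deprecated
    public String next() {
        return incrementToken();
    }
}

// A user subclass from an older code base overrides only the old method.
class LegacySubclass extends Base {
    @Override
    public String next() {
        return super.next() + "+custom";
    }
}

public class BypassDemo {
    public static void main(String[] args) {
        Base b = new LegacySubclass();
        // A consumer on the new API never routes through the subclass's
        // next(), so the "+custom" logic is silently skipped:
        System.out.println(b.incrementToken()); // lowercased
        // Only old-API consumers see the customization:
        System.out.println(b.next()); // lowercased+custom
    }
}
```

This is exactly the shape of the problem Uwe points out: the new-API consumer dispatches to incrementToken(), which the legacy subclass never overrode.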
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723759#action_12723759 ] Michael Busch edited comment on LUCENE-1693 at 6/24/09 2:50 PM:

{quote} By the way: I want to have my beer!!! {quote}

I decided a while ago that you'll definitely get it, because of your admirable tenacity. :)

AttributeSource/TokenStream API improvements Key: LUCENE-1693 URL: https://issues.apache.org/jira/browse/LUCENE-1693 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.9 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java

This patch makes the following improvements to AttributeSource and TokenStream/Filter:

- removes the set/getUseNewAPI() methods (including the standard ones). Instead, by default incrementToken() throws a subclass of UnsupportedOperationException. The indexer initially calls incrementToken() once to see if the exception is thrown; if so, it falls back to the old API.
- introduces interfaces for all Attributes. The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.
- adds a new method to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection, it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface-instance mappings to the attribute map for each of the found interfaces.
- AttributeImpl now has a default implementation of toString() that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.
- Cloning is now done much more efficiently in captureState(). The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a single linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState() recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attributes that the target stream (e.g. SinkTokenizer) uses.

The cloning performance can be greatly improved if multiple AttributeImpl instances are not used in one TokenStream. A user can, e.g., simply add a Token instance to the stream instead of the individual attributes. Or the user could implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed. I think this (addAttributeImpl) should be considered an expert API, as this manual optimization is only needed if cloning performance is crucial. I ran some quick performance tests using Tee/Sink tokenizers (which do cloning) and the performance was roughly 20% faster with the new API. I'll run some more performance tests and post more numbers then.

Note also that when we add serialization to the Attributes, e.g. for supporting storing serialized TokenStreams in the index, the serialization should benefit even more significantly from the new API than cloning does. Also, the TokenStream API does not change, except for the removal of the set/getUseNewAPI methods, so the patches in LUCENE-1460 should still work. All core tests pass; however, I need to update all the documentation and also add some unit tests for the new AttributeSource functionality. So this patch is not ready to commit yet, but I wanted to post it already for some feedback.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. -
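The 'Impl'-postfix factory convention from the patch description can be sketched in a few lines. The TermAttribute interface below is a stripped-down stand-in for illustration, not the real Lucene interface, and the lookup assumes the Impl class is resolvable by name on the classpath:

```java
// Minimal sketch of the described factory convention: for an Attribute
// interface, load the implementation class named <Interface>Impl via
// reflection.
interface TermAttribute {
    void setTerm(String term);
    String term();
}

// Found by the factory purely through the 'Impl' naming convention.
class TermAttributeImpl implements TermAttribute {
    private String term = "";
    public void setTerm(String term) { this.term = term; }
    public String term() { return term; }
}

public class FactoryDemo {
    // Resolve interface -> implementation by appending "Impl" to the name.
    static Object create(Class<?> attInterface) {
        try {
            Class<?> implClass = Class.forName(attInterface.getName() + "Impl");
            return implClass.getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new RuntimeException(
                    "no Impl class for " + attInterface.getName(), e);
        }
    }

    public static void main(String[] args) {
        TermAttribute att = (TermAttribute) create(TermAttribute.class);
        att.setTerm("lucene");
        System.out.println(att.term()); // lucene
    }
}
```

The appeal of the convention is that callers only ever name the interface; the concrete class is an implementation detail that a custom factory could swap out.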
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720534#action_12720534 ] Uwe Schindler edited comment on LUCENE-1693 at 6/17/09 12:39 AM:

Hi Michael, in principle your test is invalid. It has other token filters in the chain, over which the user has no control. With the two mentioned filters it may work, because they do not change the reusableToken instance. But the API clearly states that the reusableToken need not be used and another instance can be returned, so this is really unsupported behaviour. If you remove the filters in between, it would work correctly. And this could even fail with 2.4 if you put other token filters into your chain. In my opinion, the advantages of token reuse clearly outweigh the small problems with (unsupported) usage. The API does exactly what is mentioned in the API docs for 2.4.1. The main advantage is that you can mix old and new filter instances and you lose nothing...

AttributeSource/TokenStream API improvements Key: LUCENE-1693 URL: https://issues.apache.org/jira/browse/LUCENE-1693 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.9 Attachments: LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720052#action_12720052 ] Uwe Schindler edited comment on LUCENE-1693 at 6/16/09 3:43 AM:

{quote}
bq. What was your conclusion about my idea yesterday, to pass the Token around even with the old API and copy it on demand

I don't think the indexer should know about Token at all? It should only use the interfaces so that we can maintain the full flexibility. Also, I don't really like the fact that some users might get a performance hit. I had the idea to throw the exception in incrementToken() to automatically be able to fall back to the old API. I think this is nice and gets rid of the explicit useNewAPI methods. The only drawback is still the fact that we have to implement both old and new APIs in Lucene's tokenizers and filters until we remove the old API. Grant won't like this, but I think this is better than the possible performance hit? Also, we don't add new filters THAT often to Lucene, and implementing both methods is often mostly copy-paste.
{quote}

I did not mean to use Token in the indexer; I would like to remove the whole old API from the indexer (even the UOE). My idea would be the following:

- Let Token implement the default interfaces; the default factory creates it for all requests for the default attributes.
- If somebody implements the new API, the indexer can use it without problems. If he doesn't, the default implementation in TokenStream would call next(Token), using the Token instance from the AttributeSource. If the method returns another Token instance (because it did not reuse, which is seldom, I think), copy the returned token into the per-instance AttributeSource Token.
- The other way round: if one uses a TokenStream through the old API (own indexer, query parser, ...) and the TokenStream only implements the new API, the deprecated old method would also have a default implementation, ignoring the token supplied to next() and always returning the instance token after calling incrementToken().

Because of this, the indexer would always use the new API; the old API is wrapped. Core analyzers only need to implement the new methods (or could even stay with the old ones). There are two problems:

- If somebody implements neither method, the call to incrementToken() will loop endlessly; there should be some early break with a UOE in this case. I do not know how to implement this without inspecting the stack trace in the default methods.
- Filters are a bit tricky: both methods could, by default, pass through to the delegate. The problem: if one implements only one of the methods, the other passes down to the delegate and does nothing. But this could be fixed by always delegating in the deprecated old method to the new one (or vice versa).

Hope this was clear; maybe I should create a patch.
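The mutual-delegation scheme sketched in this comment, along with the endless-loop hazard it raises, can be modeled in a few lines. The names are illustrative, and the re-entrancy flag used as the "early break" is my own simplification (the comment itself leaves open how Lucene would detect the loop, e.g. via stack inspection):

```java
// Each API's default wraps the other, so a stream may implement either
// method. A guard flag turns the otherwise endless mutual recursion into
// an UnsupportedOperationException.
class Stream {
    private boolean delegating = false;

    // New API default: wrap the old method.
    public String incrementToken() {
        if (delegating) throw new UnsupportedOperationException(
                "neither incrementToken() nor next() is implemented");
        delegating = true;
        try {
            return next();
        } finally {
            delegating = false;
        }
    }

    // Old API default: wrap the new method.
    public String next() {
        if (delegating) throw new UnsupportedOperationException(
                "neither incrementToken() nor next() is implemented");
        delegating = true;
        try {
            return incrementToken();
        } finally {
            delegating = false;
        }
    }
}

// Implements only the new API; old-API callers are served via the wrapper.
class NewStyle extends Stream {
    @Override
    public String incrementToken() {
        return "token";
    }
}

public class WrapDemo {
    public static void main(String[] args) {
        System.out.println(new NewStyle().incrementToken()); // token
        System.out.println(new NewStyle().next());           // token (wrapped)
        try {
            new Stream().incrementToken(); // neither method implemented
        } catch (UnsupportedOperationException e) {
            System.out.println("early break instead of endless loop");
        }
    }
}
```

With this shape, an old-API consumer calling next() on a new-style stream transparently lands in incrementToken(), which is exactly the "old API is wrapped" behaviour the comment proposes.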
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720237#action_12720237 ] Uwe Schindler edited comment on LUCENE-1693 at 6/16/09 9:57 AM:

Grant: But with the new TokenStream API, you must always clone the attributes before caching, because the TokenFilter only has *one* final working instance per Attribute, whose contents change all the time. In this case it would not be better than wrapping next(Token) with the new API, feeding in the current instance token, and copying it back into the instance one if next(Token) returns a different instance.

AttributeSource/TokenStream API improvements Key: LUCENE-1693 URL: https://issues.apache.org/jira/browse/LUCENE-1693 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 2.9 Attachments: lucene-1693.patch
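The point about cloning before caching, which is also what captureState()/restoreState() in the patch description address, comes down to one shared attribute instance being overwritten on every step. A toy model (illustrative names, not the Lucene classes):

```java
import java.util.ArrayList;
import java.util.List;

// One working "attribute" instance whose contents are overwritten on
// every token; a caching consumer must deep-copy it, as captureState()
// does in the patch.
class Att implements Cloneable {
    String term;

    @Override
    public Att clone() {
        try {
            return (Att) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: we implement Cloneable
        }
    }
}

public class CaptureDemo {
    static List<String> cacheTerms(String[] words) {
        Att shared = new Att();            // the single working instance
        List<Att> cache = new ArrayList<>();
        for (String w : words) {
            shared.term = w;
            cache.add(shared.clone());     // "captureState": copy before caching
        }
        shared.term = "overwritten";       // later mutation must not touch the cache
        List<String> out = new ArrayList<>();
        for (Att a : cache) {
            out.add(a.term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(cacheTerms(new String[] {"a", "b"})); // [a, b]
    }
}
```

Caching `shared` itself instead of `shared.clone()` would leave the cache holding N references to one object, all reading "overwritten", which is why the clone per cached token is unavoidable under the shared-instance contract.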