[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543991 ]

Michael McCandless commented on LUCENE-1063:
--------------------------------------------

{quote}
{code}
// Filter F is calling TokenStream ts:
Token next(Token result) {       // F's next() method
    Token t = ts.next(result);
    t.setSomething();            // modifies the returned token in place
    return t;
}
{code}
Problem as described: ts expects the token it returns to not be altered because 
it somehow intends to rely on its content when servicing the following call to 
next([]). In other words, it assumes that callers to next([]) would only 
consume, but not alter, the returned token.
{quote}

And ts only defines the "non-reuse" next(), so it is the default
implementation of TokenStream.next(Token) that is actually invoked,
which in turn invokes ts.next() and returns the result directly.
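To make the failure concrete, here is a minimal, self-contained sketch (stand-in classes, not the real Lucene 2.3 `Token`/`TokenStream` API) of how a delegating default next(Token) lets a reuse-style filter corrupt the stream's private copy:

```java
// Stand-in classes (not the real Lucene API) showing how a pass-through
// default next(Token) exposes the stream's private token to mutation.
class RTok {
    char[] termBuffer;
    RTok(char[] b) { termBuffer = b; }
    String term() { return new String(termBuffer); }
}

// A stream that only really implements the "non-reuse" next(); it keeps a
// reference to the token it hands out, relying on callers not to alter it.
class NonReuseStream {
    private final RTok cached = new RTok("Hello".toCharArray());
    RTok next() { return cached; }
    // Default "reuse" implementation as described above: delegate and
    // return the private copy directly -- this is the problem.
    RTok next(RTok result) { return next(); }
}

// A reuse-style filter, like LowerCaseFilter: it modifies the returned
// token in place, which the reuse contract permits.
class LowerCaseSketch {
    RTok next(NonReuseStream in, RTok result) {
        RTok t = in.next(result);
        for (int i = 0; i < t.termBuffer.length; i++)
            t.termBuffer[i] = Character.toLowerCase(t.termBuffer[i]);
        return t;
    }
}

public class ReuseBreakDemo {
    public static void main(String[] args) {
        NonReuseStream ts = new NonReuseStream();
        new LowerCaseSketch().next(ts, new RTok(new char[0]));
        // The stream's cached token was silently corrupted:
        System.out.println(ts.next().term()); // "hello", no longer "Hello"
    }
}
```

The filter did nothing illegal under the reuse contract, yet the upstream stream's private state was destroyed.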

{quote}
Seems that such an expectation by ts would be problematic no matter if 
ts.next() or ts.next(Token) are used.

I mean, even if we removed next(Token) but kept Token.termBuffer(), that char 
array could be modified, and some TokenStream implementation could still be 
broken because it assumes (following similar logic) that it can reuse its 
private copy of the char array... right?
{quote}

I don't think it's problematic for ts to expect this?  This is the
"contract" that you are supposed to follow for this API, spelled out
in the javadocs.

When you call the "non-reuse" ts.next() you expect to get a private copy
that you can hold onto indefinitely and that will never be modified, and
you accept that you must never modify this token yourself.

Whereas when you call the "reuse" ts.next(Token) you accept that you must
fully consume the returned Token before you next call next(Token), and
that you are free to alter this token.

I think that contract is well defined & consistent?
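From the caller's side, the two contracts can be sketched like this (minimal stand-in classes with assumed behavior, not the actual Lucene API):

```java
// Caller-side sketch of the non-reuse vs. reuse contracts, using
// stand-in classes (not the real Lucene Token/TokenStream).
import java.util.ArrayList;
import java.util.List;

class Tok {
    char[] buf;
    Tok(char[] b) { buf = b; }
    String term() { return new String(buf); }
}

class WordStream {
    private final String[] words = {"A", "B", "C"};
    private int i = 0;

    // Non-reuse: returns a fresh private copy on every call.
    Tok next() {
        return i < words.length ? new Tok(words[i++].toCharArray()) : null;
    }

    // Reuse: refills the caller-supplied token in place.
    Tok next(Tok result) {
        if (i >= words.length) return null;
        result.buf = words[i++].toCharArray();
        return result;
    }
}

public class ContractDemo {
    public static void main(String[] args) {
        // Non-reuse contract: safe to hold tokens indefinitely,
        // but the caller must never alter them.
        WordStream s1 = new WordStream();
        List<Tok> saved = new ArrayList<>();
        for (Tok t = s1.next(); t != null; t = s1.next()) saved.add(t);
        System.out.println(saved.get(0).term()); // still "A": a private copy

        // Reuse contract: free to alter the token, but it must be fully
        // consumed before the next call, which may overwrite it.
        WordStream s2 = new WordStream();
        Tok reusable = new Tok(new char[0]);
        StringBuilder sb = new StringBuilder();
        for (Tok t = s2.next(reusable); t != null; t = s2.next(reusable)) {
            t.buf[0] = Character.toLowerCase(t.buf[0]); // allowed under reuse
            sb.append(t.term());                        // consume before next call
        }
        System.out.println(sb); // "abc"
    }
}
```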

{quote}
TokenStream already does this, right? (or do you mean in the class TokenStream 
or in all implementations of TokenStream?)
{quote}
I'm talking about TokenStream's default implementation of next(Token).
It's not copying now, but it needs to in order to properly meet the
contract of this API (i.e., allow the caller to modify the returned token).
The default implementation of TokenStream.next() does already copy.
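A minimal sketch of what such a copying default could look like (again with stand-in classes, not the actual Lucene patch):

```java
// Sketch of a default next(Token) that copies the private token returned
// by next() into the caller-supplied result, so downstream modification
// cannot corrupt the stream's private copy (stand-ins, not the real API).
class CTok {
    char[] termBuffer;
    CTok(char[] b) { termBuffer = b; }
    String term() { return new String(termBuffer); }
}

class CachingStream {
    private final CTok cached = new CTok("Hello".toCharArray());

    // "Non-reuse" next(): this stream keeps a reference to the token it
    // hands out, relying on callers never to modify it.
    CTok next() { return cached; }

    // Fixed default next(Token): make a full copy into the caller's
    // token, which the caller is then free to alter.
    CTok next(CTok result) {
        CTok t = next();
        if (t == null) return null;
        result.termBuffer = t.termBuffer.clone(); // the full copy
        return result;
    }
}

public class CopyFixDemo {
    public static void main(String[] args) {
        CachingStream ts = new CachingStream();
        CTok out = ts.next(new CTok(new char[0]));
        // A downstream filter lowercases in place, as LowerCaseFilter would:
        for (int i = 0; i < out.termBuffer.length; i++)
            out.termBuffer[i] = Character.toLowerCase(out.termBuffer[i]);
        System.out.println(out.term());       // "hello"
        System.out.println(ts.next().term()); // still "Hello": the copy protected it
    }
}
```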


> Token re-use API breaks back compatibility in certain TokenStream chains
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-1063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1063
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 2.3
>
>
> In scrutinizing the new Token re-use API during this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54708
> I realized we now have a back-compatibility break when mixing re-use and
> non-re-use TokenStreams.
> The new "reuse" next(Token) API actually allows two different aspects
> of re-use:
>   1) "Backwards re-use": the subsequent call to next(Token) is allowed
>      to change all aspects of the provided Token, meaning the caller
>      must do all persisting of Token that it needs before calling
>      next(Token) again.
>   2) "Forwards re-use": the caller is allowed to modify the returned
>      Token however it wants.  Eg the LowerCaseFilter is allowed to
>      downcase the characters in-place in the char[] termBuffer.
> The forwards re-use case can break backwards compatibility now.  E.g.:
> if a TokenStream X providing only the "non-reuse" next() API is
> followed by a TokenFilter Y using the "reuse" next(Token) API to pull
> the tokens, then the default implementation in TokenStream.java for
> next(Token) will kick in.
> That default implementation just returns the provided "private copy"
> Token returned by next().  But, because of 2) above, this is not
> legal: if the TokenFilter Y modifies the char[] termBuffer (say), that
> is actually modifying the cached copy being potentially stored by X.
> I think the opposite case is handled correctly.
> A simple way to fix this is to make a full copy of the Token in the
> next(Token) call in TokenStream, just like we do in the next() method
> in TokenStream.  The downside is a small performance hit; however, that
> hit only happens at the boundary between a non-reuse and a re-use
> tokenizer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

