[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544093 ]
Doron Cohen commented on LUCENE-1063:
-------------------------------------
{quote}
> TokenStreams that cache tokens without "protecting" their private copy when
> next() is called?
That would be a bug in the filter (both in the past and now).
{quote}
I think it is okay to relax this to protecting only in Tokenizers (where Tokens
are created), and not worry about TokenFilters.
TokenFilters always take a TokenStream at construction and always call its
next(Token), which eventually calls a Tokenizer.next(Token) -- which is
protected -- and so the TokenFilter can rely on that protection. Right?
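A minimal sketch of the chain described above (illustrative only, not code from the tree; the class name is made up):

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// A TokenFilter never creates Tokens itself: it wraps another TokenStream
// and pulls Tokens through next(Token), which eventually bottoms out in a
// Tokenizer.  If the Tokenizer protects any private copy it caches, the
// filter inherits that protection for free.
public class PassThroughFilter extends TokenFilter {

  public PassThroughFilter(TokenStream input) {   // always wraps another stream
    super(input);
  }

  public Token next(Token result) throws IOException {
    Token t = input.next(result);   // delegates; ends at Tokenizer.next(Token)
    if (t == null)
      return null;
    // t may be modified in place here ("forwards re-use") because the
    // Tokenizer at the bottom of the chain already handed out either the
    // reusable Token or a protected copy.
    return t;
  }
}
{code}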
> Token re-use API breaks back compatibility in certain TokenStream chains
> ------------------------------------------------------------------------
>
> Key: LUCENE-1063
> URL: https://issues.apache.org/jira/browse/LUCENE-1063
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 2.3
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1063.patch
>
>
> In scrutinizing the new Token re-use API during this thread:
> http://www.gossamer-threads.com/lists/lucene/java-dev/54708
> I realized we now have a back-compatibility break when mixing re-use and
> non-re-use TokenStreams.
> The new "reuse" next(Token) API actually allows two different aspects
> of re-use:
> 1) "Backwards re-use": the subsequent call to next(Token) is allowed
> to change all aspects of the provided Token, meaning the caller
> must do all persisting of Token that it needs before calling
> next(Token) again.
> 2) "Forwards re-use": the caller is allowed to modify the returned
> Token however it wants. Eg the LowerCaseFilter is allowed to
> downcase the characters in-place in the char[] termBuffer.
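> For reference, a sketch of what such an in-place ("forwards re-use") filter
> looks like, paraphrasing the 2.3-era LowerCaseFilter (not an exact copy of
> the shipped code):
>
> {code:java}
> public final Token next(Token result) throws IOException {
>   result = input.next(result);          // pull from the wrapped stream
>   if (result == null)
>     return null;
>   // Downcase directly in the returned Token's char[] termBuffer: legal
>   // under "forwards re-use", since the caller may modify the returned
>   // Token however it wants.
>   final char[] buffer = result.termBuffer();
>   final int length = result.termLength();
>   for (int i = 0; i < length; i++)
>     buffer[i] = Character.toLowerCase(buffer[i]);
>   return result;
> }
> {code}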
> The forwards re-use case can break backwards compatibility now. E.g.,
> if a TokenStream X providing only the "non-reuse" next() API is
> followed by a TokenFilter Y using the "reuse" next(Token) API to pull
> the tokens, then the default implementation of next(Token) in
> TokenStream.java will kick in.
> That default implementation just returns the "private copy" Token
> returned by next(). But, because of 2) above, this is not legal: if
> the TokenFilter Y modifies the char[] termBuffer (say), it is actually
> modifying the cached copy potentially stored by X.
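> Roughly, the problematic fallback looks like this (paraphrased; the real
> method lives in TokenStream.java):
>
> {code:java}
> // Default bridge from the re-use API to the old API: the Token handed
> // back is X's private copy, returned as-is with no defensive copy, so
> // any in-place edit by Y corrupts whatever X may have cached.
> public Token next(Token result) throws IOException {
>   return next();
> }
> {code}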
> I think the opposite case (a re-use stream followed by a non-re-use
> consumer calling next()) is handled correctly.
> A simple way to fix this is to make a full copy of the Token in
> TokenStream's next(Token), just like we already do in TokenStream's
> next() method. The downside is a small performance hit; however, that
> hit only happens at the boundary between a non-re-use and a re-use
> stream.
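> A sketch of that fix, assuming Token.clone() makes a full copy including
> the termBuffer (this shows the idea, not necessarily the committed patch):
>
> {code:java}
> public Token next(Token result) throws IOException {
>   Token t = next();   // fall back to the old non-reuse API of the stream
>   if (t == null)
>     return null;
>   // Hand the caller its own full copy, so in-place edits downstream can
>   // no longer touch the private copy the producer may be caching.
>   // (The passed-in result is simply not used; returning a different
>   // Token instance is allowed by the re-use contract.)
>   return (Token) t.clone();
> }
> {code}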