[jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first

Robert Muir (JIRA) Wed, 12 Aug 2009 15:54:38 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742638#action_12742638
 ]


Robert Muir commented on LUCENE-1801:
-------------------------------------

Uwe, sorry, I see there is an encoding problem with my patch file... I will 
supply another.

> Tokenizers (which are the source of Tokens) should call 
> AttributeSource.clearAttributes() first
> -----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1801
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1801
>             Project: Lucene - Java
>          Issue Type: Task
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 2.9
>
>         Attachments: LUCENE-1801.patch, LUCENE-1801.patch
>
>
> This is a followup for LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched 
> to the producer here: LUCENE-1101 
> I don't know if all of the Tokenizers in lucene were ever changed, but in any 
> case it looks like at least some of these bugs were introduced with the 
> switch to the attribute API - for example StandardTokenizer did clear its 
> reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call 
> clearAttributes first, we could do this in the indexer, which would add 
> overhead for old token streams that already clear their reusable token 
> themselves. This issue should also update the Javadocs to clearly state 
> inside Tokenizer.java that the source TokenStream (normally the Tokenizer) 
> should clear *all* Attributes. If it does not, and e.g. the 
> positionIncrement is changed to 0 by any TokenFilter, but the filter does 
> not change it back to 1, the TokenStream would stay at 0. If the TokenFilter 
> called PositionIncrementAttribute.clear() (because it is responsible), it 
> could also break the TokenStream, because clear() is a general method for 
> the whole attribute instance. If e.g. Token is used as AttributeImpl, a call 
> to clear() would also clear offsets and termLength, which is not wanted. So 
> the source of the tokenization should reset the attributes to default values.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes should run 
> fast, but it is an additional cost during tokenization because it was not 
> done consistently before. This causes a small speed degradation, but it has 
> nothing to do with the new TokenStream API.
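
The positionIncrement scenario from the issue description can be illustrated 
with a small, self-contained sketch. It does not use Lucene's real classes: 
Attributes, WhitespaceSource and StackingFilter are hypothetical stand-ins 
for an AttributeSource, a Tokenizer and a TokenFilter, and the "quick" 
trigger is an arbitrary assumption chosen only to make the stale value 
visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an attribute-based token stream; NOT Lucene's real API.
public class ClearAttributesSketch {

    /** Shared, reused attribute state (plays the role of an AttributeSource). */
    static final class Attributes {
        String term = "";
        int positionIncrement = 1;

        /** Resets every attribute to its default value. */
        void clearAttributes() {
            term = "";
            positionIncrement = 1;
        }
    }

    /** Source of tokens: splits on whitespace and optionally clears attributes first. */
    static final class WhitespaceSource {
        private final String[] words;
        private final Attributes atts;
        private final boolean clearFirst;
        private int next = 0;

        WhitespaceSource(String text, Attributes atts, boolean clearFirst) {
            this.words = text.trim().split("\\s+");
            this.atts = atts;
            this.clearFirst = clearFirst;
        }

        boolean incrementToken() {
            if (next >= words.length) {
                return false;
            }
            if (clearFirst) {
                atts.clearAttributes();
            }
            // Only the term is set here; positionIncrement is never touched.
            atts.term = words[next++];
            return true;
        }
    }

    /** Filter that sets positionIncrement to 0 for "quick" and never changes it back. */
    static final class StackingFilter {
        private final WhitespaceSource input;
        private final Attributes atts;

        StackingFilter(WhitespaceSource input, Attributes atts) {
            this.input = input;
            this.atts = atts;
        }

        boolean incrementToken() {
            if (!input.incrementToken()) {
                return false;
            }
            if (atts.term.equals("quick")) {
                atts.positionIncrement = 0;
            }
            return true;
        }
    }

    /** Collects the position increment observed by the consumer for each token. */
    static List<Integer> increments(String text, boolean clearFirst) {
        Attributes atts = new Attributes();
        StackingFilter stream =
                new StackingFilter(new WhitespaceSource(text, atts, clearFirst), atts);
        List<Integer> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(atts.positionIncrement);
        }
        return out;
    }

    public static void main(String[] args) {
        // Source does not clear: the 0 set by the filter leaks into later tokens.
        System.out.println(increments("the quick brown fox", false)); // [1, 0, 0, 0]
        // Source clears first: every token starts again from the default of 1.
        System.out.println(increments("the quick brown fox", true));  // [1, 0, 1, 1]
    }
}
```

In this sketch the filter never has to undo its own change; resetting at the 
source is enough to keep one token's value from leaking into the next.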

