[ 
https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867437#comment-13867437
 ] 

Benson Margulies commented on LUCENE-5386:
------------------------------------------

Let me try to restate the above in my own words to make sure I understand it.

At #end(), all the pieces of an analysis chain are responsible for putting the 
attributes into a consistent state that reflects the end of the input. 
TokenStream itself takes care of PositionIncrementAttribute. Only the Tokenizer 
can take care of OffsetAttribute, but it's easy to forget -- and if there are 
other interesting things going on, a Tokenizer or anything else may have other 
work to do. 

So Rob's thoughts above are to make Tokenizer or a derivative track the final 
offset, which is simple, and to have a protocol that keeps PositionIncrement in 
line given the possibility of skipped tokens. To avoid loading up the 
'Tokenizer' class with too much stuff that someone might want to do for 
themselves, add an intermediate class for this and let Tokenizer proper stay 
lean.
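
To make the shape of that concrete, here is a rough, self-contained sketch. 
It deliberately does not use Lucene's real classes: SketchTokenStream, 
SketchTokenizer, OffsetTrackingTokenizer, skippedPositions() and the 
whitespace tokenizer are all hypothetical stand-ins, and plain int fields 
stand in for the attributes. Only finalOffset() comes from the issue 
description. Zeroing the position increment in the base end() is an 
assumption of this sketch, not a statement about what TokenStream does.

```java
// Hypothetical stand-ins for illustration only; none of these are Lucene classes.
public class FinalOffsetSketch {

    /** Stand-in for the base stream; plain fields play the role of attributes. */
    static abstract class SketchTokenStream {
        int positionIncrement = 1;
        int startOffset;
        int endOffset;

        abstract boolean incrementToken();

        /** Assumption of this sketch: the base class resets the position increment. */
        void end() {
            positionIncrement = 0;
        }
    }

    /** The lean tokenizer base: holds the input, imposes no end-of-stream bookkeeping. */
    static abstract class SketchTokenizer extends SketchTokenStream {
        protected final String input;

        SketchTokenizer(String input) {
            this.input = input;
        }
    }

    /** The intermediate class: subclasses must report where the input ended. */
    static abstract class OffsetTrackingTokenizer extends SketchTokenizer {
        OffsetTrackingTokenizer(String input) {
            super(input);
        }

        /** From the issue's proposal: the offset just past the last consumed char. */
        abstract int finalOffset();

        /** Hypothetical hook: positions skipped after the last emitted token. */
        int skippedPositions() {
            return 0;
        }

        /** Final, so a subclass cannot forget to set the final offset. */
        @Override
        final void end() {
            super.end();
            positionIncrement += skippedPositions();
            int fo = finalOffset();
            startOffset = fo;
            endOffset = fo;
        }
    }

    /** Toy whitespace tokenizer built on the intermediate class. */
    static final class WhitespaceSketchTokenizer extends OffsetTrackingTokenizer {
        private int pos = 0;
        String term;

        WhitespaceSketchTokenizer(String input) {
            super(input);
        }

        @Override
        boolean incrementToken() {
            // Skip leading whitespace; trailing whitespace is consumed here too.
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos == input.length()) {
                return false;
            }
            int start = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            term = input.substring(start, pos);
            startOffset = start;
            endOffset = pos;
            positionIncrement = 1;
            return true;
        }

        @Override
        int finalOffset() {
            return input.length();
        }
    }

    public static void main(String[] args) {
        WhitespaceSketchTokenizer t = new WhitespaceSketchTokenizer("ab cd  ");
        while (t.incrementToken()) {
            System.out.println(t.term + " [" + t.startOffset + "," + t.endOffset + ")");
        }
        t.end();
        System.out.println("final offset = " + t.endOffset);
    }
}
```

The point of making end() final in the intermediate class is that a 
concrete tokenizer only has to answer "where did the input end?", while 
anyone who wants full control can still extend the lean base class directly.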

I'll get organized to sketch some code.

> Make Tokenizers deliver their final offsets
> -------------------------------------------
>
>                 Key: LUCENE-5386
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5386
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Benson Margulies
>
> Tokenizers _must_ have an implementation of #end() in which they set up the 
> final offset. Currently, nothing enforces this. end() has a useful 
> implementation in TokenStream, so just making it abstract is not attractive.
> Proposal: add
>     abstract int finalOffset();
> to Tokenizer, and then make
>     void end() {
>         super.end();
>         int fo = finalOffset();
>         offsetAttr.setOffsets(fo, fo);
>     }
> or something to that effect.
> Other alternatives to be considered depending on how this looks.


