[
https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863584#comment-13863584
]
Robert Muir commented on LUCENE-5386:
-------------------------------------
I don't have a clear idea how it can relate, just some vague thoughts at the
moment.
The other attribute that is most commonly manipulated is the position
increment. Take a look at StandardTokenizer.end() for an example (when it
removes a too-long token), and FilteringTokenFilter.end() (the base class of
StopFilter & co).
{code}
@Override
public final void end() throws IOException {
  super.end();
  // set final offset
  int finalOffset = correctOffset(scanner.yychar() + scanner.yylength());
  offsetAtt.setOffset(finalOffset, finalOffset);
  // adjust any skipped tokens
  posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
}
{code}
Otherwise, skipped tokens at the end of a multi-valued field get lost forever.
As for other attributes, I guess it depends on what they do (e.g. they could
be custom ones where some end() logic makes sense too). But it would be good
if we could make the basics easier, while still allowing crazy custom classes
to do whatever they need.
As for how to tie this in and make it simpler, I'm not sure. One idea would
be a TokenizerBase that takes care of these two things, or maybe even some
other things (e.g. it provides a skipToken() method for subclasses to call and
implements the final position adjustment, and it wraps the reader and
implements the final offset). Sucks to add a new class though, but it's one
idea. Making a tokenizer is really hard. One that is easier to subclass but
can't do 100% of the crazy possibilities could be a nice balance: experts could
still subclass Tokenizer directly.
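To make the TokenizerBase idea a bit more concrete, here is a minimal
standalone sketch. It is only a model: it does not use Lucene's classes, and
SketchTokenizerBase, SketchWhitespaceTokenizer, emit(), and the plain int
fields standing in for OffsetAttribute and PositionIncrementAttribute are all
hypothetical names made up for illustration. skipToken() and finalOffset()
follow the names floated in this issue, but their signatures are assumptions.

```java
/** Standalone model of the hypothetical TokenizerBase; NOT Lucene's API. */
abstract class SketchTokenizerBase {
  // Assumed stand-ins for OffsetAttribute / PositionIncrementAttribute / term.
  private int startOffset;
  private int endOffset;
  private int positionIncrement = 1;
  private String term;

  // Tokens dropped since the last emitted token.
  private int skippedPositions;

  /** Subclasses call this instead of emitting a token they want to drop. */
  protected final void skipToken() {
    skippedPositions++;
  }

  /** Subclasses call this to emit a token; pending skips are folded in. */
  protected final void emit(String term, int start, int end) {
    this.term = term;
    this.startOffset = start;
    this.endOffset = end;
    this.positionIncrement = 1 + skippedPositions;
    this.skippedPositions = 0;
  }

  /** The only end-of-stream detail a subclass has to supply. */
  protected abstract int finalOffset();

  abstract boolean incrementToken();

  /** Final: sets the final offset and reports trailing skipped positions. */
  final void end() {
    int fo = finalOffset();
    term = null;
    startOffset = fo;
    endOffset = fo;
    positionIncrement = skippedPositions;
    skippedPositions = 0;
  }

  String term() { return term; }
  int startOffset() { return startOffset; }
  int endOffset() { return endOffset; }
  int positionIncrement() { return positionIncrement; }
}

/** Example subclass: splits on whitespace and drops too-long tokens. */
final class SketchWhitespaceTokenizer extends SketchTokenizerBase {
  private final String input;
  private final int maxTokenLength;
  private int pos;

  SketchWhitespaceTokenizer(String input, int maxTokenLength) {
    this.input = input;
    this.maxTokenLength = maxTokenLength;
  }

  @Override
  boolean incrementToken() {
    while (true) {
      // Skip leading whitespace.
      while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
        pos++;
      }
      if (pos >= input.length()) {
        return false;
      }
      int start = pos;
      while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
        pos++;
      }
      if (pos - start > maxTokenLength) {
        skipToken(); // dropped, but its position is remembered
        continue;
      }
      emit(input.substring(start, pos), start, pos);
      return true;
    }
  }

  @Override
  protected int finalOffset() {
    return input.length();
  }
}
```

With the input "a toolongtoken b toolongtoken" and a maximum token length of 3,
this model yields "a" (increment 1) and "b" (increment 2), and after end() it
reports a position increment of 1 for the trailing dropped token, with both
offsets equal to the input length. The subclass never touches end() at all,
which is the point of the idea.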
> Make Tokenizers deliver their final offsets
> -------------------------------------------
>
> Key: LUCENE-5386
> URL: https://issues.apache.org/jira/browse/LUCENE-5386
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Benson Margulies
>
> Tokenizers _must_ have an implementation of #end() in which they set up the
> final offset. Currently, nothing enforces this. end() has a useful
> implementation in TokenStream, so just making it abstract is not attractive.
> Proposal: add
>   abstract int finalOffset();
> to Tokenizer, and then make
>   void end() {
>     super.end();
>     int fo = finalOffset();
>     offsetAttr.setOffset(fo, fo);
>   }
> or something to that effect.
> Other alternatives to be considered depending on how this looks.