[jira] [Commented] (LUCENE-5386) Make Tokenizers deliver their final offsets

Robert Muir (JIRA) Mon, 06 Jan 2014 15:02:33 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863584#comment-13863584 ]


Robert Muir commented on LUCENE-5386:
-------------------------------------

I don't have a clear idea how it can relate, just some vague thoughts at the 
moment.

The other attribute most commonly manipulated in end() is the position 
increment. Take a look at StandardTokenizer.end() for an example (it accounts 
for too-long tokens that were removed), and at FilteringTokenFilter.end() 
(the base class of StopFilter & co).

{code}
  @Override
  public final void end() throws IOException {
    super.end();
    // set final offset
    int finalOffset = correctOffset(scanner.yychar() + scanner.yylength());
    offsetAtt.setOffset(finalOffset, finalOffset);
    // adjust any skipped tokens
    posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement()+skippedPositions);
  }
{code}

Otherwise, tokens skipped at the end of one value of a multi-valued field get 
lost forever. As for other attributes, I guess it depends on what they do 
(e.g. a custom one for which some end() logic makes sense too). But it would 
be good if we could make the basics easier, while still allowing crazy custom 
classes to do whatever they need.
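
To make the "lost forever" part concrete, here is a toy model of how positions 
accumulate across the values of a multi-valued field. It is only a sketch and 
uses no Lucene classes: the class name, the stop word list, and the 
reportSkippedAtEnd flag are made up for illustration, and the position 
increment gap between values is assumed to be 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: no Lucene classes, position increment gap assumed to be 0.
class SkippedPositionsDemo {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "of"));

    /** Returns the absolute positions of the kept tokens across all values. */
    static List<Integer> positions(List<String> values, boolean reportSkippedAtEnd) {
        List<Integer> result = new ArrayList<>();
        int position = -1; // position of the last token accounted for so far
        for (String value : values) {
            int skipped = 0; // tokens removed since the last kept token
            for (String term : value.split(" ")) {
                if (STOP_WORDS.contains(term)) {
                    skipped++; // removed token: remember the hole
                    continue;
                }
                position += 1 + skipped; // kept token carries earlier holes
                skipped = 0;
                result.add(position);
            }
            // Models end(): holes after the last kept token are only
            // visible to the consumer if end() reports them.
            if (reportSkippedAtEnd) {
                position += skipped;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("quick fox the", "lazy dog");
        System.out.println(positions(values, true));  // [0, 1, 3, 4]
        System.out.println(positions(values, false)); // [0, 1, 2, 3]
    }
}
```

In the second call the trailing "the" of the first value leaves no trace, so 
"lazy" lands directly after "fox".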

As for how to tie this in and make it simpler? I'm not sure. One idea would 
be a TokenizerBase that takes care of these two things, or maybe even some 
other things (e.g. it provides a skipToken() method for subclasses to call and 
implements the final position adjustment, and it wraps the reader and 
implements the final offset). It sucks to add a new class, but it's one idea. 
Making a tokenizer is really hard. One that is easier to subclass but can't do 
100% of the crazy possibilities could be a nice balance: experts could still 
subclass Tokenizer directly.
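
A rough sketch of what that TokenizerBase idea might look like. This is not 
Lucene code and not a worked-out proposal: it is plain Java, with int fields 
standing in for OffsetAttribute and PositionIncrementAttribute, and the names 
TokenizerBaseSketch, CountingReader, emit() and ShortWordTokenizer are 
hypothetical. A real version would also pass the final offset through 
correctOffset(), and the counting reader here ignores skip().

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch, not Lucene API: plain int fields stand in for
// OffsetAttribute and PositionIncrementAttribute.
abstract class TokenizerBaseSketch {
    // Counts the chars handed to the subclass so end() knows the final offset.
    private static final class CountingReader extends FilterReader {
        int charsRead = 0;
        CountingReader(Reader in) { super(in); }
        @Override public int read() throws IOException {
            int c = super.read();
            if (c != -1) charsRead++;
            return c;
        }
        @Override public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) charsRead += n;
            return n;
        }
    }

    private final CountingReader counting;
    protected final Reader input; // subclasses read from this wrapper
    private int skippedPositions = 0;
    int startOffset, endOffset, positionIncrement;

    protected TokenizerBaseSketch(Reader reader) {
        counting = new CountingReader(reader);
        input = counting;
    }

    abstract boolean incrementToken() throws IOException;

    /** Subclasses call this when they drop a token instead of emitting it. */
    protected final void skipToken() { skippedPositions++; }

    /** Subclasses call this for every token they emit. */
    protected final void emit(int start, int end) {
        startOffset = start;
        endOffset = end;
        positionIncrement = 1 + skippedPositions;
        skippedPositions = 0;
    }

    /** Final offset and trailing skipped positions, handled once for everyone. */
    final void end() {
        startOffset = endOffset = counting.charsRead;
        positionIncrement = skippedPositions;
    }
}

// Example subclass: whitespace tokenizer that drops too-long tokens
// (term text is omitted for brevity).
class ShortWordTokenizer extends TokenizerBaseSketch {
    private final int maxLength;
    private int offset = 0; // chars consumed so far

    ShortWordTokenizer(Reader reader, int maxLength) { super(reader); this.maxLength = maxLength; }

    @Override
    boolean incrementToken() throws IOException {
        StringBuilder sb = new StringBuilder();
        int start = offset;
        while (true) {
            int c = input.read();
            if (c != -1) offset++;
            if (c != -1 && !Character.isWhitespace(c)) {
                if (sb.length() == 0) start = offset - 1;
                sb.append((char) c);
                continue;
            }
            if (sb.length() > 0) {
                if (sb.length() <= maxLength) {
                    emit(start, start + sb.length());
                    return true;
                }
                skipToken(); // too long: drop it but keep the hole
                sb.setLength(0);
            }
            if (c == -1) return false;
        }
    }
}

class TokenizerBaseDemo {
    // Returns {token count, final offset, final position increment}.
    static int[] run(String text, int maxLength) {
        ShortWordTokenizer t = new ShortWordTokenizer(new StringReader(text), maxLength);
        int tokens = 0;
        try {
            while (t.incrementToken()) tokens++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        t.end();
        return new int[] { tokens, t.endOffset, t.positionIncrement };
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run("a bb toolongword ", 3)));
    }
}
```

The subclass never touches end(): it only reads from input and calls emit() 
or skipToken(), which is roughly the "easier to subclass" trade-off described 
above.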


> Make Tokenizers deliver their final offsets
> -------------------------------------------
>
>                 Key: LUCENE-5386
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5386
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Benson Margulies
>
> Tokenizers _must_ have an implementation of #end() in which they set up the 
> final offset. Currently, nothing enforces this. end() has a useful 
> implementation in TokenStream, so just making it abstract is not attractive.
> Proposal: add
>   abstract int finalOffset();
> to Tokenizer, and then make
>     void end() {
>         super.end();
>         int fo = finalOffset();
>         offsetAttr.setOffset(fo, fo);
>     }
> or something to that effect.
> Other alternatives to be considered depending on how this looks.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
