[ 
https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867446#comment-13867446
 ] 

Robert Muir commented on LUCENE-5386:
-------------------------------------

{quote}
TokenStream itself takes care of PositionIncrementAttribute. 
{quote}

Well, it doesn't really "take care" of it at all. Any tokenizer/tokenfilter that 
"removes" tokens and has the concept of "holes" (e.g. StopFilter, or 
StandardTokenizer when it drops a too-long token) has to provide that 
information here, or it's lost forever.

Let's make a concrete example: pages of a book. Let's say each page is a value 
in a multi-valued field.

page1 ends with "the quick brown fox jumps over the"
and then page2 starts with "lazy dog."

As you know, internally both "values" are each independently analyzed but 
basically concatenated together in IndexWriter, with 
Analyzer.getPositionIncrementGap() [default=0] in between.

In this case, phrase queries will not work correctly for this sentence unless 
the analyzer propagates in end() that the trailing 'the' was removed, so that 
the position increment is increased before "lazy".

Only the guy who removed that token (e.g. StopFilter) knows this, so it must 
provide it in end().
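To make the arithmetic concrete, here is a toy pure-Java model of that bookkeeping (all names invented for illustration; this is not Lucene's actual StopFilter code): a removed token leaves a "hole" in the position sequence, and holes after the last surviving token have to be carried out via the increment reported at end().

```java
import java.util.*;

/**
 * Toy model (not Lucene code) of how a stop filter must report trailing
 * "holes" via end() so positions line up across multi-valued field values.
 */
class TrailingHoleDemo {

    /** Result of "analyzing" one value: surviving tokens with absolute
     *  positions, plus the position increment end() must report. */
    static final class Analyzed {
        final List<String> tokens = new ArrayList<>();
        final List<Integer> positions = new ArrayList<>();
        int finalPosInc; // holes after the last kept token, reported in end()
    }

    static Analyzed analyze(String text, Set<String> stopwords) {
        Analyzed out = new Analyzed();
        int pos = -1;     // position of the last emitted token
        int pending = 1;  // increment accumulated across removed tokens
        for (String tok : text.split("\\s+")) {
            if (stopwords.contains(tok)) {
                pending++;            // removed token leaves a hole
            } else {
                pos += pending;
                out.tokens.add(tok);
                out.positions.add(pos);
                pending = 1;
            }
        }
        out.finalPosInc = pending - 1; // trailing holes only
        return out;
    }

    /** Position of the first token of the next value, as IndexWriter would
     *  compute it: last position + end() increment + gap + first increment. */
    static int nextValueStart(Analyzed prev, int posIncGap) {
        int last = prev.positions.get(prev.positions.size() - 1);
        return last + prev.finalPosInc + posIncGap + 1;
    }
}
```

With stopword "the", page1 leaves "over" at position 5 and a trailing hole, so end() reports an extra increment of 1 and "lazy" starts at position 7, keeping position 6 free for the removed "the"; without that report, "lazy" would land at 6 and the phrase query would match the wrong thing.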


But yeah, I agree with your assessment: I think we want the Tokenizer class to 
be very simple, just "a TokenStream that works on a Reader", and we want it to 
be very flexible. On the other hand it sucks for the simple, typical use cases 
(the ones that happen 99% of the time) to be so difficult, when honestly I 
think most tokenizers are only going to worry about offsets and positions in 
end().

So a compromise that makes easy things easy and still leaves sophisticated 
things possible is to provide some subclass, even if it's "limited" in ways 
that mean it can't do all the crazy stuff. If it can make the ordinary use 
cases that happen 99% of the time easier, I think it would be really helpful. 
It's at least a good step in the right direction that won't hurt anyone.
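A minimal pure-Java sketch of what such a convenience base class could look like (hypothetical names, not actual Lucene API; the offset fields stand in for OffsetAttribute): subclasses only report their final offset, and the base class applies it in end(), in the spirit of the finalOffset() proposal in the issue description.

```java
/**
 * Hypothetical convenience base class (sketch only, not Lucene API).
 * Subclasses report their final offset; end() applies it to both the
 * start and end offset, which is what a Tokenizer must do when the
 * stream is exhausted.
 */
abstract class SimpleTokenizerSketch {
    int startOffset; // stand-ins for OffsetAttribute state
    int endOffset;

    /** E.g. the total number of characters read from the Reader. */
    abstract int finalOffset();

    /** Mirrors the proposed default: set both offsets to finalOffset(). */
    void end() {
        int fo = finalOffset();
        startOffset = fo;
        endOffset = fo;
    }
}
```

A concrete tokenizer would then only track how many characters it consumed; the crazy cases that need more than offsets in end() could still override end() or extend Tokenizer directly.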

> Make Tokenizers deliver their final offsets
> -------------------------------------------
>
>                 Key: LUCENE-5386
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5386
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Benson Margulies
>
> Tokenizers _must_ have an implementation of #end() in which they set up the 
> final offset. Currently, nothing enforces this. end() has a useful 
> implementation in TokenStream, so just making it abstract is not attractive.
> Proposal: add
>     abstract int finalOffset();
> to Tokenizer, and then make
>     void end() {
>         super.end();
>         int fo = finalOffset();
>         offsetAttr.setOffset(fo, fo);
>     }
> or something to that effect.
> Other alternative to be considered depending on how this looks.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
