In case it wasn't clear, I am +1 for Alan's plan. We can always restore
offset alterations here if at some future date we figure out how to do it
correctly.

On Fri, Oct 26, 2018 at 6:08 AM Michael Sokolov <[email protected]> wrote:

> The current situation is that it is impossible to apply offsets correctly
> in a TokenFilter. It seems to work OK most of the time, but truly correct
> behavior relies on prior components in the chain not having altered the
> length of tokens, which some of them occasionally do. For complete
> correctness in this area, I believe there are only really two
> possibilities: one is to stop trying to provide offsets in token filters,
> as in this issue, and the other would be to add some mechanism for allowing
> token filters to access the "correct" offset. (We could instead try to
> prevent token filters from adding or removing characters, but that seems
> like a nonstarter for a lot of reasons.) I put up a patch that allows for
> correct offsets, but there was some consensus, which I am coming around
> to, that the amount of API change was not justified by the fairly minor
> benefit of accurate within-token highlighting.
>
> On Wed, Oct 24, 2018 at 10:40 PM Michael Gibney (JIRA) <[email protected]>
> wrote:
>
>>
>>     [
>> https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663115#comment-16663115
>> ]
>>
>> Michael Gibney commented on LUCENE-8509:
>> ----------------------------------------
>>
>> > The trim filter removes the leading space from the second token,
>> leaving offsets unchanged, so WDGF sees "bb"[1,4];
>>
>> If I understand correctly what [~dsmiley] is saying, then to put it
>> another way: doesn't this look more like an issue with {{TrimFilter}}? If
>> WDGF sees as input from {{TrimFilter}} "bb"[1,4] (instead of " bb"[1,4] or
>> "bb"[2,4]), then it's handling the input correctly, but the input is wrong.
>>
>> "because tokenization splits offsets and WDGF is playing the role of a
>> tokenizer" -- this behavior is notably different from what
>> {{SynonymGraphFilter}} does (adding externally-specified alternate
>> representations of input tokens). Offsets are really only meaningful with
>> respect to input, and new tokens introduced by WDGF are directly derived
>> from input, while new tokens introduced by {{SynonymGraphFilter}} are not
>> and thus can _only_ inherit offsets of the input token.
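
[A toy Python sketch of the distinction above, not Lucene code: the function
names, the `-` delimiter, and the example tokens are invented for
illustration. Tokens are modeled as (text, startOffset, endOffset) tuples. A
synonym is an external string with no position in the input, so it can only
inherit the input token's offsets wholesale; a split token's parts are
substrings of the input, so each part's offsets can be derived from its
character positions.]

```python
# Toy illustration (not Lucene code) of the two offset policies described above.

def add_synonym(tok, synonym):
    """Synonym-style: the alternate form has no span in the input text,
    so it can only inherit the input token's offsets unchanged."""
    text, start, end = tok
    return [(text, start, end), (synonym, start, end)]

def split_parts(tok):
    """Word-delimiter-style: each part IS a substring of the input,
    so its offsets can be computed from character positions."""
    text, start, end = tok
    parts, i = [], 0
    for part in text.split("-"):
        parts.append((part, start + i, start + i + len(part)))
        i += len(part) + 1  # skip the delimiter character
    return parts

print(add_synonym(("wifi", 10, 14), "wlan"))  # [('wifi', 10, 14), ('wlan', 10, 14)]
print(split_parts(("wi-fi", 10, 15)))         # [('wi', 10, 12), ('fi', 13, 15)]
```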
>>
>> > NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination
>> can produce backwards offsets
>> >
>> ----------------------------------------------------------------------------------------------------
>> >
>> >                 Key: LUCENE-8509
>> >                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>> >             Project: Lucene - Core
>> >          Issue Type: Task
>> >            Reporter: Alan Woodward
>> >            Assignee: Alan Woodward
>> >            Priority: Major
>> >         Attachments: LUCENE-8509.patch
>> >
>> >
>> > Discovered by an elasticsearch user and described here:
>> https://github.com/elastic/elasticsearch/issues/33710
>> > The ngram tokenizer produces tokens "a b" and " bb" (note the space at
>> the beginning of the second token).  The WDGF takes the first token and
>> splits it into two, adjusting the offsets of the second token, so we get
>> "a"[0,1] and "b"[2,3].  The trim filter removes the leading space from the
>> second token, leaving offsets unchanged, so WDGF sees "bb"[1,4]; because
>> the leading space has already been stripped, WDGF sees no need to adjust
>> offsets, and emits the token as-is, resulting in the start offsets of the
>> tokenstream being [0, 2, 1], and the IndexWriter rejecting it.
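
[The walkthrough above can be reproduced with a toy Python simulation; this
is not Lucene code, and each stage is a simplified stand-in for the named
component. Tokens are (text, startOffset, endOffset) tuples over the assumed
input "a bb".]

```python
# Toy simulation (not Lucene code) of the chain described above:
# NGramTokenizer -> TrimFilter -> WordDelimiterGraphFilter.

def ngrams(text, n):
    """Stand-in for NGramTokenizer: every n-gram with its offsets."""
    return [(text[i:i + n], i, i + n) for i in range(len(text) - n + 1)]

def trim(tok):
    """Stand-in for TrimFilter: strips whitespace but leaves offsets alone."""
    text, start, end = tok
    return (text.strip(), start, end)

def split_on_space(tok):
    """Toy stand-in for WDGF: tokens without a delimiter pass through as-is;
    split tokens get offsets derived from character positions in the text."""
    text, start, end = tok
    if " " not in text:
        return [tok]
    out, i = [], 0
    while i < len(text):
        if text[i] == " ":
            i += 1
            continue
        j = i
        while j < len(text) and text[j] != " ":
            j += 1
        out.append((text[i:j], start + i, start + j))
        i = j
    return out

stream = []
for tok in ngrams("a bb", 3):         # "a b"[0,3], " bb"[1,4]
    stream.extend(split_on_space(trim(tok)))  # trim drops the space, not the offset

print(stream)  # [('a', 0, 1), ('b', 2, 3), ('bb', 1, 4)] -- starts go 0, 2, 1
```

Because the trimmed " bb" reaches the splitter with its original offsets
[1,4] and contains no delimiter, it is emitted as-is, and the start offsets
0, 2, 1 are no longer non-decreasing.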
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v7.6.3#76005)
>>
