Ah, I see -- thanks, Michael. To make sure I understand correctly, this
particular case (with this particular order of analysis components) *would*
in fact be fixed by causing TrimFilter to update offsets. But for the sake
of argument, if we had some filter *before* TrimFilter that for some reason
*added* an extra leading space, then TrimFilter would have no way of
knowing whether to update the startOffset by +1 (correct) or +2 (incorrect,
but probably the most likely way to implement). Or a less contrived
example: if you applied SynonymGraphFilter before WDGF (which would seem
weird, but could happen) that would break all correspondence between the
token text and the input offsets, and *any* manipulation of offsets by WDGF
would be based on the false assumption of such a correspondence.
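
To make the ambiguity concrete, here is a minimal sketch in plain Java (Token, addLeadingSpace, and naiveTrim are hypothetical stand-ins, not Lucene APIs): a hypothetical upstream filter adds a second, synthetic leading space to " bb"[1,4], and a trim filter that bumps startOffset by the number of characters it strips lands on 3 rather than the correct 2.

```java
// Hypothetical stand-ins for illustration; none of this is Lucene API.
class Token {
    final String text;
    final int startOffset, endOffset;
    Token(String text, int startOffset, int endOffset) {
        this.text = text;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

class OffsetAmbiguity {
    // A filter that prepends a space but leaves offsets alone, since
    // offsets refer to positions in the original input.
    static Token addLeadingSpace(Token t) {
        return new Token(" " + t.text, t.startOffset, t.endOffset);
    }

    // A TrimFilter that tries to "fix" startOffset by counting how many
    // characters it stripped -- the "most likely way to implement" above.
    static Token naiveTrim(Token t) {
        int i = 0;
        while (i < t.text.length() && t.text.charAt(i) == ' ') i++;
        return new Token(t.text.substring(i), t.startOffset + i, t.endOffset);
    }

    public static void main(String[] args) {
        // The tokenizer emitted " bb" at input offsets [1,4]: one real space.
        Token fromTokenizer = new Token(" bb", 1, 4);
        // An upstream filter injects a second, synthetic leading space.
        Token padded = addLeadingSpace(fromTokenizer);
        Token trimmed = naiveTrim(padded);
        // Two spaces were stripped, so startOffset became 1 + 2 = 3, but only
        // one of those spaces existed in the input; the correct offset is 2.
        System.out.println(trimmed.text + "[" + trimmed.startOffset + ","
            + trimmed.endOffset + "]"); // prints bb[3,4]
    }
}
```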

I think that makes me also +1 for Alan's suggestion.

While we're at it though, thinking ahead a little more about "figure out
how to do it correctly", I can think of only 2 possibilities, each
requiring an extra Attribute, and one of the possibilities is crazy:

The crazy idea: have an Attribute that maps each input character offset to
a corresponding character position in the token text ... but actually I
don't think that would even work, so never mind.

The not-so-crazy idea: have a boolean Attribute that tracks whether there
is a 1:1 correspondence between the input offsets and the token text. Any
TokenFilter doing the kind of manipulation that *would* affect offsets
could check for the presence of this Attribute (which would default to
false), and iff present and true, could update offsets. I think that should
be robust, and could leave the behavior of a lot of existing configurations
unchanged (since TrimFilter, WDGF, and the like are often applied early in
the analysis chain); this would also avoid the need to potentially modify
tests for highlighting, etc...
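
For what it's worth, a rough sketch of what that boolean Attribute could look like, in plain Java rather than Lucene's actual Attribute/AttributeSource machinery (the name OffsetCorrespondenceAttribute and everything else here is hypothetical):

```java
// Hypothetical sketch; not Lucene API. The attribute defaults to false,
// and only a component that knows offsets still map 1:1 onto the token
// text would set it to true.
class OffsetCorrespondenceAttribute {
    private boolean correspondent = false;
    boolean isCorrespondent() { return correspondent; }
    void setCorrespondent(boolean v) { correspondent = v; }
}

class TokenState {
    String text;
    int startOffset, endOffset;
    final OffsetCorrespondenceAttribute offsetCorrespondence =
        new OffsetCorrespondenceAttribute();
    TokenState(String text, int startOffset, int endOffset) {
        this.text = text;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

class CorrespondenceSketch {
    // What a Trim-like filter could do: strip leading spaces, but only
    // touch startOffset when the chain has promised a 1:1 correspondence
    // between offsets and token text.
    static void trim(TokenState t) {
        int stripped = 0;
        while (stripped < t.text.length() && t.text.charAt(stripped) == ' ') {
            stripped++;
        }
        if (stripped == 0) return;
        t.text = t.text.substring(stripped);
        if (t.offsetCorrespondence.isCorrespondent()) {
            t.startOffset += stripped; // safe: offsets still line up with text
        }
        // else: leave offsets alone (today's behavior), rather than guess.
    }

    // What a SynonymGraphFilter-like component would do for an injected
    // token: the new text is not derived from the input, so any 1:1
    // correspondence is gone and the flag must be cleared.
    static void injectSynonym(TokenState t, String synonym) {
        t.text = synonym;
        t.offsetCorrespondence.setCorrespondent(false);
    }
}
```

So in the " bb"[1,4] example, a tokenizer that sets the flag would let trim() safely produce "bb"[2,4], while a token that has passed through injectSynonym() would keep its offsets untouched.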

Michael

On Fri, Oct 26, 2018 at 9:10 AM Michael Sokolov <[email protected]> wrote:

> In case it wasn't clear, I am +1 for Alan's plan. We can always restore
> offset-alterations here if at some future date we figure out how to do it
> correctly.
>
> On Fri, Oct 26, 2018 at 6:08 AM Michael Sokolov <[email protected]>
> wrote:
>
>> The current situation is that it is impossible to apply offsets correctly
>> in a TokenFilter. It seems to work OK most of the time, but truly correct
>> behavior relies on prior components in the chain not having altered the
>> length of tokens, which some of them occasionally do. For complete
>> correctness in this area, I believe there are only really two
>> possibilities: one is to stop trying to provide offsets in token filters,
>> as in this issue, and the other would be to add some mechanism for allowing
>> token filters to access the "correct" offset.  Well I guess we could try to
>> prevent token filters from adding or removing characters, but that seems
>> like a nonstarter for a lot of reasons. I put up a patch that allows for
>> correct offsetting, but I think there was some consensus, and I am coming
>> around to this position, that the amount of API change was not justified by
>> the pretty minor benefit of having accurate within-token highlighting.
>>
>> On Wed, Oct 24, 2018 at 10:40 PM Michael Gibney (JIRA) <[email protected]>
>> wrote:
>>
>>>
>>>     [
>>> https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663115#comment-16663115
>>> ]
>>>
>>> Michael Gibney commented on LUCENE-8509:
>>> ----------------------------------------
>>>
>>> > The trim filter removes the leading space from the second token,
>>> leaving offsets unchanged, so WDGF sees "bb"[1,4];
>>>
>>> If I understand correctly what [~dsmiley] is saying, then to put it
>>> another way: doesn't this look more like an issue with {{TrimFilter}}? If
>>> WDGF sees as input from {{TrimFilter}} "bb"[1,4] (instead of " bb"[1,4] or
>>> "bb"[2,4]), then it's handling the input correctly, but the input is wrong.
>>>
>>> "because tokenization splits offsets and WDGF is playing the role of a
>>> tokenizer" -- this behavior is notably different from what
>>> {{SynonymGraphFilter}} does (adding externally-specified alternate
>>> representations of input tokens). Offsets are really only meaningful with
>>> respect to input, and new tokens introduced by WDGF are directly derived
>>> from input, while new tokens introduced by {{SynonymGraphFilter}} are not
>>> and thus can _only_ inherit offsets of the input token.
>>>
>>> > NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination
>>> can produce backwards offsets
>>> >
>>> ----------------------------------------------------------------------------------------------------
>>> >
>>> >                 Key: LUCENE-8509
>>> >                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>>> >             Project: Lucene - Core
>>> >          Issue Type: Task
>>> >            Reporter: Alan Woodward
>>> >            Assignee: Alan Woodward
>>> >            Priority: Major
>>> >         Attachments: LUCENE-8509.patch
>>> >
>>> >
>>> > Discovered by an elasticsearch user and described here:
>>> https://github.com/elastic/elasticsearch/issues/33710
>>> > The ngram tokenizer produces tokens "a b" and " bb" (note the space at
>>> the beginning of the second token).  The WDGF takes the first token and
>>> splits it into two, adjusting the offsets of the second token, so we get
>>> "a"[0,1] and "b"[2,3].  The trim filter removes the leading space from the
>>> second token, leaving offsets unchanged, so WDGF sees "bb"[1,4]; because
>>> the leading space has already been stripped, WDGF sees no need to adjust
>>> offsets, and emits the token as-is, resulting in the start offsets of the
>>> tokenstream being [0, 2, 1], and the IndexWriter rejecting it.
>>>
>>>
>>>
>>> --
>>> This message was sent by Atlassian JIRA
>>> (v7.6.3#76005)
>>>
>>>
