The current situation is that it is impossible to apply offsets correctly in a TokenFilter. It seems to work OK most of the time, but truly correct behavior relies on prior components in the chain not having altered the length of tokens, which some of them occasionally do. For complete correctness here, I believe there are really only two possibilities: one is to stop trying to provide offsets in token filters, as in this issue; the other would be to add some mechanism that lets token filters access the "correct" offset. I suppose we could also try to prevent token filters from adding or removing characters, but that seems like a nonstarter for a lot of reasons. I put up a patch that allows for correct offsetting, but there was some consensus (and I am coming around to this position) that the amount of API change was not justified by the fairly minor benefit of accurate within-token highlighting.
On Wed, Oct 24, 2018 at 10:40 PM Michael Gibney (JIRA) <[email protected]> wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663115#comment-16663115 ]
>
> Michael Gibney commented on LUCENE-8509:
> ----------------------------------------
>
> > The trim filter removes the leading space from the second token, leaving offsets unchanged, so WDGF sees "bb"[1,4];
>
> If I understand correctly what [~dsmiley] is saying, then to put it another way: doesn't this look more like an issue with {{TrimFilter}}? If WDGF sees as input from {{TrimFilter}} "bb"[1,4] (instead of " bb"[1,4] or "bb"[2,4]), then it's handling the input correctly, but the input is wrong.
>
> "because tokenization splits offsets and WDGF is playing the role of a tokenizer" -- this behavior is notably different from what {{SynonymGraphFilter}} does (adding externally-specified alternate representations of input tokens). Offsets are really only meaningful with respect to input, and new tokens introduced by WDGF are directly derived from input, while new tokens introduced by {{SynonymGraphFilter}} are not and thus can _only_ inherit offsets of the input token.
>
> > NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets
> > ----------------------------------------------------------------------------------------------------
> >
> >          Key: LUCENE-8509
> >          URL: https://issues.apache.org/jira/browse/LUCENE-8509
> >      Project: Lucene - Core
> >   Issue Type: Task
> >     Reporter: Alan Woodward
> >     Assignee: Alan Woodward
> >     Priority: Major
> >  Attachments: LUCENE-8509.patch
> >
> > Discovered by an elasticsearch user and described here: https://github.com/elastic/elasticsearch/issues/33710
> >
> > The ngram tokenizer produces tokens "a b" and " bb" (note the space at the beginning of the second token). The WDGF takes the first token and splits it into two, adjusting the offsets of the second token, so we get "a"[0,1] and "b"[2,3]. The trim filter removes the leading space from the second token, leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space has already been stripped, WDGF sees no need to adjust offsets, and emits the token as-is, resulting in the start offsets of the tokenstream being [0, 2, 1], and the IndexWriter rejecting it.
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
