In case it wasn't clear, I am +1 for Alan's plan. We can always restore offset-alterations here if at some future date we figure out how to do it correctly.
On Fri, Oct 26, 2018 at 6:08 AM Michael Sokolov <[email protected]> wrote:

> The current situation is that it is impossible to apply offsets correctly
> in a TokenFilter. It seems to work OK most of the time, but truly correct
> behavior relies on prior components in the chain not having altered the
> length of tokens, which some of them occasionally do. For complete
> correctness in this area, I believe there are only really two
> possibilities: one is to stop trying to provide offsets in token filters,
> as in this issue, and the other would be to add some mechanism for allowing
> token filters to access the "correct" offset. Well I guess we could try to
> prevent token filters from adding or removing characters, but that seems
> like a nonstarter for a lot of reasons. I put up a patch that allows for
> correct offsetting, but I think there was some consensus, and I am coming
> around to this position, that the amount of API change was not justified by
> the pretty minor benefit of having accurate within-token highlighting.
>
> On Wed, Oct 24, 2018 at 10:40 PM Michael Gibney (JIRA) <[email protected]> wrote:
>
>> [ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663115#comment-16663115 ]
>>
>> Michael Gibney commented on LUCENE-8509:
>> ----------------------------------------
>>
>> > The trim filter removes the leading space from the second token,
>> > leaving offsets unchanged, so WDGF sees "bb"[1,4];
>>
>> If I understand correctly what [~dsmiley] is saying, then to put it
>> another way: doesn't this look more like an issue with {{TrimFilter}}? If
>> WDGF sees as input from {{TrimFilter}} "bb"[1,4] (instead of " bb"[1,4] or
>> "bb"[2,4]), then it's handling the input correctly, but the input is wrong.
>>
>> "because tokenization splits offsets and WDGF is playing the role of a
>> tokenizer" -- this behavior is notably different from what
>> {{SynonymGraphFilter}} does (adding externally-specified alternate
>> representations of input tokens). Offsets are really only meaningful with
>> respect to input, and new tokens introduced by WDGF are directly derived
>> from input, while new tokens introduced by {{SynonymGraphFilter}} are not
>> and thus can _only_ inherit offsets of the input token.
>>
>> > NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination
>> > can produce backwards offsets
>> > ----------------------------------------------------------------------------------------------------
>> >
>> >                 Key: LUCENE-8509
>> >                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>> >             Project: Lucene - Core
>> >          Issue Type: Task
>> >            Reporter: Alan Woodward
>> >            Assignee: Alan Woodward
>> >            Priority: Major
>> >         Attachments: LUCENE-8509.patch
>> >
>> > Discovered by an elasticsearch user and described here:
>> > https://github.com/elastic/elasticsearch/issues/33710
>> > The ngram tokenizer produces tokens "a b" and " bb" (note the space at
>> > the beginning of the second token). The WDGF takes the first token and
>> > splits it into two, adjusting the offsets of the second token, so we get
>> > "a"[0,1] and "b"[2,3]. The trim filter removes the leading space from the
>> > second token, leaving offsets unchanged, so WDGF sees "bb"[1,4]; because
>> > the leading space has already been stripped, WDGF sees no need to adjust
>> > offsets, and emits the token as-is, resulting in the start offsets of the
>> > tokenstream being [0, 2, 1], and the IndexWriter rejecting it.
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v7.6.3#76005)
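For anyone trying to follow the offset arithmetic in the issue description, here is a pure-Python walk-through of the chain on the input "a bb" with 3-grams. This is a simulation of the behavior described above, not actual Lucene code; the function names and the `trim_adjusting` variant (the "bb"[2,4] alternative Gibney mentions) are invented for illustration.

```python
def ngrams(text, n):
    """Character n-gram 'tokenizer': offsets index into the original input."""
    return [(text[i:i + n], i, i + n) for i in range(len(text) - n + 1)]

def trim(tokens):
    """Like TrimFilter: strips whitespace but leaves offsets unchanged."""
    return [(t.strip(), s, e) for t, s, e in tokens]

def trim_adjusting(tokens):
    """Hypothetical alternative: move offsets inward to match the
    whitespace actually stripped."""
    out = []
    for t, s, e in tokens:
        lead = len(t) - len(t.lstrip())
        trail = len(t) - len(t.rstrip())
        out.append((t.strip(), s + lead, e - trail))
    return out

def wdgf_split(tokens):
    """WDGF-style: split on spaces, deriving sub-token offsets from the
    parent token's start offset; pass unsplit tokens through as-is."""
    out = []
    for text, start, end in tokens:
        parts = [p for p in text.split(" ") if p]
        if len(parts) == 1 and parts[0] == text:
            out.append((text, start, end))  # no delimiter seen: emit unchanged
            continue
        pos = 0
        for part in parts:
            idx = text.index(part, pos)
            out.append((part, start + idx, start + idx + len(part)))
            pos = idx + len(part)
    return out

# NGramTokenizer -> TrimFilter -> WDGF, as described in the issue:
tokens = wdgf_split(trim(ngrams("a bb", 3)))
print(tokens)                     # [('a', 0, 1), ('b', 2, 3), ('bb', 1, 4)]
print([s for _, s, _ in tokens])  # [0, 2, 1] -- start offsets go backwards

# If trim adjusted offsets instead, WDGF would see "bb"[2,4]:
fixed = wdgf_split(trim_adjusting(ngrams("a bb", 3)))
print([s for _, s, _ in fixed])   # [0, 2, 2] -- non-decreasing
```

The second print reproduces the [0, 2, 1] start offsets that IndexWriter rejects; the `trim_adjusting` variant shows why "bb"[2,4] would have kept the stream sorted, which is exactly the kind of offset alteration in a TokenFilter that the plan above proposes to stop doing.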
