Ah, I see -- thanks, Michael. To make sure I understand correctly, this particular case (with this particular order of analysis components) *would* in fact be fixed by causing TrimFilter to update offsets. But for the sake of argument, if we had some filter *before* TrimFilter that for some reason *added* an extra leading space, then TrimFilter would have no way of knowing whether to update the startOffset by +1 (correct) or +2 (incorrect, but probably the most likely way to implement). Or a less contrived example: if you applied SynonymGraphFilter before WDGF (which would seem weird, but could happen) that would break all correspondence between the token text and the input offsets, and *any* manipulation of offsets by WDGF would be based on the false assumption of such a correspondence.
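To make the ambiguity concrete, here's a toy model in plain Java (these are stand-ins I'm sketching for illustration, not the real Lucene TrimFilter or attribute classes): a trimming step only sees the current token text and offsets, so it cannot distinguish whitespace that was really present in the input from whitespace injected by an earlier filter.

```java
public class TrimAmbiguityDemo {
    // Toy token: text plus [start, end) character offsets into the original
    // input (a hypothetical stand-in for CharTermAttribute + OffsetAttribute).
    public record Token(String text, int start, int end) {}

    // A "naive" trim that advances startOffset by the number of leading
    // spaces it strips -- probably the most likely way to implement it.
    public static Token naiveTrim(Token t) {
        int i = 0;
        while (i < t.text().length() && t.text().charAt(i) == ' ') {
            i++;
        }
        return new Token(t.text().substring(i), t.start() + i, t.end());
    }

    public static void main(String[] args) {
        // Token " bb"[1,4]: the one leading space really is in the input,
        // so naive trim correctly yields "bb"[2,4].
        System.out.println(naiveTrim(new Token(" bb", 1, 4)));

        // Now suppose an earlier filter *prepended* an extra space, giving
        // "  bb" with unchanged offsets [1,4]. The correct result is still
        // "bb"[2,4] (+1), but naive trim strips two characters and emits
        // "bb"[3,4] (+2): the filter has no way to tell the cases apart.
        System.out.println(naiveTrim(new Token("  bb", 1, 4)));
    }
}
```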
I think that makes me also +1 for Alan's suggestion. While we're at it though, thinking ahead a little more about "figure out how to do it correctly", I can think of only two possibilities, each requiring an extra Attribute, and one of the possibilities is crazy.

The crazy idea: have an Attribute that maps each input character offset to a corresponding character position in the token text ... but actually I don't think that would even work, so never mind.

The not-so-crazy idea: have a boolean Attribute that tracks whether there is a 1:1 correspondence between the input offsets and the token text. Any TokenFilter doing the kind of manipulation that *would* affect offsets could check for the presence of this Attribute (which would default to false), and iff present and true, could update offsets. I think that should be robust, and could leave the behavior of a lot of existing configurations unchanged (since TrimFilter, WDGF, and the like are often applied early in the analysis chain); this would also avoid the need to potentially modify tests for highlighting, etc.

Michael

On Fri, Oct 26, 2018 at 9:10 AM Michael Sokolov <[email protected]> wrote:

> In case it wasn't clear, I am +1 for Alan's plan. We can always restore
> offset-alterations here if at some future date we figure out how to do it
> correctly.
>
> On Fri, Oct 26, 2018 at 6:08 AM Michael Sokolov <[email protected]>
> wrote:
>
>> The current situation is that it is impossible to apply offsets correctly
>> in a TokenFilter. It seems to work OK most of the time, but truly correct
>> behavior relies on prior components in the chain not having altered the
>> length of tokens, which some of them occasionally do. For complete
>> correctness in this area, I believe there are only really two
>> possibilities: one is to stop trying to provide offsets in token filters,
>> as in this issue, and the other would be to add some mechanism for allowing
>> token filters to access the "correct" offset.
>> Well I guess we could try to
>> prevent token filters from adding or removing characters, but that seems
>> like a nonstarter for a lot of reasons. I put up a patch that allows for
>> correct offsetting, but I think there was some consensus, and I am coming
>> around to this position, that the amount of API change was not justified by
>> the pretty minor benefit of having accurate within-token highlighting.
>>
>> On Wed, Oct 24, 2018 at 10:40 PM Michael Gibney (JIRA) <[email protected]>
>> wrote:
>>
>>> [
>>> https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663115#comment-16663115
>>> ]
>>>
>>> Michael Gibney commented on LUCENE-8509:
>>> ----------------------------------------
>>>
>>> > The trim filter removes the leading space from the second token,
>>> leaving offsets unchanged, so WDGF sees "bb"[1,4];
>>>
>>> If I understand correctly what [~dsmiley] is saying, then to put it
>>> another way: doesn't this look more like an issue with {{TrimFilter}}? If
>>> WDGF sees as input from {{TrimFilter}} "bb"[1,4] (instead of " bb"[1,4] or
>>> "bb"[2,4]), then it's handling the input correctly, but the input is wrong.
>>>
>>> "because tokenization splits offsets and WDGF is playing the role of a
>>> tokenizer" -- this behavior is notably different from what
>>> {{SynonymGraphFilter}} does (adding externally-specified alternate
>>> representations of input tokens). Offsets are really only meaningful with
>>> respect to input, and new tokens introduced by WDGF are directly derived
>>> from input, while new tokens introduced by {{SynonymGraphFilter}} are not
>>> and thus can _only_ inherit offsets of the input token.
>>>
>>> > NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination
>>> > can produce backwards offsets
>>> > ----------------------------------------------------------------------------------------------------
>>> >
>>> >                 Key: LUCENE-8509
>>> >                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>>> >             Project: Lucene - Core
>>> >          Issue Type: Task
>>> >            Reporter: Alan Woodward
>>> >            Assignee: Alan Woodward
>>> >            Priority: Major
>>> >         Attachments: LUCENE-8509.patch
>>> >
>>> >
>>> > Discovered by an elasticsearch user and described here:
>>> > https://github.com/elastic/elasticsearch/issues/33710
>>> > The ngram tokenizer produces tokens "a b" and " bb" (note the space at
>>> > the beginning of the second token). The WDGF takes the first token and
>>> > splits it into two, adjusting the offsets of the second token, so we get
>>> > "a"[0,1] and "b"[2,3]. The trim filter removes the leading space from the
>>> > second token, leaving offsets unchanged, so WDGF sees "bb"[1,4]; because
>>> > the leading space has already been stripped, WDGF sees no need to adjust
>>> > offsets, and emits the token as-is, resulting in the start offsets of the
>>> > tokenstream being [0, 2, 1], and the IndexWriter rejecting it.
>>>
>>>
>>>
>>> --
>>> This message was sent by Atlassian JIRA
>>> (v7.6.3#76005)
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
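P.S. For what it's worth, the failing sequence described in the quoted issue can be reproduced with a toy model of the analysis chain. These are plain-Java stand-ins I'm sketching, not the real NGramTokenizer/TrimFilter/WDGF; the split step assumes (falsely, by this point in the chain) a 1:1 correspondence between token text and input characters.

```java
import java.util.ArrayList;
import java.util.List;

public class BackwardsOffsetsDemo {
    // Toy token: text plus [start, end) offsets into the original input
    // (a hypothetical stand-in for CharTermAttribute + OffsetAttribute).
    public record Token(String text, int start, int end) {}

    // TrimFilter-like step: strips surrounding whitespace but, like the
    // real TrimFilter, leaves the offsets untouched.
    public static Token trim(Token t) {
        return new Token(t.text().strip(), t.start(), t.end());
    }

    // WDGF-like step: splits on spaces, computing sub-token offsets on the
    // assumption that token text corresponds 1:1 to the input characters.
    public static List<Token> split(Token t) {
        if (t.text().indexOf(' ') < 0) {
            return List.of(t); // nothing to split: pass the token through unchanged
        }
        List<Token> out = new ArrayList<>();
        int pos = 0;
        for (String part : t.text().split(" ", -1)) {
            if (!part.isEmpty()) {
                out.add(new Token(part, t.start() + pos, t.start() + pos + part.length()));
            }
            pos += part.length() + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // NGramTokenizer output for the problem input: "a b"[0,3] and " bb"[1,4].
        List<Token> ngrams = List.of(new Token("a b", 0, 3), new Token(" bb", 1, 4));
        List<Token> stream = new ArrayList<>();
        for (Token t : ngrams) {
            stream.addAll(split(trim(t))); // TrimFilter, then WDGF-like split
        }
        // Start offsets come out [0, 2, 1] -- "a"[0,1], "b"[2,3], "bb"[1,4].
        // They go backwards, which is exactly what IndexWriter rejects.
        stream.forEach(System.out::println);
    }
}
```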

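P.P.S. Going back to the "not-so-crazy idea" above: a minimal sketch of what the boolean correspondence Attribute might look like. These are plain-Java stand-ins, not Lucene's real Attribute/AttributeImpl machinery, and every name here is hypothetical.

```java
public class CorrespondenceAttributeSketch {
    // Stand-in for the org.apache.lucene.util.Attribute marker interface.
    public interface Attribute {}

    // The proposed attribute: true iff the token's text still corresponds
    // 1:1 to the input characters covered by its offsets.
    public interface OffsetCorrespondenceAttribute extends Attribute {
        boolean corresponds();
        void setCorresponds(boolean corresponds);
    }

    public static class OffsetCorrespondenceAttributeImpl
            implements OffsetCorrespondenceAttribute {
        private boolean corresponds = false; // defaults to false, as proposed

        @Override public boolean corresponds() { return corresponds; }
        @Override public void setCorresponds(boolean c) { this.corresponds = c; }
    }

    // How a filter like TrimFilter or WDGF might gate its offset updates:
    // touch offsets only iff the attribute is present and true.
    public static boolean mayUpdateOffsets(OffsetCorrespondenceAttribute att) {
        return att != null && att.corresponds();
    }
}
```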