[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665496#comment-16665496 ]

Michael Gibney commented on LUCENE-8509:
----------------------------------------

[ also from mailing list – sorry for the duplication ]

Ah, I see – thanks, [~sokolov]. To make sure I understand correctly, this 
particular case (with this particular order of analysis components) _would_ in 
fact be fixed by causing {{TrimFilter}} to update offsets. But for the sake of 
argument, if we had some filter _before_ {{TrimFilter}} that for some reason 
_added_ an extra leading space, then {{TrimFilter}} would have no way of 
knowing whether to update the {{startOffset}} by +1 (correct) or +2 (incorrect, 
but probably the most sensible way to implement). Or a less contrived example: 
if you applied {{SynonymGraphFilter}} before WDGF (which would seem weird, but 
could happen) that would break all correspondence between the token text and 
the input offsets, and _any_ manipulation of offsets by WDGF would be based on 
the false assumption of such a correspondence.
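The ambiguity above can be shown without Lucene at all. In this hypothetical sketch (the offsets, the token text, and the extra-space filter are all invented for illustration), a trim step can only count the whitespace it actually sees, and so over-shifts the offset whenever an earlier filter added whitespace that was never in the input:

```java
// Hypothetical sketch: why a trim step cannot infer the right offset shift.
// Suppose the input at offsets [4,7) is " bb" (one leading space), but an
// earlier filter prepended a second space, so the trim step sees "  bb".
public class TrimOffsetAmbiguity {
    public static void main(String[] args) {
        String original = " bb";   // what the offsets actually point at
        String seen     = "  bb";  // token text after the hypothetical earlier filter
        int startOffset = 4;

        // The only thing the trim step can do is count the whitespace it strips:
        int stripped   = seen.length() - seen.stripLeading().length();         // 2
        int naiveStart = startOffset + stripped;                               // 6 (wrong)

        // The correct shift counts whitespace in the *original* input,
        // which the trim step never sees:
        int trueStripped = original.length() - original.stripLeading().length(); // 1
        int correctStart = startOffset + trueStripped;                           // 5

        System.out.println(naiveStart + " vs " + correctStart); // prints "6 vs 5"
    }
}
```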
  
 I think that makes me also +1 for [~romseygeek]'s suggestion.
  
 While we're at it though, thinking ahead a little more about "figure out how 
to do it correctly", I can think of only two possibilities, each requiring an 
extra {{Attribute}}, and one of the possibilities is probably crazy:
  
 The crazy idea: have an {{Attribute}} that maps each input character offset to 
a corresponding character position in the token text ... but actually I don't 
think that would even work, so never mind.
  
 The not-so-crazy idea: have a boolean {{Attribute}} that tracks whether there 
is a 1:1 correspondence between the input offsets and the token text. Any 
{{TokenFilter}} doing the kind of manipulation that _would_ affect offsets 
could check for the presence of this {{Attribute}} (which would default to 
false), and iff present and true, could update offsets. I think that should be 
robust, and could leave the behavior of a lot of existing configurations 
unchanged (since {{TrimFilter}}, WDGF, and the like are often applied early in 
the analysis chain); this would also potentially avoid the need to modify some 
existing tests for highlighting, etc. (including potential tests of 
highlighting in downstream systems).
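To make the not-so-crazy idea concrete, here is a minimal plain-Java sketch; none of these names exist in Lucene ({{OffsetCorrespondenceAttribute}} and {{trimmedStartOffset}} are invented for illustration), and the real thing would of course be wired into the {{AttributeSource}} machinery:

```java
// Hypothetical sketch of the "not-so-crazy" idea; all names are invented
// for illustration and do not exist in Lucene.
public class CorrespondenceSketch {
    // Stand-in for a boolean Attribute that defaults to false.
    static class OffsetCorrespondenceAttribute {
        private boolean correspondent = false;
        boolean isCorrespondent() { return correspondent; }
        void setCorrespondent(boolean v) { correspondent = v; }
    }

    // What a trim-like filter would do with the attribute available:
    // only shift offsets when text and offsets are known to correspond 1:1.
    static int trimmedStartOffset(String token, int start,
                                  OffsetCorrespondenceAttribute corr) {
        int stripped = token.length() - token.stripLeading().length();
        return corr.isCorrespondent() ? start + stripped : start;
    }

    public static void main(String[] args) {
        OffsetCorrespondenceAttribute corr = new OffsetCorrespondenceAttribute();
        // Default false: leave offsets alone, as today.
        System.out.println(trimmedStartOffset(" bb", 1, corr)); // prints 1
        // Tokenizer set it and no filter has broken the correspondence:
        corr.setCorrespondent(true);
        System.out.println(trimmedStartOffset(" bb", 1, corr)); // prints 2
    }
}
```

A filter like {{SynonymGraphFilter}} would set the flag to false on tokens it rewrites, so any later offset-adjusting filter degrades safely to today's leave-offsets-alone behavior.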

> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can 
> produce backwards offsets
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here: 
> https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the 
> beginning of the second token).  The WDGF takes the first token and splits it 
> into two, adjusting the offsets of the second token, so we get "a"[0,1] and 
> "b"[2,3].  The trim filter removes the leading space from the second token, 
> leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space 
> has already been stripped, WDGF sees no need to adjust offsets, and emits the 
> token as-is, resulting in the start offsets of the tokenstream being [0, 2, 
> 1], and the IndexWriter rejecting it.
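The walkthrough in the issue description can be simulated without Lucene. In this sketch, {{trim()}} and {{wdgf()}} are deliberately simplified stand-ins for {{TrimFilter}} and {{WordDelimiterGraphFilter}}, just detailed enough to reproduce the backwards start offsets [0, 2, 1]:

```java
import java.util.ArrayList;
import java.util.List;

public class BackwardsOffsets {
    record Token(String text, int start, int end) {}

    // Stand-in for TrimFilter: strips whitespace but leaves offsets alone.
    static Token trim(Token t) {
        return new Token(t.text().strip(), t.start(), t.end());
    }

    // Highly simplified stand-in for WDGF: split on spaces, deriving each
    // part's offsets from its position in the token text -- which assumes
    // the text still corresponds to the input offsets. A token with no
    // space to split on is passed through untouched.
    static List<Token> wdgf(Token t) {
        if (!t.text().contains(" ")) return List.of(t);
        List<Token> parts = new ArrayList<>();
        int i = 0;
        for (String part : t.text().split(" ")) {
            if (part.isEmpty()) continue;
            int pos = t.text().indexOf(part, i);
            parts.add(new Token(part, t.start() + pos, t.start() + pos + part.length()));
            i = pos + part.length();
        }
        return parts;
    }

    public static void main(String[] args) {
        // The two ngram tokens from the report: "a b"[0,3] and " bb"[1,4].
        List<Token> ngrams = List.of(new Token("a b", 0, 3), new Token(" bb", 1, 4));
        List<Token> out = new ArrayList<>();
        for (Token t : ngrams) out.addAll(wdgf(trim(t)));
        // Prints a[0,1], b[2,3], bb[1,4]: start offsets [0, 2, 1] go backwards.
        out.forEach(t -> System.out.println(t.text() + "[" + t.start() + "," + t.end() + "]"));
    }
}
```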



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
