[ 
https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663115#comment-16663115
 ] 

Michael Gibney commented on LUCENE-8509:
----------------------------------------

> The trim filter removes the leading space from the second token, leaving 
> offsets unchanged, so WDGF sees "bb"[1,4]; 

If I understand [~dsmiley] correctly, then to put it another way: doesn't this 
look more like an issue with {{TrimFilter}}? If WDGF receives "bb"[1,4] from 
{{TrimFilter}} (instead of " bb"[1,4] or "bb"[2,4]), then WDGF is handling its 
input correctly, but the input itself is wrong.
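
To make that concrete, here is a rough sketch in plain Python. It is not Lucene 
code: {{Token}}, {{trim_keep_offsets}}, {{trim_fix_offsets}} and {{split_words}} 
are hypothetical stand-ins for the filters in the issue description, and I am 
assuming the input text is "a bb" and that the trim step runs before the 
word-delimiter step.

```python
import re
from typing import Callable, List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # start offset into the original input
    end: int    # end offset into the original input


def trim_keep_offsets(tok: Token) -> Token:
    # Stand-in for the behavior described in the issue: the text is trimmed
    # but the offsets still cover the untrimmed span.
    return Token(tok.text.strip(), tok.start, tok.end)


def trim_fix_offsets(tok: Token) -> Token:
    # Hypothetical alternative: shift the offsets by the amount trimmed.
    # Assumes the incoming offsets span exactly len(tok.text) characters.
    lead = len(tok.text) - len(tok.text.lstrip())
    stripped = tok.text.strip()
    start = tok.start + lead
    return Token(stripped, start, start + len(stripped))


def split_words(tok: Token) -> List[Token]:
    # Simplified word-delimiter stand-in: split on whitespace and derive
    # sub-offsets only when the term length matches the offset span.
    if len(tok.text) != tok.end - tok.start:
        return [Token(part, tok.start, tok.end) for part in tok.text.split()]
    return [
        Token(m.group(), tok.start + m.start(), tok.start + m.end())
        for m in re.finditer(r"\S+", tok.text)
    ]


def run(trim: Callable[[Token], Token]) -> List[int]:
    # The two n-grams from the issue description, over the assumed input "a bb".
    ngrams = [Token("a b", 0, 3), Token(" bb", 1, 4)]
    out: List[Token] = []
    for tok in ngrams:
        out.extend(split_words(trim(tok)))
    return [t.start for t in out]


print(run(trim_keep_offsets))  # [0, 2, 1] -> start offsets go backwards
print(run(trim_fix_offsets))   # [0, 2, 2] -> start offsets never decrease
```

In this sketch the splitting step is identical in both runs; only the trim 
step changes, which is why the trim step looks like the place to fix.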

"because tokenization splits offsets and WDGF is playing the role of a 
tokenizer" -- this behavior is notably different from what 
{{SynonymGraphFilter}} does (adding externally-specified alternate 
representations of input tokens). Offsets are really only meaningful with 
respect to input, and new tokens introduced by WDGF are directly derived from 
input, while new tokens introduced by {{SynonymGraphFilter}} are not and thus 
can _only_ inherit offsets of the input token.
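
A small sketch of that distinction, again in plain Python rather than Lucene 
code. The names {{split_with_derived_offsets}} and {{add_synonyms}}, the "-" 
delimiter and the "wi-fi" example are all made up for illustration; the only 
point is where each kind of new token gets its offsets from.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # start offset into the original input
    end: int    # end offset into the original input


def split_with_derived_offsets(tok: Token, delimiter: str = "-") -> List[Token]:
    # Sub-tokens are substrings of the input token, so each one can be
    # given the offsets of the slice it came from.
    parts: List[Token] = []
    pos = 0
    for piece in tok.text.split(delimiter):
        if piece:
            begin = tok.start + pos
            parts.append(Token(piece, begin, begin + len(piece)))
        pos += len(piece) + len(delimiter)
    return parts


def add_synonyms(tok: Token, synonyms: Dict[str, List[str]]) -> List[Token]:
    # A synonym does not appear anywhere in the input, so the only
    # meaningful offsets it can have are those of the token it stands for.
    alternates = [Token(s, tok.start, tok.end) for s in synonyms.get(tok.text, [])]
    return [tok] + alternates


source = Token("wi-fi", 0, 5)
print(split_with_derived_offsets(source))
print(add_synonyms(source, {"wi-fi": ["wireless"]}))
```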

> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can 
> produce backwards offsets
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here: 
> https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the 
> beginning of the second token).  The WDGF takes the first token and splits it 
> into two, adjusting the offsets of the second token, so we get "a"[0,1] and 
> "b"[2,3].  The trim filter removes the leading space from the second token, 
> leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space 
> has already been stripped, WDGF sees no need to adjust offsets, and emits the 
> token as-is, resulting in the start offsets of the tokenstream being [0, 2, 
> 1], and the IndexWriter rejecting it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
