[
https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662159#comment-16662159
]
Alan Woodward commented on LUCENE-8509:
---------------------------------------
Here is a patch removing the offset-adjustment logic from
WordDelimiterGraphFilter (WDGF). All subtokens emitted by the filter now have
the same offsets as their parent token.
The downstream consequence is that entire tokens will be highlighted (e.g., if
you search for 'wi' then the whole token 'wi-fi' will get highlighted). I
think this is a reasonable trade-off, though. It also brings things more into
line with the behaviour of SynonymGraphFilter.
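
The highlighting consequence can be illustrated with a small Python sketch.
This is a toy model, not Lucene code: Token, split_keep_parent_offsets and
highlight are hypothetical names, and splitting on '-' merely stands in for
the filter's real splitting rules.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    start: int  # start offset into the original text
    end: int    # end offset (exclusive)


def split_keep_parent_offsets(parent: Token) -> List[Token]:
    """Toy stand-in for the patched behaviour: split on '-' and give
    every subtoken the same offsets as its parent token."""
    parts = [p for p in parent.term.split("-") if p]
    return [Token(p, parent.start, parent.end) for p in parts]


def highlight(text: str, token: Token) -> str:
    """Wrap the span covered by the token's offsets in brackets."""
    return text[:token.start] + "[" + text[token.start:token.end] + "]" + text[token.end:]


text = "free wi-fi here"
parent = Token("wi-fi", 5, 10)
subtokens = split_keep_parent_offsets(parent)

# A search for 'wi' matches the first subtoken, whose offsets span the parent.
match = next(t for t in subtokens if t.term == "wi")
print(highlight(text, match))  # free [wi-fi] here
```

Because the matching subtoken carries the parent's offsets, the highlighter
in this sketch has no way to mark only 'wi'; it marks the whole 'wi-fi'.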
> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can
> produce backwards offsets
> ----------------------------------------------------------------------------------------------------
>
> Key: LUCENE-8509
> URL: https://issues.apache.org/jira/browse/LUCENE-8509
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Attachments: LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here:
> https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the
> beginning of the second token). The WDGF takes the first token and splits it
> into two, adjusting the offsets of the second subtoken, so we get "a"[0,1] and
> "b"[2,3]. The trim filter removes the leading space from the second token but
> leaves its offsets unchanged, so WDGF sees "bb"[1,4]. Because the leading space
> has already been stripped, WDGF sees no need to adjust offsets and emits the
> token as-is. The start offsets of the token stream are therefore [0, 2, 1],
> which go backwards, and the IndexWriter rejects it.
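
The offset sequence from the description can be checked with a short Python
sketch. It is a toy check, not Lucene's validation code:
first_backwards_offset is a hypothetical helper that models only the rule
that start offsets must not decrease from one token to the next. The
"patched" stream assumes the parent token "a b" has offsets [0,3], which is
inferred from the subtoken offsets quoted above.

```python
from typing import List, Optional, Tuple

TokenTuple = Tuple[str, int, int]  # (term, start_offset, end_offset)

# Tokens in the order described in the issue.
stream: List[TokenTuple] = [("a", 0, 1), ("b", 2, 3), ("bb", 1, 4)]

# Same tokens if every subtoken simply inherits its parent's offsets
# (assumption: the parent "a b" covers [0,3]).
patched: List[TokenTuple] = [("a", 0, 3), ("b", 0, 3), ("bb", 1, 4)]


def first_backwards_offset(tokens: List[TokenTuple]) -> Optional[int]:
    """Return the index of the first token whose start offset is smaller
    than the previous token's start offset, or None if there is none."""
    last_start = 0
    for i, (_term, start, _end) in enumerate(tokens):
        if start < last_start:
            return i
        last_start = start
    return None


print([start for _term, start, _end in stream])  # [0, 2, 1]
print(first_backwards_offset(stream))            # 2
print(first_backwards_offset(patched))           # None
```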