Alan Woodward created LUCENE-8509:
-------------------------------------
Summary: NGramTokenizer, TrimFilter and WordDelimiterGraphFilter
in combination can produce backwards offsets
Key: LUCENE-8509
URL: https://issues.apache.org/jira/browse/LUCENE-8509
Project: Lucene - Core
Issue Type: Task
Reporter: Alan Woodward
Assignee: Alan Woodward
Discovered by an Elasticsearch user and described here:
https://github.com/elastic/elasticsearch/issues/33710
The ngram tokenizer produces the tokens "a b" and " bb" (note the leading space
in the second token). WDGF splits the first token in two, adjusting the offsets
of the second part, so we get "a"[0,1] and "b"[2,3]. The trim filter removes
the leading space from the second token but leaves its offsets unchanged, so
WDGF sees "bb"[1,4]. Because the leading space has already been stripped, WDGF
sees no need to adjust offsets and emits the token as-is. The start offsets of
the resulting token stream are therefore [0, 2, 1], which go backwards, and
IndexWriter rejects the stream.
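
The sequence above can be modelled without Lucene. The following is a
minimal sketch in plain Java, not the real analysis classes: Token, trim,
splitOnSpaces, run and firstBackwardsIndex are hypothetical names, the
splitting rule is a simplification of what WDGF does, and the ngram offsets
"a b"[0,3] and " bb"[1,4] are inferred from the description rather than
taken from a real run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the token chain. These are NOT Lucene's classes.
final class BackwardsOffsetsSketch {

    // A term with its start and end character offsets.
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Models the trim step: strips whitespace, leaves offsets unchanged.
    static Token trim(Token t) {
        return new Token(t.term.trim(), t.start, t.end);
    }

    // Models the split step (assumption: split on spaces only). A term that
    // is a single part is emitted as-is; otherwise each part gets offsets
    // computed from its position inside the term.
    static List<Token> splitOnSpaces(Token t) {
        List<Token> out = new ArrayList<>();
        String term = t.term;
        int i = 0;
        while (i < term.length()) {
            while (i < term.length() && term.charAt(i) == ' ') i++;
            int partStart = i;
            while (i < term.length() && term.charAt(i) != ' ') i++;
            if (i > partStart) {
                if (partStart == 0 && i == term.length()) {
                    out.add(t); // nothing to split: offsets are not adjusted
                } else {
                    out.add(new Token(term.substring(partStart, i),
                                      t.start + partStart, t.start + i));
                }
            }
        }
        return out;
    }

    // Runs trim, then split, over every incoming token.
    static List<Token> run(List<Token> ngrams) {
        List<Token> out = new ArrayList<>();
        for (Token t : ngrams) {
            out.addAll(splitOnSpaces(trim(t)));
        }
        return out;
    }

    // Index of the first token whose start offset is smaller than the
    // previous token's start offset, or -1 if offsets never go backwards.
    static int firstBackwardsIndex(List<Token> tokens) {
        int last = 0;
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).start < last) return i;
            last = tokens.get(i).start;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Token> out = run(Arrays.asList(
                new Token("a b", 0, 3),    // inferred ngram offsets
                new Token(" bb", 1, 4)));
        for (Token t : out) {
            System.out.println("\"" + t.term + "\"[" + t.start + "," + t.end + "]");
        }
        System.out.println("first backwards index: " + firstBackwardsIndex(out));
    }
}
```

In this model the third token, "bb"[1,4], starts before the second one,
"b"[2,3], which is the ordering violation described above.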