Alan Woodward created LUCENE-8509:
-------------------------------------

             Summary: NGramTokenizer, TrimFilter and WordDelimiterGraphFilter 
in combination can produce backwards offsets
                 Key: LUCENE-8509
                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
             Project: Lucene - Core
          Issue Type: Task
            Reporter: Alan Woodward
            Assignee: Alan Woodward


Discovered by an Elasticsearch user and described here: 
https://github.com/elastic/elasticsearch/issues/33710

The ngram tokenizer produces the tokens "a b"[0,3] and " bb"[1,4] (note the 
space at the beginning of the second token).  The trim filter removes that 
leading space but leaves the offsets unchanged, so WDGF receives "a b"[0,3] 
and "bb"[1,4].  WDGF splits the first token in two and adjusts the offsets of 
the parts, giving "a"[0,1] and "b"[2,3].  Because the leading space of the 
second token has already been stripped, WDGF sees no need to adjust its 
offsets and emits it as-is.  The start offsets of the resulting tokenstream 
are therefore [0, 2, 1], and the IndexWriter rejects it.
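
The offset arithmetic above can be traced with a small stand-alone model. 
This is only a sketch in plain Python, not Lucene code: the input text "a bb" 
and the gram size of 3 are assumptions inferred from the offsets quoted above, 
and split_words is a hypothetical toy stand-in for WDGF that splits on spaces 
only.  The backwards check is a simplified model of the condition under which 
the IndexWriter rejects the stream.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # start offset into the original text
    end: int    # end offset (exclusive)


def ngrams(text: str, n: int = 3) -> List[Token]:
    """Fixed-size character n-grams with offsets (gram size 3 is an assumption)."""
    return [Token(text[i:i + n], i, i + n) for i in range(len(text) - n + 1)]


def trim(tok: Token) -> Token:
    """Strip surrounding whitespace but leave the offsets unchanged."""
    return Token(tok.text.strip(), tok.start, tok.end)


def split_words(tok: Token) -> List[Token]:
    """Toy stand-in for WDGF: split on spaces and recompute part offsets.

    A token that needs no splitting is passed through with its offsets as-is.
    """
    if " " not in tok.text:
        return [tok]
    parts, pos = [], 0
    for word in tok.text.split(" "):
        if word:
            parts.append(Token(word, tok.start + pos, tok.start + pos + len(word)))
        pos += len(word) + 1  # skip the word and the space after it
    return parts


def analyze(text: str, use_trim: bool = True) -> List[Token]:
    """Run ngrams -> (optional) trim -> split_words over the text."""
    out: List[Token] = []
    for tok in ngrams(text):
        if use_trim:
            tok = trim(tok)
        out.extend(split_words(tok))
    return out


def offsets_go_backwards(tokens: List[Token]) -> bool:
    """True if any start offset is smaller than the one before it."""
    starts = [t.start for t in tokens]
    return any(b < a for a, b in zip(starts, starts[1:]))
```

In this model, analyze("a bb") yields the start offsets [0, 2, 1], so 
offsets_go_backwards returns True.  With use_trim=False the toy splitter still 
sees the leading space of " bb" and adjusts the start offset to 2, which keeps 
the sequence non-decreasing; that second case describes the model only and 
makes no claim about what the real filter emits.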



