[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665429#comment-16665429 ]

Mike Sokolov commented on LUCENE-8509:
--------------------------------------

[ from mailing list – sorry for the duplication ]

The current situation is that it is impossible to apply offsets correctly in a 
TokenFilter. It seems to work OK most of the time, but truly correct behavior 
relies on prior components in the chain not having altered the length of 
tokens, which some of them occasionally do. For complete correctness in this 
area, I believe there are really only two possibilities: one is to stop trying 
to provide offsets in token filters, as in this issue, and the other would be 
to add some mechanism that lets token filters access the "correct" offset. 
Well, I guess we could also try to prevent token filters from adding or 
removing characters, but that seems like a nonstarter for a lot of reasons. I 
put up a patch that allows for correct offsetting, but I think there was some 
consensus (and I am coming around to this position) that the amount of API 
change was not justified by the fairly minor benefit of accurate within-token 
highlighting.

So I am +1 to this patch.
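The failure mode described above can be stated as an invariant: a downstream filter that derives source offsets from character positions inside the term implicitly assumes that the term text still spans exactly [startOffset, endOffset). Here is a toy sketch of that assumption breaking; this is plain Python, not Lucene code, and the helper name is hypothetical:

```python
# Toy illustration (not Lucene API): mapping a character index inside a
# term back to an offset in the original input only works while the term
# text still matches the span its offsets claim to cover.

def char_index_to_offset(term_text, start_offset, end_offset, i):
    """Map index i within term_text to an offset in the original input,
    assuming term_text still spans [start_offset, end_offset)."""
    assert 0 <= i < len(term_text)
    if len(term_text) != end_offset - start_offset:
        # An upstream filter lengthened or shortened the token without a
        # way to record the change: the true offset is unrecoverable from
        # the attributes alone.
        raise ValueError("term length no longer matches its offsets")
    return start_offset + i

# Intact token: "bb" covering [2, 4) of the input "a bb" -- the mapping works.
print(char_index_to_offset("bb", 2, 4, 1))  # prints 3, the second 'b'

# After a trim-style filter strips the leading space of " bb"[1,4) but keeps
# the offsets, the invariant is broken and the mapping cannot be trusted.
try:
    char_index_to_offset("bb", 1, 4, 0)
except ValueError as e:
    print(e)
```

This is why "correct offsetting" in a TokenFilter needs either the extra mechanism mentioned above or a ban on length-changing filters upstream.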

> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can 
> produce backwards offsets
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here: 
> https://github.com/elastic/elasticsearch/issues/33710
> The analysis chain is NGramTokenizer -> TrimFilter -> 
> WordDelimiterGraphFilter.  The ngram tokenizer produces the tokens "a b"[0,3] 
> and " bb"[1,4] (note the space at the beginning of the second token).  The 
> trim filter passes the first token through unchanged, and the WDGF then 
> splits it on the internal space, adjusting the offsets of the second part, so 
> we get "a"[0,1] and "b"[2,3].  The trim filter removes the leading space from 
> the second token but leaves its offsets unchanged, so the WDGF sees 
> "bb"[1,4]; because the leading space has already been stripped, the WDGF sees 
> no delimiter and emits the token as-is.  The start offsets of the token 
> stream are therefore [0, 2, 1], which go backwards, and IndexWriter rejects 
> it.
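The offset arithmetic in that walkthrough can be reproduced with a small simulation. This is a toy model in Python, not Lucene code: `trim_filter` keeps offsets unchanged the way TrimFilter does, and `word_delimiter` only models the whitespace-splitting behavior of WDGF that matters here.

```python
# Toy simulation of the NGramTokenizer -> TrimFilter -> WDGF chain from the
# issue description.  Tokens are (text, startOffset, endOffset) tuples.

def trim_filter(tokens):
    """Strip surrounding whitespace but, like Lucene's TrimFilter,
    leave startOffset/endOffset unchanged."""
    return [(text.strip(), start, end) for text, start, end in tokens]

def word_delimiter(tokens):
    """Split each token on internal whitespace, deriving sub-token offsets
    from character positions within the (possibly already-trimmed) text.
    A token with nothing to split on is emitted with its offsets as-is."""
    out = []
    for text, start, end in tokens:
        parts = text.split()
        if len(parts) <= 1:
            out.append((text, start, end))  # no delimiter seen: pass through
            continue
        pos = 0
        for part in parts:
            idx = text.index(part, pos)
            out.append((part, start + idx, start + idx + len(part)))
            pos = idx + len(part)
    return out

# Tokens as emitted by the ngram tokenizer over "a bb":
ngrams = [("a b", 0, 3), (" bb", 1, 4)]
result = word_delimiter(trim_filter(ngrams))
print(result)  # [('a', 0, 1), ('b', 2, 3), ('bb', 1, 4)] -- starts go 0, 2, 1
```

Because the trim step discarded the leading space without any record of it, the second token's start offset stays at 1, behind the 2 already emitted, which is exactly the backwards-offset stream IndexWriter rejects.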



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
