[
https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662897#comment-16662897
]
David Smiley commented on LUCENE-8509:
--------------------------------------
bq. The trim filter removes the leading space from the second token, leaving
offsets unchanged
That sounds fishy though; shouldn't they be trivially adjusted?
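To make the question concrete, here is a minimal Python sketch (not Lucene code) of a trim step that moves the offsets inward by the amount of whitespace it strips. The `Token` tuple and `trim_with_offsets` are hypothetical names used only for illustration. The sketch assumes one offset unit per character and falls back to leaving the offsets alone when the term length no longer matches the offset span, since in that case an earlier filter has already changed the term.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    start: int  # start offset in the original text
    end: int    # end offset in the original text (exclusive)


def trim_with_offsets(tok: Token) -> Token:
    """Strip surrounding whitespace and shift the offsets to match."""
    stripped = tok.term.strip()
    if len(tok.term) != tok.end - tok.start:
        # Term and offsets are already out of sync; do not guess.
        return Token(stripped, tok.start, tok.end)
    leading = len(tok.term) - len(tok.term.lstrip())
    new_start = tok.start + leading
    return Token(stripped, new_start, new_start + len(stripped))


# The second ngram from the issue description: " bb" at [1,4].
adjusted = trim_with_offsets(Token(" bb", 1, 4))
```

With this adjustment the second token would become "bb"[2,4] rather than "bb"[1,4], so its start offset would no longer fall behind the "b"[2,3] token emitted before it.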
I'm skeptical that your proposal for WDGF is an improvement, because
tokenization splits offsets and WDGF is playing the role of a tokenizer here.
Perhaps your proposal could be a new option, possibly even one that defaults
to the behavior you want? We could solicit feedback/input, noting that the
ability to toggle it may go away. The option's default setting should probably
be Version-dependent.
> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can
> produce backwards offsets
> ----------------------------------------------------------------------------------------------------
>
> Key: LUCENE-8509
> URL: https://issues.apache.org/jira/browse/LUCENE-8509
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Attachments: LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here:
> https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the
> beginning of the second token). The WDGF takes the first token and splits it
> into two, adjusting the offsets of the second token, so we get "a"[0,1] and
> "b"[2,3]. The trim filter removes the leading space from the second token,
> leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space
> has already been stripped, WDGF sees no need to adjust offsets, and emits the
> token as-is, resulting in the start offsets of the tokenstream being [0, 2,
> 1], and the IndexWriter rejecting it.
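The sequence described above can be reproduced with a small Python simulation. This is a simplified model, not Lucene's implementation: `Token`, `trim_keep_offsets`, `split_on_delimiters`, and `first_backwards_offset` are hypothetical names, the splitter only breaks on non-word characters, and the final check stands in for the start-offset ordering that IndexWriter enforces.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    start: int
    end: int


def trim_keep_offsets(tok: Token) -> Token:
    # Models the trim step as described: the term changes, the offsets do not.
    return Token(tok.term.strip(), tok.start, tok.end)


def split_on_delimiters(tok: Token) -> List[Token]:
    # Simplified stand-in for WDGF: split the term into runs of word characters.
    parts = list(re.finditer(r"\w+", tok.term))
    if len(parts) == 1 and parts[0].start() == 0 and parts[0].end() == len(tok.term):
        return [tok]  # nothing to split, so the token is emitted as-is
    return [Token(m.group(), tok.start + m.start(), tok.start + m.end()) for m in parts]


def first_backwards_offset(tokens: List[Token]) -> int:
    # Index of the first token whose start offset precedes the previous one, or -1.
    for i in range(1, len(tokens)):
        if tokens[i].start < tokens[i - 1].start:
            return i
    return -1


# The two ngrams from the description: "a b"[0,3] and " bb"[1,4].
ngrams = [Token("a b", 0, 3), Token(" bb", 1, 4)]
stream = [out for tok in ngrams for out in split_on_delimiters(trim_keep_offsets(tok))]
start_offsets = [t.start for t in stream]  # [0, 2, 1]
bad_index = first_backwards_offset(stream)  # 2
```

In this model the third token, "bb"[1,4], starts before the second token, "b"[2,3], which is the backwards offset the issue reports.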