[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667406#comment-16667406 ]

Michael Gibney commented on LUCENE-8509:
----------------------------------------

I'd echo [~dsmiley]'s comment over at LUCENE-8516 – "I don't see the big deal 
in a token filter doing tokenization. I see it has certain challenges but don't 
think it's fundamentally wrong".

A special case of the "not-so-crazy" idea proposed above would have WDGF remain 
a {{TokenFilter}}, but require it to be configured to take input directly from 
a {{Tokenizer}} (as opposed to more general {{TokenStream}}). I think this 
would be functionally equivalent to the change proposed at LUCENE-8516. This 
special case would obviate the need for tracking whether there exists a 1:1 
correspondence between input offsets and token text, because such 
correspondence should (?) always exist immediately after the {{Tokenizer}}. 
This approach (or the slightly more general/elaborate "not-so-crazy" approach 
described above) might also address [~rcmuir]'s observation at LUCENE-8516 that 
the {{WordDelimiterTokenizer}} could be viewed as "still a tokenfilter in 
disguise".

As a side note, the configuration referenced in this issue's title and 
description doesn't illustrate the more general problem particularly well, 
because the problem with this specific configuration could be addressed 
equally well by having {{TrimFilter}} update offsets, or (I think with no 
effect on intended behavior) by simply reordering the filters so that 
{{TrimFilter}} comes after WDGF.
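
For illustration, the offset arithmetic from the issue description below can 
be simulated with plain stand-in types (a hypothetical sketch, not Lucene 
code): ngram tokens "a b"[0,3] and " bb"[1,4], trimmed without offset updates, 
then split on whitespace:

```java
import java.util.*;

class OffsetSketch {
    record Token(String text, int start, int end) {}

    // TrimFilter analogue: strips surrounding whitespace, offsets untouched.
    static Token trim(Token t) {
        return new Token(t.text().strip(), t.start(), t.end());
    }

    // WDGF analogue: a token with no delimiter is emitted as-is (offsets
    // unchanged); otherwise each part's offsets are computed from the
    // already-trimmed token text.
    static List<Token> split(Token t) {
        if (!t.text().contains(" ")) return List.of(t);
        List<Token> out = new ArrayList<>();
        int pos = 0;
        for (String part : t.text().split(" ")) {
            if (part.isEmpty()) continue;
            int s = t.text().indexOf(part, pos);
            out.add(new Token(part, t.start() + s, t.start() + s + part.length()));
            pos = s + part.length();
        }
        return out;
    }

    static List<Integer> startOffsets() {
        // Ngram output from the issue description (note " bb" keeps its space).
        List<Token> ngrams = List.of(new Token("a b", 0, 3), new Token(" bb", 1, 4));
        List<Integer> starts = new ArrayList<>();
        for (Token t : ngrams)
            for (Token s : split(trim(t)))
                starts.add(s.start());
        return starts;  // [0, 2, 1]: the start offsets go backwards
    }

    public static void main(String[] args) {
        System.out.println(startOffsets());
    }
}
```

Swapping the trim and split steps in this simulation (so trimming happens 
after the whitespace split) would leave " bb" intact when offsets are 
assigned, which is why the reordering mentioned above sidesteps the problem.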

> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can 
> produce backwards offsets
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here: 
> https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the 
> beginning of the second token).  The WDGF takes the first token and splits it 
> into two, adjusting the offsets of the second token, so we get "a"[0,1] and 
> "b"[2,3].  The trim filter removes the leading space from the second token, 
> leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space 
> has already been stripped, WDGF sees no need to adjust offsets, and emits the 
> token as-is, resulting in the start offsets of the tokenstream being [0, 2, 
> 1], and the IndexWriter rejecting it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
