[ 
https://issues.apache.org/jira/browse/LUCENE-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952799#comment-16952799
 ] 

David Wayne Smiley commented on LUCENE-9006:
--------------------------------------------

Thanks for the explanation RE graphOffsetsAreCorrect.  I guess there is no new 
concern here with the PR then.

I discovered this problem due to a custom filter that directly collaborates 
with a delegated WDGF instance.  It assumes the first two tokens are the 
preserveOriginal token, then the catenateAll token.  This was the case with 
the now deprecated WDF.  That order is intuitive too, so the output "looks" 
odd when it doesn't happen.  I noticed in LUCENE-8730 a precedent for making 
the token orderings consistent, which makes sense to me.

> Ensure WordDelimiterGraphFilter always emits catenateAll token early
> --------------------------------------------------------------------
>
>                 Key: LUCENE-9006
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9006
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Wayne Smiley
>            Assignee: David Wayne Smiley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ideally, the first token of WDGF is the preserveOriginal (if configured to 
> emit), and the second should be the catenateAll (if configured to emit).  The 
> deprecated WDF does this but WDGF can sometimes put the first other token 
> earlier when there is a non-emitted candidate sub-token.
> Example: input "8-other" with only generateWordParts and catenateAll 
> enabled -- *not* generateNumberParts.  WDGF internally sees the '8' but 
> moves on.  Ultimately, the "other" token and the catenated "8other" will 
> appear at the same internal position, which by luck fools the sorter into 
> emitting "other" first.
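
To make the "8-other" tie concrete, here is a rough toy model in plain Java.
It is not Lucene code: the Part class, the two comparators, the buffer order,
and the start/end positions are all assumptions made up for illustration. It
only shows how two buffered tokens with the same span come out of a stable
sort in buffer order, and how an extra tiebreak (one possible approach, not
necessarily what the PR does) would put the catenateAll token first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CatenateAllOrderSketch {

    // Hypothetical stand-in for a buffered sub-token; not a Lucene class.
    static final class Part {
        final String term;
        final int startPos;
        final int endPos;
        final boolean catenateAll;

        Part(String term, int startPos, int endPos, boolean catenateAll) {
            this.term = term;
            this.startPos = startPos;
            this.endPos = endPos;
            this.catenateAll = catenateAll;
        }
    }

    // Assumed ordering: start position ascending, then longer spans first.
    static final Comparator<Part> BY_POSITION =
        Comparator.<Part>comparingInt(p -> p.startPos)
            .thenComparing(Comparator.<Part>comparingInt(p -> p.endPos).reversed());

    // Same ordering plus a tiebreak that prefers the catenateAll token.
    static final Comparator<Part> CATENATE_ALL_FIRST =
        BY_POSITION.thenComparingInt(p -> p.catenateAll ? 0 : 1);

    static List<String> order(Comparator<Part> comparator) {
        List<Part> buffer = new ArrayList<>();
        // '8' is seen but not emitted (no generateNumberParts), so in this
        // model the position does not advance and "other" starts at 0.
        buffer.add(new Part("other", 0, 1, false));
        // catenateAll covers the whole input and lands on the same span.
        buffer.add(new Part("8other", 0, 1, true));
        buffer.sort(comparator); // stable sort: ties keep buffer order

        List<String> terms = new ArrayList<>();
        for (Part p : buffer) {
            terms.add(p.term);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(order(BY_POSITION));        // [other, 8other]
        System.out.println(order(CATENATE_ALL_FIRST)); // [8other, other]
    }
}
```

With the position-only comparator the two parts compare as equal, so whichever
was buffered first wins; the tiebreak removes that dependence on buffer order.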



