[ 
https://issues.apache.org/jira/browse/LUCENE-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952770#comment-16952770
 ] 

Jim Ferenczi commented on LUCENE-9006:
--------------------------------------

I don't think your change affects the fact that we cannot set 
graphOffsetsAreCorrect when writing a test using the WDGF. Your test should 
fail the same way with graphOffsetsAreCorrect if you don't reorder the terms in 
the output. The other tests for the WDGF set this flag to false. I also wonder 
why you think there should be any order among the different forms that 
start at the same position. Are you relying on this order in a subsequent 
filter? Maybe we could mark the alternatives with a specific type, as is done 
for synonyms? That way it would be easier to differentiate a splitting 
path from the original token.
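
A minimal sketch of that type-marking idea, written as a plain-Java model 
rather than Lucene code. The Token class, the markAlternatives and 
isAlternative helpers, and the "WDGF_ALTERNATIVE" label are all hypothetical 
names invented for illustration; none of them exists in Lucene. The point is 
only that a later consumer could tell the original token from the split or 
catenated forms by type, regardless of the order in which tokens at the same 
position are emitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TypedAlternatives {
    // Hypothetical type labels; not Lucene constants.
    static final String TYPE_WORD = "word";
    static final String TYPE_ALTERNATIVE = "WDGF_ALTERNATIVE";

    // Simplified stand-in for a token: term text, start position, and type.
    static final class Token {
        final String term;
        final int position;
        final String type;

        Token(String term, int position, String type) {
            this.term = term;
            this.position = position;
            this.type = type;
        }
    }

    // Re-types every token whose term differs from the original input,
    // leaving the original token's type untouched. Order is preserved.
    static List<Token> markAlternatives(String original, List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            boolean isOriginal = t.term.equals(original);
            out.add(new Token(t.term, t.position,
                    isOriginal ? t.type : TYPE_ALTERNATIVE));
        }
        return out;
    }

    // A downstream consumer checks the type instead of relying on order.
    static boolean isAlternative(Token t) {
        return TYPE_ALTERNATIVE.equals(t.type);
    }

    public static void main(String[] args) {
        // Tokens for the input "8-other", all starting at the same position,
        // deliberately listed with "other" first.
        List<Token> tokens = Arrays.asList(
                new Token("other", 0, TYPE_WORD),
                new Token("8-other", 0, TYPE_WORD),
                new Token("8other", 0, TYPE_WORD));
        for (Token t : markAlternatives("8-other", tokens)) {
            System.out.println(t.term + " -> " + t.type);
        }
    }
}
```

With such a marker, a subsequent filter would not need the catenated or 
original form to arrive first in order to recognize it.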

> Ensure WordDelimiterGraphFilter always emits catenateAll token early
> --------------------------------------------------------------------
>
>                 Key: LUCENE-9006
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9006
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Wayne Smiley
>            Assignee: David Wayne Smiley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ideally, the first token of WDGF is the preserveOriginal (if configured to 
> emit), and the second should be the catenateAll (if configured to emit).  The 
> deprecated WDF does this but WDGF can sometimes put the first other token 
> earlier when there is a non-emitted candidate sub-token.
> Example input "8-other" when only generateWordParts and catenateAll are set -- *not* 
> generateNumberParts.  WDGF internally sees the '8' but moves on.  Ultimately, 
> the "other" token and the catenated "8other" will appear at the same internal 
> position, which by luck fools the sorter into emitting "other" first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

