[ 
https://issues.apache.org/jira/browse/LUCENE-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952099#comment-16952099
 ] 

David Wayne Smiley commented on LUCENE-9006:
--------------------------------------------

I explored two ways to fix this.
* The PR simply modifies the sort of the buffered tokens to also 
consider the char length as calculated from the offsets.  Unfortunately, 
checkAnalysisConsistency doesn't like this when it's asked to check the offsets.
* Another approach is to increment wordPos on the first internal candidate 
token even if the configuration says it doesn't need to be generated.  That 
worked, but I saw that it changed the graph, and I wasn't sure this issue is 
worth changing the graph over.  It's also a fairly rare config so, I dunno.
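
To make the first approach concrete, here is a minimal, self-contained sketch of the tie-break idea. It is not the Lucene code or the PR: the Part class, both comparators, and the assumed buffering order ("other" before "8other") are hypothetical stand-ins. It assumes the existing sort orders by start position and then by longest position span, and it uses the "8-other" input from the issue description.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for WDGF's buffered parts; not Lucene's actual classes.
final class CatenateAllSortSketch {

  static final class Part {
    final String term;
    final int startPos, endPos;       // internal word positions
    final int startOffset, endOffset; // char offsets into the input

    Part(String term, int startPos, int endPos, int startOffset, int endOffset) {
      this.term = term;
      this.startPos = startPos;
      this.endPos = endPos;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }

    int charLength() {
      return endOffset - startOffset;
    }
  }

  // Assumed existing order: smaller start position first, then longest position span.
  static final Comparator<Part> BY_POSITION = (a, b) -> {
    int cmp = Integer.compare(a.startPos, b.startPos);
    if (cmp != 0) return cmp;
    return Integer.compare(b.endPos, a.endPos);
  };

  // Sketch of the extra tie-break: on equal positions, longest char length first.
  static final Comparator<Part> BY_POSITION_THEN_LENGTH = (a, b) -> {
    int cmp = BY_POSITION.compare(a, b);
    if (cmp != 0) return cmp;
    return Integer.compare(b.charLength(), a.charLength());
  };

  // Input "8-other" with generateWordParts and catenateAll only. The '8' is seen
  // but not emitted, so wordPos has not advanced and both parts share positions.
  static List<Part> bufferedParts() {
    List<Part> parts = new ArrayList<>();
    parts.add(new Part("other", 0, 1, 2, 7));  // word part (assumed buffered first)
    parts.add(new Part("8other", 0, 1, 0, 7)); // catenateAll
    return parts;
  }

  static List<String> emitOrder(Comparator<Part> order) {
    List<Part> parts = bufferedParts();
    parts.sort(order); // List.sort is stable, so ties keep buffering order
    List<String> terms = new ArrayList<>();
    for (Part p : parts) {
      terms.add(p.term);
    }
    return terms;
  }

  public static void main(String[] args) {
    System.out.println(emitOrder(BY_POSITION));             // [other, 8other]
    System.out.println(emitOrder(BY_POSITION_THEN_LENGTH)); // [8other, other]
  }
}
```

With positions alone the two parts tie and the buffering order decides; the char-length tie-break puts the catenated token first. This sketch says nothing about how the resulting offsets order interacts with checkAnalysisConsistency.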

> Ensure WordDelimiterGraphFilter always emits catenateAll token early
> --------------------------------------------------------------------
>
>                 Key: LUCENE-9006
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9006
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Wayne Smiley
>            Assignee: David Wayne Smiley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ideally, the first token of WDGF is the preserveOriginal (if configured to 
> emit), and the second should be the catenateAll (if configured to emit).  The 
> deprecated WDF does this but WDGF can sometimes put the first other token 
> earlier when there is a non-emitted candidate sub-token.
> Example: input "8-other" when only generateWordParts and catenateAll are 
> enabled -- *not* generateNumberParts.  WDGF internally sees the '8' but moves 
> on.  Ultimately, the "other" token and the catenated "8other" appear at the 
> same internal position, which by luck fools the sorter into emitting "other" 
> first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
