[ 
https://issues.apache.org/jira/browse/LUCENE-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952867#comment-16952867
 ] 

David Smiley commented on LUCENE-9006:
--------------------------------------

BTW this issue also fixes a bug in the offsets.  The previous behavior gave 
the token "8other" a start offset of 2 because it followed the token "other", 
whose start offset is, and should be, 2.  Now that "8other" is emitted 
earlier, it can have the start offset it should -- 0.

I was thinking about the core of the change here, which makes the sort 
consider the offset-based length.  I think it's simpler/faster and perhaps 
more correct to just use the start offset.  That variant passes the tests, so 
I'm inclined to push it.
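
To illustrate the tie-break idea, here is a minimal, self-contained sketch. It is not the actual WordDelimiterGraphFilter sorter code: the `Part` class, the comparator name, and the assumption that both candidate tokens sit at internal position 0 are hypothetical and exist only for this example. It shows that when two parts share a position, comparing start offsets places "8other" (start offset 0) ahead of "other" (start offset 2).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OffsetSortSketch {

    /** Hypothetical stand-in for a buffered sub-token; not a Lucene class. */
    static final class Part {
        final String term;
        final int position;     // internal position (assumed equal for both parts)
        final int startOffset;
        final int endOffset;

        Part(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Sort by position first; break ties with the smaller start offset. */
    static final Comparator<Part> BY_POSITION_THEN_START_OFFSET =
        Comparator.comparingInt((Part p) -> p.position)
                  .thenComparingInt(p -> p.startOffset);

    /** Returns the terms in the order the sketch comparator would emit them. */
    static List<String> sortedTerms(List<Part> parts) {
        List<Part> copy = new ArrayList<>(parts);
        copy.sort(BY_POSITION_THEN_START_OFFSET);
        List<String> terms = new ArrayList<>();
        for (Part p : copy) {
            terms.add(p.term);
        }
        return terms;
    }

    /** Parts for the input "8-other": "other" spans offsets 2-7, "8other" spans 0-7. */
    static List<Part> exampleParts() {
        List<Part> parts = new ArrayList<>();
        parts.add(new Part("other", 0, 2, 7));   // buffered first, as in the buggy ordering
        parts.add(new Part("8other", 0, 0, 7));  // catenateAll token
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(sortedTerms(exampleParts()));
    }
}
```

With these assumed inputs the catenated token sorts first, which is the ordering the issue asks for.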

> Ensure WordDelimiterGraphFilter always emits catenateAll token early
> --------------------------------------------------------------------
>
>                 Key: LUCENE-9006
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9006
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ideally, the first token of WDGF is the preserveOriginal (if configured to 
> emit), and the second should be the catenateAll (if configured to emit).  The 
> deprecated WDF does this but WDGF can sometimes put the first other token 
> earlier when there is a non-emitted candidate sub-token.
> Example: input "8-other" when only generateWordParts and catenateAll are 
> set -- *not* generateNumberParts.  WDGF internally sees the '8' but moves 
> on.  Ultimately, the "other" token and the catenated "8other" will appear at 
> the same internal position, which by luck fools the sorter into emitting 
> "other" first.



