[ https://issues.apache.org/jira/browse/LUCENE-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952867#comment-16952867 ]
David Smiley commented on LUCENE-9006:
--------------------------------------

BTW this issue also fixes a bug in the offsets. The previous behavior resulted in the token "8other" having a start offset of 2 because it followed the token "other", whose start offset is and should be 2. Now that "8other" is earlier, it can have the start offset it should -- 0.

I was thinking about the core of the change here, which makes the sort consider the offset-based length. I think it's simpler/faster and perhaps more correct to just use the start offset. This change passes the tests, so I'm inclined to push that.

> Ensure WordDelimiterGraphFilter always emits catenateAll token early
> --------------------------------------------------------------------
>
>                 Key: LUCENE-9006
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9006
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ideally, the first token of WDGF is the preserveOriginal (if configured to
> emit), and the second should be the catenateAll (if configured to emit). The
> deprecated WDF does this but WDGF can sometimes put the first other token
> earlier when there is a non-emitted candidate sub-token.
> Example input "8-other" when only generateWordParts and catenateAll are
> enabled -- *not* generateNumberParts. WDGF internally sees the '8' but moves on.
> Ultimately, the "other" token and the catenated "8other" will appear at the
> same internal position, which by luck fools the sorter into emitting "other"
> first.
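
To illustrate the sort change discussed in the comment, here is a minimal sketch in plain Java. It is not the WordDelimiterGraphFilter source: the Part class, the two comparators, the buffering order, and the longer-end-offset tie-breaker are all assumptions made for illustration. Only the "8-other" input and the start offsets (0 for "8other", 2 for "other") come from the issue itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch only; names and structure are NOT taken from Lucene.
class WdgfSortSketch {

  /** A buffered sub-token: text, character offsets, and an assumed internal position. */
  static final class Part {
    final String text;
    final int startOffset;
    final int endOffset;
    final int position;

    Part(String text, int startOffset, int endOffset, int position) {
      this.text = text;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.position = position;
    }
  }

  /** Candidates for input "8-other", in the order we assume they are buffered. */
  static List<Part> candidates() {
    List<Part> parts = new ArrayList<>();
    // '8' is seen but not emitted (no generateNumberParts), so "other" sits at position 0.
    parts.add(new Part("other", 2, 7, 0));
    // The catenateAll token spans the whole input and shares that position.
    parts.add(new Part("8other", 0, 7, 0));
    return parts;
  }

  /** Compares by internal position only; the two candidates tie. */
  static final Comparator<Part> BY_POSITION =
      Comparator.comparingInt((Part p) -> p.position);

  /** Compares by start offset, then prefers the longer end offset (assumed tie-breaker). */
  static final Comparator<Part> BY_START_OFFSET =
      Comparator.comparingInt((Part p) -> p.startOffset)
          .thenComparing(Comparator.comparingInt((Part p) -> p.endOffset).reversed());

  /** Sorts the candidates with the given comparator and returns their texts in order. */
  static List<String> order(Comparator<Part> cmp) {
    List<Part> parts = candidates();
    parts.sort(cmp); // List.sort is stable, so tied parts keep their buffering order
    List<String> out = new ArrayList<>();
    for (Part p : parts) {
      out.add(p.text);
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(order(BY_POSITION));     // prints [other, 8other]
    System.out.println(order(BY_START_OFFSET)); // prints [8other, other]
  }
}
```

Under these assumptions, the position-only ordering leaves "other" first purely because of buffering order, while ordering by start offset puts the catenated "8other" first regardless of how the parts were buffered.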