[ https://issues.apache.org/jira/browse/LUCENE-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406227#comment-17406227 ]
Michael McCandless commented on LUCENE-10059: --------------------------------------------- Did we already backport this fix to 8.10 as well? In which case let's resolve this issue? {noformat} I can put a PR to fix the issue in nori then another (Jira) to refactor, or I can also try to put both in one PR. But the Kuromoji tokenizer is quite complicated and there could be a lot of code to refactor out, I think the former is better. Does anyone have a preference on this? {noformat} Yeah +1 to do the two PRs separately! > Assertion error in JapaneseTokenizer backtrace > ---------------------------------------------- > > Key: LUCENE-10059 > URL: https://issues.apache.org/jira/browse/LUCENE-10059 > Project: Lucene - Core > Issue Type: Bug > Affects Versions: 8.8 > Reporter: Anh Dung Bui > Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > > There is a rare case which causes an AssertionError in the backtrace step of > JapaneseTokenizer that we (Amazon Product Search) found in our tests. > If there is a text span of length 1024 (determined by > [MAX_BACKTRACE_GAP|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L116]) > where the regular backtrace is not called, a [forced > backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L781] > will be applied. If the partially best path at this point happens to end at > the last pos, and since there is always a [final > backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L1044] > applied at the end, the final backtrace will try to backtrace from and to > the same position, causing an AssertionError in RollingCharBuffer.get() when > it tries to generate an empty buffer. > We are fixing it by returning prematurely in the backtrace() method when the > from and to pos are the same: > {code:java} > if (endPos == lastBackTracePos) { > return; > } > {code} > The backtrace() method is essentially no-op when this condition happens, thus > when _-ea_ is not enabled, it can still output the correct tokens. > We will open a PR for this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org