[ https://issues.apache.org/jira/browse/LUCENE-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403497#comment-17403497 ]
Anh Dung Bui commented on LUCENE-10059:
---------------------------------------

The PR is merged, but I realized that [KoreanTokenizer|https://github.com/apache/lucene/blob/main/lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/KoreanTokenizer.java] has the same issue. I also realized that it shares a significant amount of code with JapaneseTokenizer. Should we try to have a base class (maybe in analysis/common?) for all the shared code?

> Assertion error in JapaneseTokenizer backtrace
> ----------------------------------------------
>
>                 Key: LUCENE-10059
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10059
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 8.8
>            Reporter: Anh Dung Bui
>            Priority: Major
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> There is a rare case that causes an AssertionError in the backtrace step of
> JapaneseTokenizer, which we (Amazon Product Search) found in our tests.
> If there is a text span of length 1024 (determined by
> [MAX_BACKTRACE_GAP|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L116])
> where the regular backtrace is not called, a [forced
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L781]
> will be applied. If the best partial path at this point happens to end at
> the last pos, then, because a [final
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L1044]
> is always applied at the end, the final backtrace will try to backtrace from
> and to the same position, causing an AssertionError in
> RollingCharBuffer.get() when it tries to generate an empty buffer.
> We are fixing it by returning early from the backtrace() method when the
> from and to pos are the same:
> {code:java}
> if (endPos == lastBackTracePos) {
>   return;
> }
> {code}
> The backtrace() method is essentially a no-op when this condition holds, so
> when _-ea_ is not enabled, the tokenizer can still output the correct tokens.
> We will open a PR for this issue.
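
For illustration only, here is a minimal, self-contained sketch of the situation described above. It is not the Lucene code: the class BacktraceGuardSketch, its get() helper and the guardEnabled flag are hypothetical stand-ins, and an explicit exception replaces the assertion so that the behaviour does not depend on _-ea_. It only shows why a second backtrace to the same position asks for an empty span, and how the early return avoids that.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the failure mode; NOT Lucene's JapaneseTokenizer.
class BacktraceGuardSketch {
  final List<String> emitted = new ArrayList<>();
  private final String text;
  private final boolean guardEnabled;
  private int lastBackTracePos = 0;

  BacktraceGuardSketch(String text, boolean guardEnabled) {
    this.text = text;
    this.guardEnabled = guardEnabled;
  }

  // Hypothetical stand-in for a buffer read that rejects empty spans.
  private char[] get(int posStart, int length) {
    if (length <= 0) {
      throw new IllegalStateException("empty span requested at pos " + posStart);
    }
    return text.substring(posStart, posStart + length).toCharArray();
  }

  // Emits the text between the previous backtrace position and endPos.
  void backtrace(int endPos) {
    if (guardEnabled && endPos == lastBackTracePos) {
      return; // nothing new to emit, so skip the empty read
    }
    char[] fragment = get(lastBackTracePos, endPos - lastBackTracePos);
    emitted.add(new String(fragment));
    lastBackTracePos = endPos;
  }

  public static void main(String[] args) {
    // A forced backtrace that ends at the last pos, then the final backtrace.
    BacktraceGuardSketch guarded = new BacktraceGuardSketch("abcdef", true);
    guarded.backtrace(6);
    guarded.backtrace(6);
    System.out.println("guarded: " + guarded.emitted);

    BacktraceGuardSketch unguarded = new BacktraceGuardSketch("abcdef", false);
    unguarded.backtrace(6);
    try {
      unguarded.backtrace(6);
    } catch (IllegalStateException e) {
      System.out.println("unguarded: " + e.getMessage());
    }
  }
}
```

In this toy model the guarded instance emits the text once and silently ignores the second call, while the unguarded instance fails on the second call because the requested span has length zero.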