[ https://issues.apache.org/jira/browse/LUCENE-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404552#comment-17404552 ]
Tomoko Uchida commented on LUCENE-10059:
----------------------------------------

I'm sorry, I'm not familiar with the particular reason why KoreanTokenizer had to be forked from JapaneseTokenizer when it was created. I think that might have been explained or discussed in the original KoreanTokenizer issue, but I can't find it right now.

> Assertion error in JapaneseTokenizer backtrace
> ----------------------------------------------
>
>                 Key: LUCENE-10059
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10059
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 8.8
>            Reporter: Anh Dung Bui
>            Priority: Major
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> There is a rare case that causes an AssertionError in the backtrace step of
> JapaneseTokenizer, which we (Amazon Product Search) found in our tests.
> If there is a text span of length 1024 (determined by
> [MAX_BACKTRACE_GAP|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L116])
> where the regular backtrace is not called, a [forced
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L781]
> will be applied. There is always a [final
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L1044]
> applied at the end, so if the partial best path at the forced backtrace
> happens to end at the last position, the final backtrace will try to
> backtrace from and to the same position. This causes an AssertionError in
> RollingCharBuffer.get() when it tries to generate an empty buffer.
> We are fixing it by returning early from the backtrace() method when the
> from and to positions are the same:
> {code:java}
> if (endPos == lastBackTracePos) {
>   return;
> }
> {code}
> The backtrace() method is essentially a no-op when this condition holds, so
> when _-ea_ is not enabled it still outputs the correct tokens.
> We will open a PR for this issue.
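
For illustration only, the guard described in the issue can be modelled with a small self-contained toy. This is a sketch under stated assumptions, not Lucene's actual JapaneseTokenizer code: the class `BacktraceGuardSketch`, its `guardEnabled` flag, and the `pending` list are hypothetical, and the private `get()` helper is only a stand-in for a buffer that rejects empty spans, as the issue reports for RollingCharBuffer.get(). The check throws explicitly instead of using `assert`, so the behaviour does not depend on `-ea`.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the early-return guard; NOT Lucene's JapaneseTokenizer. */
public class BacktraceGuardSketch {
    private final String text;
    private final boolean guardEnabled;            // hypothetical switch, for comparison only
    private int lastBackTracePos = 0;              // where the previous backtrace ended
    private final List<String> pending = new ArrayList<>();

    public BacktraceGuardSketch(String text, boolean guardEnabled) {
        this.text = text;
        this.guardEnabled = guardEnabled;
    }

    // Stand-in for a buffer accessor that refuses to produce an empty span.
    private String get(int start, int length) {
        if (length <= 0) {
            throw new AssertionError("empty span requested at pos " + start);
        }
        return text.substring(start, start + length);
    }

    /** Emits the span between the previous backtrace position and endPos. */
    public void backtrace(int endPos) {
        if (guardEnabled && endPos == lastBackTracePos) {
            return;                                // nothing new to emit: from == to
        }
        pending.add(get(lastBackTracePos, endPos - lastBackTracePos));
        lastBackTracePos = endPos;
    }

    public List<String> pending() {
        return pending;
    }

    public static void main(String[] args) {
        BacktraceGuardSketch guarded = new BacktraceGuardSketch("abcdef", true);
        guarded.backtrace(6);   // forced backtrace that happens to end at the last position
        guarded.backtrace(6);   // final backtrace: from == to, so the guard returns early
        System.out.println(guarded.pending());
    }
}
```

With `guardEnabled` set to `false`, the second `backtrace(6)` call reaches `get()` with a zero length and throws, which mirrors the failure mode described in the issue.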