[ https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892574#comment-16892574 ]
Tomoko Uchida commented on LUCENE-8933:
---------------------------------------

The surrogate pair emoji character 🙂 included in the user dictionary is problematic. Interestingly, it does not cause the error if it appears in the first column (for "normal" mode segmentation).

Could KoreanTokenizer have the same issue?

> JapaneseTokenizer creates Token objects with corrupt offsets
> ------------------------------------------------------------
>
>                 Key: LUCENE-8933
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8933
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing
> synonyms. It looks like the only reason why this might occur is if the offset
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
>     at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486) ~[?:?]
>     at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[elasticsearch-6.6.1.jar:6.6.1]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154) ~[elasticsearch-6.6.1.jar:6.6.1]
>     ... 24 more
> {noformat}
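
As background for the surrogate-pair remark in the comment above: in Java, a character outside the Basic Multilingual Plane such as 🙂 (U+1F642) occupies two `char` values, so an index counted in code points and an index counted in `char`s diverge after it. The sketch below only illustrates that general mismatch; it is not taken from JapaneseTokenizer or the user dictionary code, and the class name `SurrogatePairOffsets` and helper `charOffset` are hypothetical names chosen for this example.

```java
public class SurrogatePairOffsets {

    /** Converts a code point index into the corresponding UTF-16 char offset. */
    static int charOffset(String text, int codePointIndex) {
        return text.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        // "a", then U+1F642 written as its surrogate pair, then "b".
        String text = "a\uD83D\uDE42b";

        // Four UTF-16 code units, but only three code points.
        System.out.println(text.length());                         // 4
        System.out.println(text.codePointCount(0, text.length())); // 3

        // 'b' is the third code point (index 2) but sits at char offset 3.
        System.out.println(charOffset(text, 2));                   // 3

        // Using the code point index directly as a char index lands on the
        // low surrogate of the emoji rather than on 'b'.
        System.out.println(text.charAt(2) == 'b');                 // false
    }
}
```

If lengths or offsets are computed in one unit and consumed in the other, they can point outside the intended range of a `char[]` buffer. Whether that is what happens in this issue is an assumption that would need to be confirmed against the tokenizer code.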