[ https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893478#comment-16893478 ]
Jim Ferenczi commented on LUCENE-8933:
--------------------------------------
{quote}
If there are no other opinions or objections, I'd like to create a patch that
adds a validation rule to the UserDictionary.
{quote}
Thanks [~tomoko]!
{quote}
For the purpose of format validation, I think it would be better to check that
the sum of the lengths of the segments equals the length of the surface form.
That is, we should also reject an entry such as "aabbcc,a b c,aa bb cc,pos_tag",
even though it does not cause any exception.
{quote}
+1
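To make the proposed rule concrete, here is a minimal sketch of the length check, not the actual patch. It assumes an entry is a plain comma-separated line of the form "surface,segmentation,readings,pos_tag" with no quoted commas, and that segments are separated by spaces. The class and method names are hypothetical and are not part of Lucene.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: checks one user dictionary line of
// the assumed form "surface,segmentation,readings,pos_tag".
final class UserDictionaryEntryCheck {
    // Segments are assumed to be separated by one or more spaces.
    private static final Pattern SPACES = Pattern.compile(" +");

    private UserDictionaryEntryCheck() {}

    /** Returns true when the segment lengths add up to the surface form length. */
    static boolean segmentsMatchSurface(String line) {
        // Naive split: assumes no field contains a quoted comma.
        String[] fields = line.split(",", -1);
        if (fields.length < 2) {
            throw new IllegalArgumentException("Malformed entry: " + line);
        }
        String surface = fields[0];
        int total = 0;
        for (String segment : SPACES.split(fields[1].trim())) {
            total += segment.length();
        }
        // Lengths are counted in UTF-16 chars, the same unit as token offsets.
        return total == surface.length();
    }
}
```

With this check, "aabbcc,a b c,aa bb cc,pos_tag" is rejected (segments total 3 chars, surface form has 6), while "aabbcc,aa bb cc,aa bb cc,pos_tag" is accepted.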
> JapaneseTokenizer creates Token objects with corrupt offsets
> ------------------------------------------------------------
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing
> synonyms. It looks like the only reason why this might occur is if the offset
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
>     at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486) ~[?:?]
>     at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[elasticsearch-6.6.1.jar:6.6.1]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
>     at org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154) ~[elasticsearch-6.6.1.jar:6.6.1]
>     ... 24 more
> {noformat}