[ https://issues.apache.org/jira/browse/LUCENE-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828848#comment-15828848 ]
Hoss Man commented on LUCENE-7635:
----------------------------------

I'm not very familiar with Kuromoji, but I believe the lines you're deleting in this patch are intended to catch comments at the _end_ of a line, not just the beginning, i.e.:

{noformat}
# comment at start of line
朝青龍,朝青龍,アサショウリュウ,カスタム人名 # end line comment, has a comma in it
# spans more then one line
abcd,a b cd,foo1 foo2 foo3,bar # Another end line comment
{noformat}

Since the intent of the UserDict format seems to be "CSV with '#' comments", the comment stripping should probably be moved to o.a.l.analysis.ja.util.CSVUtil, where it can be done if-and-only-if the '#' is not part of a quoted value:

{noformat}
朝青龍,朝青龍,アサショウリュウ,カスタム人名 # end line comment, has a comma in it
# spans more then one line
abcd,a b cd,foo1 foo2 foo3,bar # Another end line comment
"quoted#sharp",other,"quoted,stuff" # yet another end line comment
{noformat}

i.e.: add a {{if(c == '#' && !insideQuote)}} block (similar to the existing {{COMMA}} conditional) to CSVUtil.parse() that would (trim and) add the final value to result and break out of the for loop?

> Kuromoji fails if user dictionary contains #
> --------------------------------------------
>
>                 Key: LUCENE-7635
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7635
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Masaru Hasegawa
>         Attachments: LUCENE-7635.patch
>
>
> If user dictionary contains entries like:
> {code}
> withsharp#,withsharp#,withsharp#,カスタム名詞
> {code}
> It fails to create dictionary throwing java.lang.ArrayIndexOutOfBoundsException.
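
To make the suggestion in the comment concrete, here is a rough, standalone sketch of a quote-aware line parser that stops at the first unquoted '#'. This is not the actual CSVUtil code: the class name {{UserDictLineSketch}} and its {{parse}} method are hypothetical, and the sketch ignores escaped quotes and other details a real implementation would need to handle.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; not the real o.a.l.analysis.ja.util.CSVUtil. */
public class UserDictLineSketch {

  private static final char QUOTE = '"';
  private static final char COMMA = ',';
  private static final char COMMENT = '#';

  /**
   * Splits one line into trimmed values. An unquoted '#' ends the line;
   * a '#' or ',' inside double quotes is kept as part of the value.
   * Returns an empty list for blank or comment-only lines.
   */
  public static List<String> parse(String line) {
    List<String> result = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean insideQuote = false;

    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (c == QUOTE) {
        // Toggle quote state; the quote character itself is dropped.
        insideQuote = !insideQuote;
      } else if (c == COMMA && !insideQuote) {
        // Unquoted comma: finish the current value.
        result.add(current.toString().trim());
        current.setLength(0);
      } else if (c == COMMENT && !insideQuote) {
        // Unquoted '#': the rest of the line is a comment.
        break;
      } else {
        current.append(c);
      }
    }

    // Add the final value, unless the line held no values at all.
    String last = current.toString().trim();
    if (!last.isEmpty() || !result.isEmpty()) {
      result.add(last);
    }
    return result;
  }
}
```

Under this sketch, an entry like the one in the issue description would be cut off at the first '#' unless its values are quoted, e.g. {{"withsharp#","withsharp#","withsharp#",カスタム名詞}}, so whether that trade-off is acceptable for existing user dictionaries would need to be decided separately.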