[ https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918464#comment-16918464 ]
Christian Moen commented on LUCENE-8959:
----------------------------------------

Sounds like a good idea. This is also a rather big rabbit hole...

Would it be useful to consider making the digit grouping separators configurable as part of a bigger scheme here?

In Japanese, if you're processing text with SI numbers, I believe space is a valid digit grouping separator.

> JapaneseNumberFilter does not take whitespaces into account when concatenating numbers
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8959
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are
> separated by whitespace. So for instance "10 100" is rewritten into "10100"
> even if the tokenizer doesn't discard punctuation. In practice this is not
> an issue but this can lead to giant number of tokens if there are a lot of
> numbers separated by spaces. The number of concatenations should be
> configurable with a sane default limit in order to avoid creating big tokens
> that slow down the analysis.
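
To make the two ideas in this thread concrete (configurable digit grouping separators, and a cap on how many numbers get concatenated), here is a rough plain-Java sketch. It is not Lucene code and does not use the JapaneseNumberFilter API; the class NumberConcatSketch, the method concat, and the parameters groupingSeparators and maxConcat are all hypothetical names made up for illustration, and the token list stands in for whatever the tokenizer would emit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
class NumberConcatSketch {

    // True if the token is non-empty and consists only of digits.
    static boolean isDigits(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isDigit(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Joins runs of digit tokens. A run continues across a separator token only
     * if that separator is listed in groupingSeparators (an assumed setting), and
     * a run contains at most maxConcat digit tokens (an assumed limit).
     */
    static List<String> concat(List<String> tokens, Set<String> groupingSeparators, int maxConcat) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String token = tokens.get(i);
            if (!isDigits(token)) {
                out.add(token); // non-numeric tokens pass through unchanged
                i++;
                continue;
            }
            StringBuilder run = new StringBuilder(token);
            int parts = 1;
            int next = i + 1; // index of the candidate separator
            while (parts < maxConcat
                    && next + 1 < tokens.size()
                    && groupingSeparators.contains(tokens.get(next))
                    && isDigits(tokens.get(next + 1))) {
                run.append(tokens.get(next + 1)); // drop the separator, keep the digits
                parts++;
                next += 2;
            }
            out.add(run.toString());
            i = next;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("10", " ", "100");
        // Space configured as a grouping separator: the two numbers are joined.
        System.out.println(concat(tokens, Set.of(" "), 8)); // [10100]
        // No grouping separators configured: the tokens are left as they are.
        System.out.println(concat(tokens, Set.of(), 8));
    }
}
```

With a limit in place, a long line of space-separated numbers is split into several bounded tokens rather than one very large one, which is the slowdown the issue description is worried about.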