[
https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918464#comment-16918464
]
Christian Moen commented on LUCENE-8959:
----------------------------------------
Sounds like a good idea. This is also a rather big rabbit hole...
Would it be useful to consider making the digit grouping separators
configurable as part of a bigger scheme here?
In Japanese, if you're processing text with SI numbers, I believe a space is a
valid digit grouping separator.
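
To make the idea concrete, here is a rough sketch of what configurable
grouping separators combined with a concatenation limit could look like. This
is not the JapaneseNumberFilter code and not a Lucene API: the class
DigitGroupJoiner, its numbers() method, and the maxGroups parameter are all
hypothetical names made up for illustration, and it works on a plain string
rather than a token stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
public class DigitGroupJoiner {
    private final Set<Character> separators; // characters allowed between digit groups
    private final int maxGroups;             // upper bound on groups joined into one number

    public DigitGroupJoiner(Set<Character> separators, int maxGroups) {
        this.separators = separators;
        this.maxGroups = maxGroups;
    }

    // Returns the numbers found in text, joining adjacent digit groups that are
    // split by exactly one configured separator, up to maxGroups per number.
    public List<String> numbers(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int groups = 0;
        int i = 0;
        int n = text.length();
        while (i < n) {
            if (!Character.isDigit(text.charAt(i))) {
                i++; // skip anything that is not part of a number
                continue;
            }
            int start = i;
            while (i < n && Character.isDigit(text.charAt(i))) {
                i++;
            }
            current.append(text, start, i);
            groups++;
            boolean joinNext = groups < maxGroups
                    && i + 1 < n
                    && separators.contains(text.charAt(i))
                    && Character.isDigit(text.charAt(i + 1));
            if (joinNext) {
                i++; // consume the single separator and keep accumulating
            } else {
                out.add(current.toString());
                current.setLength(0);
                groups = 0;
            }
        }
        return out;
    }
}
```

With Set.of(' ') as the separators, "10 100" comes out as the single number
"10100"; with Set.of(',') the space is not a separator and the two numbers
stay apart. The maxGroups bound is what would keep a long run of
space-separated numbers from collapsing into one giant token.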
> JapaneseNumberFilter does not take whitespaces into account when
> concatenating numbers
> --------------------------------------------------------------------------------------
>
> Key: LUCENE-8959
> URL: https://issues.apache.org/jira/browse/LUCENE-8959
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are
> separated by whitespace. So for instance "10 100" is rewritten into "10100"
> even if the tokenizer doesn't discard punctuation. In practice this is not
> an issue, but it can lead to a giant number of tokens if there are a lot of
> numbers separated by spaces. The number of concatenations should be
> configurable, with a sane default limit, in order to avoid creating big tokens
> that slow down the analysis.