[ 
https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918464#comment-16918464
 ] 

Christian Moen commented on LUCENE-8959:
----------------------------------------

Sounds like a good idea.  This is also a rather big rabbit hole... 

Would it be useful to consider making the digit grouping separators 
configurable as part of a bigger scheme here?

In Japanese, if you're processing text with SI numbers, I believe a space is a 
valid digit grouping separator.

> JapaneseNumberFilter does not take whitespaces into account when 
> concatenating numbers
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8959
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are 
> separated by whitespace. For instance, "10 100" is rewritten into "10100" 
> even if the tokenizer doesn't discard punctuation. In practice this is not 
> an issue, but it can lead to a giant number of tokens if there are a lot of 
> numbers separated by spaces. The number of concatenations should be 
> configurable, with a sane default limit, in order to avoid creating big tokens 
> that slow down the analysis.
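
To make the two ideas in this thread concrete (configurable digit grouping separators, and a cap on the number of concatenations), here is a minimal standalone sketch. It is not the JapaneseNumberFilter code and does not use any Lucene API; the class name NumberJoinSketch, the method join, and the parameters separators and maxJoined are all hypothetical names chosen for illustration. It also assumes the input is already a flat list of token strings and only treats ASCII digits as numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
public class NumberJoinSketch {

    /** True if the token is non-empty and consists only of ASCII digits. */
    static boolean isDigits(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    /**
     * Joins runs of numeric tokens separated by a configured separator token.
     *
     * @param tokens     input tokens, with separators kept as their own tokens
     * @param separators tokens accepted as digit grouping separators (assumed config)
     * @param maxJoined  maximum number of numeric tokens merged into one (assumed config)
     */
    static List<String> join(List<String> tokens, Set<String> separators, int maxJoined) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String token = tokens.get(i);
            if (!isDigits(token)) {
                out.add(token);
                i++;
                continue;
            }
            StringBuilder number = new StringBuilder(token);
            int joined = 1;
            int next = i + 1;
            // Absorb "<separator> <digits>" pairs while under the cap.
            while (joined < maxJoined
                    && next + 1 < tokens.size()
                    && separators.contains(tokens.get(next))
                    && isDigits(tokens.get(next + 1))) {
                number.append(tokens.get(next + 1));
                joined++;
                next += 2;
            }
            out.add(number.toString());
            i = next;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("10", " ", "100", " ", "1000");
        System.out.println(join(tokens, Set.of(), 8));    // no separators configured
        System.out.println(join(tokens, Set.of(" "), 8)); // space accepted as separator
        System.out.println(join(tokens, Set.of(" "), 2)); // cap of two numbers per token
    }
}
```

With an empty separator set the tokens pass through unchanged; with a space configured as a separator, "10 100 1000" becomes the single token "101001000"; with the cap set to 2, only "10" and "100" are merged and "1000" stays separate. Whether the cap should count numbers or characters is an open design choice that this sketch does not settle.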



