[ https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Ferenczi updated LUCENE-8959:
---------------------------------
    Description: Today the JapaneseNumberFilter tries to concatenate numbers 
even if they are separated by whitespace. For instance, "10 100" is rewritten 
into "10100", even if the tokenizer doesn't discard punctuation. In practice 
this is not an issue, but it can lead to giant number tokens if there are a 
lot of numbers separated by spaces. The number of concatenations should be 
configurable, with a sane default limit, in order to avoid creating giant 
tokens that slow down analysis if the tokenizer is not correctly configured.  
(was: Today the JapaneseNumberFilter tries to concatenate numbers even if they 
are separated by whitespace. For instance, "10 100" is rewritten into "10100" 
even if the tokenizer doesn't discard punctuation. In practice this is not an 
issue, but it can lead to giant number tokens if there are a lot of numbers 
separated by spaces. The number of concatenations should be configurable, with 
a sane default limit, in order to avoid creating big tokens that slow down 
analysis.)
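
For illustration only (not part of the issue): a minimal sketch, assuming a 
JapaneseTokenizer built with discardPunctuation=false feeding a 
JapaneseNumberFilter, that exercises the "10 100" example from the 
description. The field name "f" is arbitrary.

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseNumberFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NumberConcatDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        // discardPunctuation = false: keep punctuation/whitespace tokens
        Tokenizer tokenizer =
            new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH);
        return new TokenStreamComponents(tokenizer,
            new JapaneseNumberFilter(tokenizer));
      }
    };
    try (TokenStream ts = analyzer.tokenStream("f", "10 100")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // Per the description, the two numbers come out joined as "10100"
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}
{code}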

> JapaneseNumberFilter does not take whitespaces into account when 
> concatenating numbers
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8959
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are 
> separated by whitespace. For instance, "10 100" is rewritten into "10100", 
> even if the tokenizer doesn't discard punctuation. In practice this is not 
> an issue, but it can lead to giant number tokens if there are a lot of 
> numbers separated by spaces. The number of concatenations should be 
> configurable, with a sane default limit, in order to avoid creating giant 
> tokens that slow down analysis if the tokenizer is not correctly configured.
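
Not proposed in the issue, just an assumption-laden workaround sketch: until 
JapaneseNumberFilter exposes a concatenation limit, oversized number tokens 
produced by runaway concatenation could be dropped downstream with 
LengthFilter. The 32-character threshold below is arbitrary and only stands 
in for the "sane default limit" mentioned in the description.

{code:java}
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseNumberFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;

public class DropGiantNumberTokens {
  /** Builds tokenizer -> number filter -> length filter; tokens longer than
   *  32 characters are discarded instead of being indexed. */
  public static TokenStream build() {
    Tokenizer tokenizer =
        new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH);
    TokenStream stream = new JapaneseNumberFilter(tokenizer);
    return new LengthFilter(stream, 1, 32);
  }
}
{code}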



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
