[ 
https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918532#comment-16918532
 ] 

Jim Ferenczi commented on LUCENE-8959:
--------------------------------------

*Update:* Whitespaces were removed in my tests because I was using the default 
JapanesePartOfSpeechStopFilter before the JapaneseNumberFilter. The behavior is 
correct when discardPunctuations is correctly set and the 
JapaneseNumberFilter is the first filter in the chain. We could 
protect against the rabbit hole for users that forget to set 
discardPunctuations to false or remove the whitespaces in a preceding filter 
but the behavior is correct. Sorry for the false alarm.

> JapaneseNumberFilter does not take whitespaces into account when 
> concatenating numbers
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8959
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are 
> separated by whitespaces. So for instance "10 100" is rewritten into "10100" 
> even if the tokenizer doesn't discard punctuations. In practice this is not 
> an issue but this can lead to giant number of tokens if there are a lot of 
> numbers separated by spaces. The number of concatenations should be 
> configurable with a sane default limit in order to avoid creating big tokens 
> that slow down the analysis.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)