[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845856#comment-16845856
 ] 

Jim Ferenczi commented on LUCENE-8784:
--------------------------------------

Hi [~danmuzi],
I don't think we should have one option for every punctuation type, and the
current check in the patch, based on Character.OTHER_PUNCTUATION, would match
more than just the full stop character. If we want to preserve punctuation we
can add the same option as Kuromoji (discardPunctuation) and output a
token for each punctuation group. So for an input like "10.1?" we would output
4 tokens: "10", ".", "1", "?". Then, if you need to "regroup" tokens based on
additional rules, you can add another filter to do this, as the
JapaneseNumberFilter does. The other option would be to detect numbers with
decimal points accurately, as the standard tokenizer does, but we don't want to
reinvent the wheel either. If we want the same grouping for unknown words in
this tokenizer, we should probably implement it on top of the standard or ICU
tokenizer directly.
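
To make the proposal concrete, here is a minimal plain-Java sketch of the two
steps described above: emitting one token per punctuation group, then
regrouping decimals in a separate pass. It does not use Lucene and is not the
patch or the Nori implementation; the class and the methods split,
regroupDecimals and isOtherPunctuation are hypothetical names chosen for
illustration, and only the discardPunctuation flag name comes from the comment.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-alone model of the behaviour discussed in the
// comment, not Lucene code. All names here are assumptions.
class PunctuationGroupSketch {

    // Splits the input into runs of letters/digits and runs of everything
    // else (treated as punctuation). Whitespace separates tokens and is
    // dropped. With discardPunctuation=true, punctuation runs are not emitted.
    static List<String> split(String input, boolean discardPunctuation) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            boolean word = Character.isLetterOrDigit(c);
            int start = i;
            while (i < input.length()
                    && !Character.isWhitespace(input.charAt(i))
                    && Character.isLetterOrDigit(input.charAt(i)) == word) {
                i++;
            }
            if (word || !discardPunctuation) {
                tokens.add(input.substring(start, i));
            }
        }
        return tokens;
    }

    // Second pass, in the spirit of a token filter: merges the sequence
    // <digits> "." <digits> back into a single decimal token.
    static List<String> regroupDecimals(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 2 < tokens.size()
                    && isDigits(tokens.get(i))
                    && tokens.get(i + 1).equals(".")
                    && isDigits(tokens.get(i + 2))) {
                out.add(tokens.get(i) + "." + tokens.get(i + 2));
                i += 3;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    static boolean isDigits(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isDigit);
    }

    // Shows why a check on Character.OTHER_PUNCTUATION is broader than the
    // full stop: '?', '!' and ',' fall in the same Unicode category.
    static boolean isOtherPunctuation(char c) {
        return Character.getType(c) == Character.OTHER_PUNCTUATION;
    }

    public static void main(String[] args) {
        List<String> tokens = split("10.1?", false);
        System.out.println(tokens);                   // [10, ., 1, ?]
        System.out.println(regroupDecimals(tokens));  // [10.1, ?]
        System.out.println(split("10.1?", true));     // [10, 1]
        System.out.println(isOtherPunctuation('?'));  // true
    }
}
```

Keeping the regrouping in a separate pass mirrors the suggestion to leave the
tokenizer simple and put additional rules in a filter.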

>  Nori(Korean) tokenizer removes the decimal point. 
> ---------------------------------------------------
>
>                 Key: LUCENE-8784
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8784
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Munkyu Im
>            Priority: Major
>         Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I reported at
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367].
> Unlike the standard analyzer, the nori analyzer removes the decimal point:
> the nori tokenizer removes the "." character by default.
> This makes it difficult to index keywords that include a decimal
> point.
> It would be nice to have an option that controls whether the decimal point
> is kept. Like the Japanese tokenizer, Nori needs an option to preserve the
> decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
