[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845856#comment-16845856 ]
Jim Ferenczi commented on LUCENE-8784:
--------------------------------------
Hi [~danmuzi],
I don't think we should have one option for every punctuation type, and the
current check in the patch, based on Character.OTHER_PUNCTUATION, would match
more than just the full stop character. If we want to preserve punctuation, we
can add the same option as Kuromoji (discardPunctuation) and output a
token for each punctuation group. So for an input like "10.1?" we would output
4 tokens: "10", ".", "1", "?". Then, if you need to "regroup" tokens based on
additional rules, you can add another filter to do this, as the
JapaneseNumberFilter does. The other option would be to detect numbers with
decimal points accurately, as the standard tokenizer does, but we don't want to
reinvent the wheel either. If we want the same grouping for unknown words in
this tokenizer, we should probably implement it on top of the standard or ICU
tokenizer directly.
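
To make the proposal concrete, here is a minimal plain-Java sketch of the two
steps described above: emitting one token per punctuation group, then
regrouping "digits . digits" in a separate pass. It does not use Lucene and is
not the Nori or Kuromoji implementation. The class PunctuationSketch, its
method names, and the choice to treat every Unicode punctuation category as
punctuation are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class PunctuationSketch {

    // Assumption: any Unicode punctuation category counts as punctuation.
    static boolean isPunctuation(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Splits text into runs of punctuation and runs of non-punctuation,
    // skipping whitespace. Punctuation runs are dropped when
    // discardPunctuation is true.
    static List<String> tokenize(String text, boolean discardPunctuation) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                i += Character.charCount(cp);
                continue;
            }
            boolean punct = isPunctuation(cp);
            int start = i;
            // Extend the run while the character class stays the same.
            while (i < text.length()) {
                int c = text.codePointAt(i);
                if (Character.isWhitespace(c) || isPunctuation(c) != punct) {
                    break;
                }
                i += Character.charCount(c);
            }
            if (!punct || !discardPunctuation) {
                tokens.add(text.substring(start, i));
            }
        }
        return tokens;
    }

    static boolean isDigits(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isDigit);
    }

    // Second pass: merges the token triple digits, ".", digits into one token.
    // A real filter would also check offsets so that "10 . 1" is not merged;
    // this sketch ignores offsets.
    static List<String> regroupDecimals(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 2 < tokens.size()
                    && isDigits(tokens.get(i))
                    && tokens.get(i + 1).equals(".")
                    && isDigits(tokens.get(i + 2))) {
                out.add(tokens.get(i) + "." + tokens.get(i + 2));
                i += 3;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("10.1?", false);
        System.out.println(tokens);
        System.out.println(regroupDecimals(tokens));
        System.out.println(tokenize("10.1?", true));
    }
}
```

With discardPunctuation set to false, the input "10.1?" yields the four tokens
from the example above, and the second pass can then rebuild "10.1" without the
tokenizer needing a dedicated decimal-point option.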
> Nori(Korean) tokenizer removes the decimal point.
> ---------------------------------------------------
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Munkyu Im
> Priority: Major
> Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367].
> Unlike the standard analyzer, the nori analyzer removes the decimal point:
> the nori tokenizer removes the "." character by default.
> This makes it difficult to index keywords that include a decimal
> point.
> It would be nice to have an option that controls whether the decimal point
> is kept. Like the Japanese tokenizer, Nori needs an option to preserve the
> decimal point.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)