[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845856#comment-16845856 ]
Jim Ferenczi commented on LUCENE-8784:
--------------------------------------

Hi [~danmuzi],

I don't think we should have one option for every punctuation type, and the current check in the patch, which is based on Character.OTHER_PUNCTUATION, would match more than just the full stop character. If we want to preserve punctuation, we can add the same option that Kuromoji has (discardPunctuation) and output a token for each punctuation group. So for an input like "10.1?" we would output 4 tokens: "10", ".", "1", "?". Then, if you need to regroup tokens based on additional rules, you can add another filter to do so, as the JapaneseNumberFilter does.

The other option would be to detect numbers with decimal points accurately, as the standard tokenizer does, but we don't want to reinvent the wheel either. If we want the same grouping for unknown words in this tokenizer, we should probably implement it on top of the standard or ICU tokenizer directly.

> Nori(Korean) tokenizer removes the decimal point.
> ---------------------------------------------------
>
>                 Key: LUCENE-8784
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8784
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Munkyu Im
>            Priority: Major
>         Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the nori analyzer removes the decimal point.
> The nori tokenizer removes the "." character by default.
> This makes it difficult to index keywords that include a decimal
> point.
> It would be nice to have an option that controls whether the decimal point is kept.
> Like the Japanese tokenizer, Nori needs an option to preserve the decimal point.
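
To make the suggestion in the comment above concrete, here is a minimal sketch in plain Java of the two steps it describes: emitting one token per punctuation group, then regrouping decimals in a separate pass. It does not use Lucene's API. The class PunctuationGroupSketch and the methods tokenize, regroupDecimals and isDigits are hypothetical names made up for illustration, and the discardPunctuation parameter only mirrors the name of the Kuromoji option mentioned above. Treating every non-whitespace, non-alphanumeric character as punctuation is also a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationGroupSketch {

    // Hypothetical helper, not part of Lucene: splits the input into runs of
    // letters/digits and runs of other non-whitespace characters (treated here
    // as "punctuation"). Whitespace separates tokens and is never emitted.
    static List<String> tokenize(String input, boolean discardPunctuation) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) {
                i++;
                continue;
            }
            boolean word = Character.isLetterOrDigit(input.charAt(i));
            int start = i;
            // Extend the run while the character class stays the same.
            while (i < input.length()
                    && !Character.isWhitespace(input.charAt(i))
                    && Character.isLetterOrDigit(input.charAt(i)) == word) {
                i++;
            }
            if (word || !discardPunctuation) {
                tokens.add(input.substring(start, i));
            }
        }
        return tokens;
    }

    // Hypothetical "regroup" pass: merges digits, ".", digits into one token,
    // loosely analogous to doing the grouping in a separate filter.
    static List<String> regroupDecimals(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 2 < tokens.size()
                    && isDigits(tokens.get(i))
                    && tokens.get(i + 1).equals(".")
                    && isDigits(tokens.get(i + 2))) {
                out.add(tokens.get(i) + "." + tokens.get(i + 2));
                i += 3;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    static boolean isDigits(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("10.1?", false));                  // [10, ., 1, ?]
        System.out.println(tokenize("10.1?", true));                   // [10, 1]
        System.out.println(regroupDecimals(tokenize("10.1?", false))); // [10.1, ?]
    }
}
```

Keeping the regrouping in its own pass means the tokenizer needs only a single on/off option, while rules such as decimal handling can be changed without touching it.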