[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849559#comment-16849559 ]

ASF subversion and git services commented on LUCENE-8784:

Commit bf0d6fad4294b1d0e7c17b819be0f730d4e59c38 in lucene-solr's branch refs/heads/branch_8x from Jim Ferenczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bf0d6fa ]

LUCENE-8784: Restore the Korean's part of speech tag for NGRAM.

The part of speech tag for unigram has been changed inadvertently in a previous commit (not released). This change restores the original value that is also set on the serialized unknown dictionary.

> Nori(Korean) tokenizer removes the decimal point.
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Munkyu Im
> Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the nori analyzer removes the decimal point.
> The nori tokenizer removes the "." character by default.
> In this case, it is difficult to index keywords that include a decimal point.
> It would be nice to have an option that controls whether the decimal point is kept.
> Like the Japanese tokenizer, Nori needs an option to preserve the decimal point.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849557#comment-16849557 ]

ASF subversion and git services commented on LUCENE-8784:

Commit db334c792bedb2b385f72098a6a61458a05667e3 in lucene-solr's branch refs/heads/master from Jim Ferenczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=db334c7 ]

LUCENE-8784: Restore the Korean's part of speech tag for NGRAM.

The part of speech tag for unigram has been changed inadvertently in a previous commit (not released). This change restores the original value that is also set on the serialized unknown dictionary.
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849000#comment-16849000 ]

Namgyu Kim commented on LUCENE-8784:

Oh, I see that discardPunctuation has been removed from KoreanAnalyzer.
Thank you very much for applying my patch, [~jim.ferenczi]! :D
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848937#comment-16848937 ]

ASF subversion and git services commented on LUCENE-8784:

Commit 990f62aca47fc0571c79e80f25ee1d0b7dc4a824 in lucene-solr's branch refs/heads/branch_8x from Namgyu Kim
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=990f62a ]

LUCENE-8784: The KoreanTokenizer now preserves punctuations if discardPunctuation is set to false (defaults to true).

Signed-off-by: Namgyu Kim
Signed-off-by: jimczi
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848925#comment-16848925 ]

ASF subversion and git services commented on LUCENE-8784:

Commit a556925eb8aaea61e55010a6e5c90520139b56c3 in lucene-solr's branch refs/heads/master from Namgyu Kim
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a556925 ]

LUCENE-8784: The KoreanTokenizer now preserves punctuations if discardPunctuation is set to false (defaults to true).

Signed-off-by: Namgyu Kim
Signed-off-by: jimczi
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847706#comment-16847706 ]

Namgyu Kim commented on LUCENE-8784:

Thanks, [~jim.ferenczi]! :D
If there is something wrong, I would appreciate it if you let me know.
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847695#comment-16847695 ]

Jim Ferenczi commented on LUCENE-8784:

The last patch for this issue looks good to me. I'll test locally and merge if all tests pass.
Thanks for opening LUCENE-8812, I'll take a look when this issue gets merged.
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847659#comment-16847659 ]

Namgyu Kim commented on LUCENE-8784:

That's a good idea :D

I linked LUCENE-8784 (discardPunctuation) with LUCENE-8812 (KoreanNumberFilter).
(Apply LUCENE-8784 *first* and then LUCENE-8812.)

Your suggestion made this issue cleaner.
In LUCENE-8784, I did not change the existing test cases; I only added new test cases for discardPunctuation.
(The current constructor remains, so the existing API is still available.)
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847381#comment-16847381 ]

Jim Ferenczi commented on LUCENE-8784:

{quote}
By the way, wouldn't it be better to keep the constructors that do not take the discardPunctuation parameter?
(Otherwise existing Nori users would have to modify their code after upgrading.)
{quote}
Yes, we should do that; otherwise it's a breaking change and we cannot push to 8x.

{quote}
I also added Javadoc for discardPunctuation in your patch.
(KoreanAnalyzer, KoreanTokenizerFactory)
{quote}
Thanks!

{quote}
I developed KoreanNumberFilter by referring to JapaneseNumberFilter.
Please check my patch :D
{quote}
The patch looks good, but we should iterate on this in a new issue. We try to do one feature at a time in a single issue, so let's add discardPunctuation in this one and open a new issue as a follow-up to add the KoreanNumberFilter?
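The backward-compatibility point above can be illustrated with a small, self-contained sketch. It is not the code from the patch: `SketchTokenizer` and its `outputUnknownUnigrams` argument are hypothetical stand-ins, and only the default value (`true`) is taken from the commit message in this thread. The idea is that the existing constructor stays and delegates to the new one, so existing callers compile unchanged.

```java
public class ConstructorCompatSketch {

    // Hypothetical stand-in for a tokenizer class; not the real KoreanTokenizer.
    static final class SketchTokenizer {
        // Default taken from the commit message: punctuation is discarded unless asked otherwise.
        static final boolean DEFAULT_DISCARD_PUNCTUATION = true;

        private final boolean outputUnknownUnigrams;
        private final boolean discardPunctuation;

        // Existing constructor, kept so that current callers do not break.
        SketchTokenizer(boolean outputUnknownUnigrams) {
            this(outputUnknownUnigrams, DEFAULT_DISCARD_PUNCTUATION);
        }

        // New constructor that exposes the additional option.
        SketchTokenizer(boolean outputUnknownUnigrams, boolean discardPunctuation) {
            this.outputUnknownUnigrams = outputUnknownUnigrams;
            this.discardPunctuation = discardPunctuation;
        }

        boolean outputUnknownUnigrams() {
            return outputUnknownUnigrams;
        }

        boolean discardPunctuation() {
            return discardPunctuation;
        }
    }

    public static void main(String[] args) {
        SketchTokenizer oldStyle = new SketchTokenizer(false);        // pre-existing call site
        SketchTokenizer newStyle = new SketchTokenizer(false, false); // opts in to keeping punctuation
        System.out.println(oldStyle.discardPunctuation());
        System.out.println(newStyle.discardPunctuation());
    }
}
```

With this shape, removing the one-argument constructor would be the breaking change; adding the two-argument one is not.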
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846877#comment-16846877 ]

Namgyu Kim commented on LUCENE-8784:

Oh, I forgot.
I also added Javadoc for discardPunctuation in your patch.
(KoreanAnalyzer, KoreanTokenizerFactory)
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846871#comment-16846871 ]

Namgyu Kim commented on LUCENE-8784:

Thank you for your reply, [~jim.ferenczi]!

Your approach looks awesome.
I developed KoreanNumberFilter by referring to JapaneseNumberFilter.
Please check my patch :D
(Use "git apply --whitespace=fix LUCENE-8784.patch" because of a trailing whitespace error :()

I did not set KoreanNumberFilter as a default filter in KoreanAnalyzer.

By the way, wouldn't it be better to keep the constructors that do not take the discardPunctuation parameter?
(Otherwise existing Nori users would have to modify their code after upgrading.)
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845933#comment-16845933 ]

Namgyu Kim commented on LUCENE-8784:

Thank you for your reply, [~jim.ferenczi] :D

I tried to handle only the "." character in the Tokenizer, because Korean text can contain whitespace within a sentence, whereas Japanese text does not.
(Character.OTHER_PUNCTUATION would match more than just the full stop character. => Right. That's a problem. I have to change that part...)

JapaneseTokenizer keeps whitespace when discardPunctuation is set to false.
(Example: "十万二千五 百 二千五" (meaning "102005 100 2005"). If we run the JapaneseTokenizer with discardPunctuation=false and JapaneseNumberFilter, we get: {"102005", " ", "100", " ", "2005"})

Of course, we could remove those tokens with a StopFilter or handle them inside another filter, but is that okay?

Developing a NumberFilter looks much more flexible and structurally cleaner than handling this inside the Tokenizer.
I implemented it the way I did because of the problem above. How should we handle those spaces?

I think there are several ways to handle this problem:
1) Remove whitespace from the punctuation list in the Tokenizer.
2) Use a TokenFilter to remove whitespace.
3) Remove whitespace in KoreanNumberFilter. (This looks structurally strange...)
4) Just leave the whitespace.
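Option 2 above (dropping whitespace tokens in a later filtering step) can be sketched on a plain list of strings. This is only an illustration of the idea using the token list quoted in the comment; it does not use Lucene's TokenFilter API, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WhitespaceTokenSketch {

    // True if the token is non-empty and consists only of whitespace characters.
    static boolean isWhitespaceToken(String token) {
        return !token.isEmpty() && token.chars().allMatch(Character::isWhitespace);
    }

    // Returns a new list without the whitespace-only tokens; all other tokens keep their order.
    static List<String> dropWhitespaceTokens(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isWhitespaceToken(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Token list from the JapaneseTokenizer example in the comment above.
        List<String> tokens = Arrays.asList("102005", " ", "100", " ", "2005");
        System.out.println(dropWhitespaceTokens(tokens)); // prints [102005, 100, 2005]
    }
}
```

A real filter would also have to decide what happens to position increments and offsets of the removed tokens, which this sketch ignores.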
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845856#comment-16845856 ]

Jim Ferenczi commented on LUCENE-8784:

Hi [~danmuzi],

I don't think we should have one option for every punctuation type, and the current check in the patch, based on Character.OTHER_PUNCTUATION, would match more than just the full stop character.

If we want to preserve punctuation, we can add the same option as Kuromoji (discardPunctuation) and output a token for each punctuation group. So for an input like "10.1?" we would output 4 tokens: "10", ".", "1", "?". Then, if you need to "regroup" tokens based on additional rules, you can add another filter to do this, like the JapaneseNumberFilter does.

The other option would be to detect numbers with decimal points accurately, like the standard tokenizer does, but we don't want to reinvent the wheel either. If we want the same grouping for unknown words in this tokenizer, we should probably implement it on top of the standard or ICU tokenizer directly.
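The two-step idea in the comment above (emit one token per punctuation group, then regroup in a later filter) can be sketched with plain strings. This is a simplified model, not Nori's dictionary-based segmentation: the `isPunct` rule, the class name, and the decimal-regrouping rule are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationGroupSketch {

    // Simplified assumption: anything that is neither a letter, a digit, nor whitespace is punctuation.
    static boolean isPunct(char c) {
        return !Character.isLetterOrDigit(c) && !Character.isWhitespace(c);
    }

    // Splits text into runs of word characters and runs of punctuation.
    // Whitespace separates tokens and is never emitted in this sketch.
    static List<String> tokenize(String text, boolean discardPunctuation) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            boolean punct = isPunct(c);
            int start = i;
            // Extend the run while the character class stays the same.
            while (i < text.length()
                    && !Character.isWhitespace(text.charAt(i))
                    && isPunct(text.charAt(i)) == punct) {
                i++;
            }
            if (!(punct && discardPunctuation)) {
                tokens.add(text.substring(start, i));
            }
        }
        return tokens;
    }

    static boolean isDigits(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isDigit);
    }

    // Second step: merges the sequence digits, ".", digits into one decimal token.
    static List<String> regroupDecimals(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 2 < tokens.size()
                    && isDigits(tokens.get(i))
                    && tokens.get(i + 1).equals(".")
                    && isDigits(tokens.get(i + 2))) {
                out.add(tokens.get(i) + "." + tokens.get(i + 2));
                i += 3;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("10.1?", true));                    // prints [10, 1]
        System.out.println(tokenize("10.1?", false));                   // prints [10, ., 1, ?]
        System.out.println(regroupDecimals(tokenize("10.1?", false)));  // prints [10.1, ?]
    }
}
```

Keeping the regrouping in a separate step is what lets the tokenizer stay generic: it only needs one boolean option, and number-specific rules live in their own filter.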
[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.
[ https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845015#comment-16845015 ]

Namgyu Kim commented on LUCENE-8784:

Hi [~jim.ferenczi] and [~Munkyu],

I uploaded a patch for this issue.
I only changed the Tokenizer and TokenizerFactory, and did not touch the Analyzer.
In the Japanese case, the Analyzer cannot be customized this way. (discardPunctuation is always true.)
If necessary, we can easily add the option to the Analyzer.

However, I have a question.
The current patch passes the option as a parameter on every call to the _isPunctuation_ method.
If the method were not static, we would not have to pass the parameter every time.
What do you think about making the _isPunctuation_ method non-static?
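The static versus non-static question above can be made concrete with a small sketch. The patch itself is not shown in this thread, so everything here is hypothetical: the class names, the `keepDecimalPoint` flag, and the exact set of character types treated as punctuation are assumptions chosen only to contrast the two shapes.

```java
public class IsPunctuationSketch {

    // Shared character test based on Unicode general categories (an assumed, reduced set).
    static boolean isPunctuationChar(char ch) {
        switch (Character.getType(ch)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Variant A: static helper. Every caller has to pass the option along.
    static final class StaticStyle {
        static boolean isPunctuation(char ch, boolean keepDecimalPoint) {
            if (keepDecimalPoint && ch == '.') {
                return false;
            }
            return isPunctuationChar(ch);
        }
    }

    // Variant B: instance method. The option is stored once and read from the field.
    static final class InstanceStyle {
        private final boolean keepDecimalPoint;

        InstanceStyle(boolean keepDecimalPoint) {
            this.keepDecimalPoint = keepDecimalPoint;
        }

        boolean isPunctuation(char ch) {
            if (keepDecimalPoint && ch == '.') {
                return false;
            }
            return isPunctuationChar(ch);
        }
    }

    public static void main(String[] args) {
        System.out.println(StaticStyle.isPunctuation('.', true));        // prints false
        System.out.println(new InstanceStyle(true).isPunctuation('.'));  // prints false
        System.out.println(new InstanceStyle(true).isPunctuation('?'));  // prints true
    }
}
```

The trade-off is the usual one: the static form is easy to call from anywhere but threads the flag through every call site, while the instance form ties the check to one configured object.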