[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-28 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849559#comment-16849559
 ] 

ASF subversion and git services commented on LUCENE-8784:
-

Commit bf0d6fad4294b1d0e7c17b819be0f730d4e59c38 in lucene-solr's branch 
refs/heads/branch_8x from Jim Ferenczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bf0d6fa ]

LUCENE-8784: Restore the Korean's part of speech tag for NGRAM.

The part of speech tag for unigram has been changed inadvertenly in a previous 
commit (not released).
This change restores the original value that is also set on the serialized 
unkwnown dictionary.


>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-28 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849557#comment-16849557
 ] 

ASF subversion and git services commented on LUCENE-8784:
-

Commit db334c792bedb2b385f72098a6a61458a05667e3 in lucene-solr's branch 
refs/heads/master from Jim Ferenczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=db334c7 ]

LUCENE-8784: Restore the Korean's part of speech tag for NGRAM.

The part of speech tag for unigram has been changed inadvertenly in a previous 
commit (not released).
This change restores the original value that is also set on the serialized 
unkwnown dictionary.


>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-27 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849000#comment-16849000
 ] 

Namgyu Kim commented on LUCENE-8784:


Oh, I checked that discardPunctuation is removed from KoreanAnalyzer.

Thank you very much for applying my patch! [~jim.ferenczi] :D

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-27 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848937#comment-16848937
 ] 

ASF subversion and git services commented on LUCENE-8784:
-

Commit 990f62aca47fc0571c79e80f25ee1d0b7dc4a824 in lucene-solr's branch 
refs/heads/branch_8x from Namgyu Kim
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=990f62a ]

LUCENE-8784: The KoreanTokenizer now preserves punctuations if 
discardPunctuation is set to false (defaults to true).

Signed-off-by: Namgyu Kim 
Signed-off-by: jimczi 


>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-27 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848925#comment-16848925
 ] 

ASF subversion and git services commented on LUCENE-8784:
-

Commit a556925eb8aaea61e55010a6e5c90520139b56c3 in lucene-solr's branch 
refs/heads/master from Namgyu Kim
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a556925 ]

LUCENE-8784: The KoreanTokenizer now preserves punctuations if 
discardPunctuation is set to false (defaults to true).

Signed-off-by: Namgyu Kim 
Signed-off-by: jimczi 


>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-24 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847706#comment-16847706
 ] 

Namgyu Kim commented on LUCENE-8784:


Thanks, [~jim.ferenczi]! :D
If there is something wrong, I would appreciate it if you let me know.

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-24 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847695#comment-16847695
 ] 

Jim Ferenczi commented on LUCENE-8784:
--

The last patch for this issue looks good to me. I'll test locally and merge if 
all tests pass. 

Thanks for opening LUCENE-8812, I'll take a look when this issue gets merged.

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-24 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847659#comment-16847659
 ] 

Namgyu Kim commented on LUCENE-8784:


It's a good idea :D

I linked LUCENE-8784(discardPunctuation) with LUCENE-8812(KoreanNumberFilter).
(Apply LUCENE-8784 *first* and then LUCENE-8812)
Your suggestion made this issue cleaner.

In LUCENE-8784,
I did not change the existing TCs and just added new TCs for discardPunctuation.
(remain the current constructor to provide an existing API)

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-24 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847381#comment-16847381
 ] 

Jim Ferenczi commented on LUCENE-8784:
--

{quote}

By the way, would not it be better to leave the constructors that do not use 
discardPunctuation parameters?
(Existing Nori users have to modify the code after uploading)

{quote}

Yes we should do that, otherwise it's a breaking change and we cannot push to 
8x.

{quote}

I also added Javadoc for discardPunctuation in your patch. (KoreanAnalyzer, 
KoreanTokenizerFactory)

{quote}

thanks!

{quote}

I developed KoreanNumberFilter by referring to JapaneseNumberFilter.
Please check my patch :D

{quote}

The patch looks good but we should iterate on this in a new issue. We try to do 
one feature at a time in a single issue so let's add discardPunctuation in this 
one and we can open a new one as a follow up to add the KoreanNumberFilter ?

 

 

 

 

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-23 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846877#comment-16846877
 ] 

Namgyu Kim commented on LUCENE-8784:


Oh, I forgot.
I also added Javadoc for discardPunctuation in your patch. (KoreanAnalyzer, 
KoreanTokenizerFactory)

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-23 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846871#comment-16846871
 ] 

Namgyu Kim commented on LUCENE-8784:


Thank you for your reply, [~jim.ferenczi]!

Your approach looks awesome.
I developed KoreanNumberFilter by referring to JapaneseNumberFilter.
Please check my patch :D
(use "git apply --whitespace=fix LUCENE-8784.patch" because of trailing 
whitespace error :()

I did not set KoreanNumberFilter as the default filter in KoreanAnalyzer.
By the way, would not it be better to leave the constructors that do not use 
discardPunctuation parameters?
(Existing Nori users have to modify the code after uploading)


>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-22 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845933#comment-16845933
 ] 

Namgyu Kim commented on LUCENE-8784:


Thank you for your reply, [~jim.ferenczi] :D

 

I tried to process only "." character in Tokenizer.
 Because Korean is a language that can have a whitespace in sentence, but 
Japanese is not.
 (Character.OTHER_PUNCTUATION would match more than just the full stop 
character. => Right. That's a problem. I have to change that part...)

 

JapaneseTokenizer keeps whitespace when using the discardPunctuation option.
 (example : "十万二千五 百 二千五" (means "102005 100 2005")
 If we run the JapaneseTokenizer with discardPunctuation=false and 
JapaneseNumberFilter, we get:
{"102005", " ", "100", " ", "2005"})

 

Of course we can do it with StopFilter or internal processing in other Filter, 
but is it okay..?

 

Developing a NumberFilter looks much more flexible and structurally beautiful 
rather than internal processing in Tokenizer.
 But I have developed like this because of the above problems, how can we 
handle those spaces?

 

I think there are several ways to handle this problem:
 1) Remove whitespace from Punctuation list in Tokenizer.
 2) Use a TokenFilter to remove whitespace.
 3) Remove whitespace from KoreanNumberFilter. (looks structurally strange...)
 4) Just leave whitespace

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-22 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845856#comment-16845856
 ] 

Jim Ferenczi commented on LUCENE-8784:
--

Hi [~danmuzi],
I don't think we should have one option for every punctuation type and the 
current check in the patch based on Character.OTHER_PUNCTUATION would match 
more than just the full stop character. If we want to preserve punctuations we 
can add the same option than for Kuromoji (discardPunctuation) and output a 
token for each punctuation group. So for an input like "10.1?" we would output 
4 tokens: "10", ".", "1", "?". Then if you need to "regroup" tokens based on 
additional rules you can add another filter to do this like the 
JapaneseNumberFilter does. The other option would be to detect numbers with 
decimal points accurately like the standard tokenizer does but we don't want to 
reinvent the wheel either. If we want the same grouping for unknown words in 
this tokenizer we should probably implement it on top of the standard or ICU 
tokenizer directly. 
.

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-21 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845015#comment-16845015
 ] 

Namgyu Kim commented on LUCENE-8784:


Hi. [~jim.ferenczi] and [~Munkyu].
I uploaded a patch for this issue.

I only worked about Tokenizer and TokenizerFactory, and did not work about 
Analyzer.
In the case of Japanese, it could not be customized. (discardPunctuation is 
always true)
If necessary, we can easily add it to Analyzer.

However, I have a question now.
The current patch was developed in such a way that it continues to pass 
parameters. (in _isPunctuation_ method)
If we don't use the static method, we don't have to pass the parameters every 
time.

What do you think about disabling static in the _isPunctuation_ method?

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned to 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> unlike standard analyzer, nori analyzer removes the decimal point.
> nori tokenizer removes "." character by default.
>  In this case, it is difficult to index the keywords including the decimal 
> point.
> It would be nice if there had the option whether add a decimal point or not.
> Like Japanese tokenizer does,  Nori need an option to preserve decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org