[ https://issues.apache.org/jira/browse/LUCENE-8631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yeongsu Kim updated LUCENE-8631:
--------------------------------
Description: 

I think the Nori tokenizer has one issue.

I don't understand why "Longest-Matching" is NOT applied by the Nori tokenizer when the user dictionary is added via config mode (config mode: [https://www.elastic.co/guide/en/elasticsearch/plugins/6.x/analysis-nori-tokenizer.html]).

Here is an example of what longest matching means. Assume we have a `userdict_ko.txt` containing only three Korean words, ‘골드’, ‘브라운’, and ‘골드브라운’, and register it with the Nori analyzer. After the update, the input ‘골드브라운’ is tokenized into two tokens, ‘골드’ and ‘브라운’. (In English: ‘골드’ means ‘gold’, ‘브라운’ means ‘brown’, and ‘골드브라운’ means ‘goldbrown’.)

This result shows that "Longest-Matching" is NOT working. If it were, the output would be ‘골드브라운’, which is the longest matching entry in the user dictionary.
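For reference, the behaviour can be reproduced directly against the Lucene Nori module with a sketch like the one below. This assumes the Lucene 7.x/8.x Nori API (the four-argument `KoreanTokenizer` constructor and `UserDictionary.open(Reader)`; exact signatures may differ between versions), and the class name and dictionary contents are only illustrative.

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.ko.dict.UserDictionary;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NoriLongestMatchRepro {
  public static void main(String[] args) throws Exception {
    // Same three entries as the userdict_ko.txt in the example (one per line).
    UserDictionary userDict = UserDictionary.open(
        new StringReader("골드\n브라운\n골드브라운\n"));

    // DecompoundMode.NONE so that only the segmentation itself is printed.
    KoreanTokenizer tokenizer = new KoreanTokenizer(
        TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, userDict,
        KoreanTokenizer.DecompoundMode.NONE, false);
    tokenizer.setReader(new StringReader("골드브라운"));

    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // Observed output: '골드' then '브라운', not the single longest entry '골드브라운'.
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{code}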
Curiously enough, when we add the user dictionary via custom mode (custom mode: [https://github.com/jimczi/nori/blob/master/how-to-custom-dict.asciidoc]), the result is ‘골드브라운’, i.e. "Longest-Matching" is applied. We think the reason is that the trained MeCab engine automatically generates word costs according to its own criteria. We hope this mechanism can also be applied to config mode.

Would you tell me how to get "Longest-Matching" behaviour via config mode (not custom mode), or give me some hints (e.g. where to modify the source code) to solve this problem?

P.S.
Recently I mailed [~jim.ferenczi], who is a developer of Nori, and received these suggestions:
 - Add a way to set a score to each new rule (this way you could set up a negative cost for the compound word that is less than the sum of the costs of the two single words); see the cost sketch below.
 - Same as above, but the cost is computed from the statistics of the training (like the custom dictionary does when you recompile it entirely).
 - Implement longest-match-first in the dictionary.
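To make the cost intuition concrete: if (and this is our assumption about config mode, not something confirmed in the code) every user-dictionary entry receives the same fixed word cost, then the two-token split scores better than the single compound entry in the Viterbi search, which would explain the output above. The toy sketch below walks through that arithmetic and through the first suggestion; all numbers are made up for illustration and connection costs are ignored.

{code:java}
public class UserDictCostSketch {
  public static void main(String[] args) {
    // Assumption: every config-mode user entry gets the same fixed word cost.
    int fixedUserWordCost = -100_000;        // hypothetical per-entry cost

    int splitPath = 2 * fixedUserWordCost;   // 골드 + 브라운 -> -200,000
    int compoundPath = fixedUserWordCost;    // 골드브라운    -> -100,000

    // The lower (more negative) total cost wins the Viterbi search, so the
    // two-token split is preferred and the longest match is lost.
    System.out.println("split wins: " + (splitPath < compoundPath));          // true

    // Suggestion 1: allow a per-rule cost. If the compound's cost is pushed
    // below the sum of the two single words, the longest match wins instead.
    int tunedCompoundCost = -250_000;        // hypothetical user-supplied cost
    System.out.println("compound wins: " + (tunedCompoundCost < splitPath));  // true
  }
}
{code}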
Thanks for your support.


> How Nori Tokenizer can deal with Longest-Matching
> -------------------------------------------------
>
>                 Key: LUCENE-8631
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8631
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Yeongsu Kim
>            Priority: Major
>


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org