[jira] [Comment Edited] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

Tomoko Uchida (JIRA) Thu, 30 May 2019 19:34:19 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852595#comment-16852595
 ]


Tomoko Uchida edited comment on LUCENE-8816 at 5/31/19 2:33 AM:
----------------------------------------------------------------

Thank you all for your comments and suggestions.

To be honest, with my limited knowledge of the code so far, merging the two 
tokenizers looks quite difficult to me, because they seem to have evolved 
substantially and independently. So I thought we could share only 
DictionaryBuilder between the Kuromoji and Nori modules, as I wrote in the 
previous comment. However, if the tokenizers can be combined in the future, 
that would be a great unification.

For the time being, I will work on JapaneseTokenizer, and I welcome 
implementation advice from the viewpoint of possible future integration.


was (Author: tomoko uchida):
Thank you all for your comments and suggestions.

To be honest, with my limited knowledge of the code so far, merging the two 
tokenizers looks quite difficult to me, because they seem to have evolved 
substantially and independently. So I thought we could share only 
DictionaryBuilder between Kuromoji and Nori, as I wrote in the previous 
comment. However, if it can be done in the future, that would be a great 
unification.

For the time being, I will work on JapaneseTokenizer, and I welcome 
implementation advice from the viewpoint of possible future integration.

> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
>                 Key: LUCENE-8816
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8816
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>
> I was inspired by this mailing list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for 
> many years. While it has slowly become obsolete, well-maintained and/or 
> extended dictionaries have emerged in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). Several attempts/projects/efforts 
> have been made in Japan to use them with Kuromoji.
> However, the current architecture (the dictionary bundled in the jar) is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel that it is a good time to rethink the current architecture.
> Also, this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself from the encoded system dictionary.
>  * Implement a dynamic dictionary loading mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>   * Provide a learner or language model (that is up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea at this moment whether 
> this will be easy or difficult.
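
To make the first two goals above a little more concrete, here is a rough sketch of what "decoupled" could mean. It is only an illustration under assumptions: SystemDictionary, InMemoryDictionary, DictionaryRegistry and SketchTokenizer are hypothetical names invented for this example, not existing Lucene or Kuromoji classes, and the real dictionary format and Viterbi logic are left out entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical abstraction of an encoded system dictionary (not a Lucene API).
interface SystemDictionary {
    String name();
    boolean contains(String surface);
}

// Toy dictionary backed by an in-memory set; a real one would read encoded files.
final class InMemoryDictionary implements SystemDictionary {
    private final String name;
    private final Set<String> entries;

    InMemoryDictionary(String name, Set<String> entries) {
        this.name = name;
        this.entries = entries;
    }

    @Override public String name() { return name; }
    @Override public boolean contains(String surface) { return entries.contains(surface); }
}

// Maps a dictionary name to a loader, so the dictionary is chosen at run time.
final class DictionaryRegistry {
    private final Map<String, Supplier<SystemDictionary>> loaders = new HashMap<>();

    void register(String name, Supplier<SystemDictionary> loader) {
        loaders.put(name, loader);
    }

    SystemDictionary load(String name) {
        Supplier<SystemDictionary> loader = loaders.get(name);
        if (loader == null) {
            throw new IllegalArgumentException("Unknown dictionary: " + name);
        }
        return loader.get();
    }
}

// Stand-in for the analysis engine: it depends only on the interface.
final class SketchTokenizer {
    private final SystemDictionary dictionary;

    SketchTokenizer(SystemDictionary dictionary) {
        this.dictionary = dictionary;
    }

    String dictionaryName() { return dictionary.name(); }
    boolean isKnownWord(String surface) { return dictionary.contains(surface); }
}
```

With something like this, switching the system dictionary would come down to registering another loader and passing the loaded dictionary to the tokenizer, without rebuilding the tokenizer jar. Whether the actual code can be split along such a line is exactly what still needs to be investigated.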



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)