[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861672#comment-16861672
 ] 

Tomoko Uchida commented on LUCENE-8816:
---------------------------------------

I'd like to add some more information: the leftID (and rightID) is tied to the 
POS tags, and in practice there are not that many POS tag variations. I think 
the current constraint {{leftId < 4096}} (or {{leftId < 8191}}, if it can be 
easily changed) is perfectly okay if the following conditions are met.

1. The dictionary learner/re-trainer program included in the mecab-ipadic 
devtool does not generate leftID (and rightID) values of 4096 (or 8191) or 
larger.
2. UniDic (I'd like to support this dictionary in this issue, as I wrote in the 
issue description) has no leftID (and rightID) values of 4096 (or 8191) or 
larger.
3. The few well-known variants of mecab-ipadic or UniDic do not have leftID 
(and rightID) values of 4096 (or 8191) or larger.
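
A minimal sketch of how the three conditions above could be checked against a 
dictionary source file. It assumes MeCab-style CSV rows whose first four 
columns are surface form, left ID, right ID and cost, and it splits naively on 
commas (so a quoted surface form containing a comma would need a real CSV 
parser). The class and method names are hypothetical and the sample rows are 
invented for illustration; nothing here is taken from Lucene or MeCab code.

```java
import java.util.List;

final class ContextIdRangeCheck {

  /** Returns the largest left/right context ID found, or -1 if there are no rows. */
  static int maxContextId(List<String> csvLines) {
    int max = -1;
    for (String line : csvLines) {
      if (line.isBlank()) {
        continue; // skip empty lines
      }
      // Assumption: no quoted commas; columns are surface,leftId,rightId,cost,...
      String[] cols = line.split(",");
      int leftId = Integer.parseInt(cols[1].trim());
      int rightId = Integer.parseInt(cols[2].trim());
      max = Math.max(max, Math.max(leftId, rightId));
    }
    return max;
  }

  /** True if every context ID is strictly below the given exclusive limit. */
  static boolean fitsLimit(List<String> csvLines, int exclusiveLimit) {
    return maxContextId(csvLines) < exclusiveLimit;
  }

  public static void main(String[] args) {
    // Invented sample rows, not real dictionary entries.
    List<String> sample = List.of(
        "alpha,10,12,100,noun",
        "beta,4100,7,200,verb");
    System.out.println("max context ID: " + maxContextId(sample));
    System.out.println("fits 4096: " + fitsLimit(sample, 4096));
    System.out.println("fits 8191: " + fitsLimit(sample, 8191));
  }
}
```

Running the same check with both candidate limits would show whether the 
larger limit is needed at all for a given dictionary.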

Give me some time to examine whether we need to reconsider the constraint. 
(It's just a guess, but the original MeCab is itself performance-conscious 
software, so it could have similar restrictions on its dictionary format.) At 
least for point 3, I think I can talk with the dictionary developers before 
tackling the Lucene code, if needed.

There is another possibility: users could assign large values to leftIDs (and 
rightIDs) by hand in their customized dictionaries. However, I don't think we 
need to handle that case. I have no idea about Korean dictionaries.

I agree that it would be better to change all the assertions to exceptions so 
that users can figure out the problem with their customized dictionary.
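
To illustrate the difference, here is a rough sketch of a range check that 
throws instead of asserting. The class name, method name and message format 
are hypothetical, and the limit of 4096 is simply the constraint quoted above; 
this is not the actual Lucene code. Unlike an {{assert}}, which is skipped 
unless the JVM runs with -ea, the check always runs and reports the offending 
entry.

```java
final class ContextIdValidator {

  // Assumed exclusive upper bound, taken from the constraint discussed above.
  static final int MAX_CONTEXT_ID_EXCLUSIVE = 4096;

  /**
   * Returns the ID unchanged if it is in range; otherwise throws an exception
   * that names the field, the value and the dictionary entry it came from.
   */
  static int checkContextId(String fieldName, int id, String entry) {
    if (id < 0 || id >= MAX_CONTEXT_ID_EXCLUSIVE) {
      throw new IllegalArgumentException(
          fieldName + " " + id + " is out of range [0, " + MAX_CONTEXT_ID_EXCLUSIVE
              + ") in dictionary entry: " + entry);
    }
    return id;
  }

  public static void main(String[] args) {
    // Invented entry, used only to show the error message.
    String entry = "beta,4100,7,200,verb";
    try {
      checkContextId("leftId", 4100, entry);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```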

> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
>                 Key: LUCENE-8816
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8816
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>
> I was inspired by this mailing list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for 
> many years. While it has slowly become obsolete, well-maintained and/or 
> extended dictionaries have emerged in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). Several attempts/projects/efforts to 
> use them with Kuromoji have been made in Japan.
> However, the current architecture (a dictionary-bundled jar) is essentially 
> incompatible with the idea of switching the system dictionary, and developers 
> have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel that it's a good time to rethink the current architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer from the encoded system dictionary.
>  * Implement a dynamic dictionary loading mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>   * Provide a learner or language model (that's up to users and is outside 
> the scope).
> I have not dived into the code yet, so I have no idea whether this is easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
