[ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867977#comment-16867977 ]
Mike Sokolov commented on LUCENE-8816:
--------------------------------------

LUCENE-8871 opened to cover moving the dictionary builder tools into the main kuromoji source tree, mostly so they get tested properly.

> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
>                 Key: LUCENE-8816
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8816
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>
> I was inspired by this mailing-list thread:
>
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
>
> As many Japanese users already know, the default built-in dictionary bundled
> with Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for
> many years. While it has slowly become obsolete, well-maintained and/or
> extended dictionaries have emerged in recent years (e.g.
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd],
> [UniDic|https://unidic.ninjal.ac.jp/]). Some attempts/projects/efforts have
> been made in Japan to use them with Kuromoji.
> However, the current architecture (the dictionary bundled in the jar) is
> essentially incompatible with the idea of "switching the system dictionary",
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the
> encoded dictionary (language model) have been decoupled (as in MeCab, the
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea,
> and I feel that it's a good time to rethink the current architecture.
> Also, this would be good for advanced users who have customized/re-trained
> their own system dictionary.
> Goals of this issue:
> * Decouple JapaneseTokenizer itself from the encoded system dictionary.
> * Implement a dynamic dictionary load mechanism.
> * Provide a developer-oriented dictionary build tool.
> Non-goals:
> * Provide a learner or language model (that is up to users and should be
> outside the scope).
> I have not dug into the code yet, so I have no idea whether it's easy or
> difficult at this moment.