[ https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435420#comment-16435420 ]
Jim Ferenczi commented on LUCENE-8231:
--------------------------------------

Thanks Robert. I attached a new patch that changes the enum to attach a description to each tag and reflects the description in the javadoc comments. The toString reflection now returns the description of the POS tag:

{noformat}
KoreanTokenizer@22f9baea term=평창,bytes=[ed 8f 89 ec b0 bd],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=MORPHEME,leftPOS=NNP(Proper Noun),rightPOS=NNP(Proper Noun),morphemes=null,reading=null
{noformat}

... and the compounds are correctly rendered:

{noformat}
KoreanTokenizer@292528fd term=가락지나물,bytes=[ea b0 80 eb 9d bd ec a7 80 eb 82 98 eb ac bc],startOffset=0,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=COMPOUND,leftPOS=NNG(General Noun),rightPOS=NNG(General Noun),morphemes=가락지/NNG(General Noun)+나물/NNG(General Noun),reading=null
{noformat}

I also changed the format for the pre-analysis tokens: they are now compressed using the same technique as for compounds, which gives another 2MB improvement over the last patch.
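As an illustration of the change described in this comment, here is a minimal sketch of an enum whose constants carry a human-readable description that is also surfaced through toString. The enum name (Tag) and the accessor (description()) are assumptions for illustration and may not match the attached patch exactly; the NNG/NNP constants and their descriptions are taken from the token output above.

{code:java}
// Illustration only: the enum name (Tag) and accessor (description()) are
// assumptions and may not match the attached patch exactly.
public enum Tag {
  /** General noun. */
  NNG("General Noun"),
  /** Proper noun. */
  NNP("Proper Noun");
  // ... remaining POS tags omitted

  private final String description;

  Tag(String description) {
    this.description = description;
  }

  /** Returns the human-readable description of this part-of-speech tag. */
  public String description() {
    return description;
  }

  @Override
  public String toString() {
    // Rendered as e.g. "NNP(Proper Noun)", matching the token output above.
    return name() + "(" + description + ")";
  }
}
{code}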
> Nori, a Korean analyzer based on mecab-ko-dic
> ---------------------------------------------
>
>                 Key: LUCENE-8231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8231
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Jim Ferenczi
>            Priority: Major
>         Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to the IPADIC but for Korean, called mecab-ko-dic. It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab; it defines a format for the features adapted to the Korean language.
>
> Since the Kuromoji tokenizer uses the same format for the morphological analysis (left cost + right cost + word cost), I tried to adapt the module to handle Korean with the mecab-ko-dic. I started with a POC that copies the Kuromoji module and adapts it to the mecab-ko-dic. I used the same classes to build and read the dictionary, but I had to make some modifications to handle the differences with the IPADIC and Japanese. The resulting binary dictionary takes 28MB on disk; it is bigger than the IPADIC, but mainly because the source is bigger and there are a lot of compound and inflected terms that define a group of terms and the segmentation that can be applied to them.
>
> I attached the patch that contains this new Korean module, called -godori- nori. It is an adaptation of the Kuromoji module, so currently the two modules don't share any code. I wanted to validate the approach first and check the relevancy of the results. I don't speak Korean, so I used the relevancy tests that were added for another Korean tokenizer (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output against mecab-ko, which is the official fork of MeCab for use with the mecab-ko-dic.
>
> I had to simplify the JapaneseTokenizer: my version removes the nBest output and the decomposition of overly long tokens. I also modified the handling of whitespaces since they are important in Korean. Whitespaces that appear before a term are attached to that term, and this information is used to compute a penalty based on the part of speech of the token. The penalty cost is a feature added to mecab-ko to handle morphemes that should not appear after a morpheme, and it is described on the mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more in line with the official MeCab library, which attaches the whitespaces to the term that follows.
>
> I also added a decompounder filter that expands the compounds and inflections defined in the dictionary, and a part-of-speech filter, similar to the Japanese one, that removes the morphemes that are not useful for relevance (suffix, prefix, interjection, ...). These filters don't play well with the tokenizer if it can output multiple paths (nBest output for instance), so for simplicity I removed this ability and the Korean tokenizer only outputs the best path.
>
> I compared the results with mecab-ko to confirm that the analyzer is working, and I ran the relevancy test defined in HantecRel.java, included in the patch (written by Robert for another Korean analyzer). Here are the results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
>
> I find the results very promising, so I plan to continue working on this project. I started to extract the parts of the code that could be shared with the Kuromoji module, but I wanted to share the status and this POC first to confirm that the approach is viable. The advantages of using the same model as the Japanese analyzer are multiple: we don't have a Korean analyzer at the moment ;), the resulting dictionary is small compared to other libraries that use the mecab-ko-dic (the FST takes only 5.4MB), and the tokenizer prunes the lattice on the fly to select the best path efficiently.
>
> The dictionary can be built directly from the godori module with the following command:
> ant regenerate
> (you need to create the resource directory first, mkdir lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict, since the dictionary is not included in the patch).
>
> I've also added some minimal tests in the module to play with the analysis.
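For anyone who wants to play with the patch, below is a minimal usage sketch that consumes the tokenizer output through the standard Lucene TokenStream API. The package (org.apache.lucene.analysis.ko), the no-argument KoreanTokenizer constructor, and the sample input are assumptions based on the comment and description above and may differ from what is in the attached patch.

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer; // assumed package/constructor, see note above
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NoriUsageSketch {
  public static void main(String[] args) throws Exception {
    // The tokenizer only outputs the best path (no nBest), as described above.
    Tokenizer tokenizer = new KoreanTokenizer(); // assumed no-arg constructor
    tokenizer.setReader(new StringReader("평창"));

    // The decompound and part-of-speech filters mentioned in the description
    // could be wrapped around the tokenizer here before consuming the stream.
    try (TokenStream stream = tokenizer) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        System.out.println(term.toString());
      }
      stream.end();
    }
  }
}
{code}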