[ https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947791#comment-16947791 ]
Jun Ohtani commented on LUCENE-4056:
------------------------------------

I made a pull request on the GitHub repo:
https://github.com/apache/lucene-solr/pull/935

> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
> ------------------------------------------------------------
>
>                 Key: LUCENE-4056
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4056
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6
>         Environment: Solr 3.6
>                      UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
>            Reporter: Kazuaki Hiraga
>            Priority: Major
>         Attachments: LUCENE-4056.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6.
> I think UniDic is a better dictionary than the IPA dictionary, so Kuromoji
> for Lucene/Solr should support the UniDic dictionary as standalone Kuromoji
> does.
>
> The following is my procedure:
> I modified build.xml under the lucene/contrib/analyzers/kuromoji directory
> and ran 'ant build-dict'. I got the error below.
>
> build-dict:
>      [java] dictionary builder
>      [java]
>      [java] dictionary format: UNIDIC
>      [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
>      [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
>      [java] input encoding: utf-8
>      [java] normalize entries: false
>      [java]
>      [java] building tokeninfo dict...
>      [java]   parse...
>      [java]   sort...
>      [java] Exception in thread "main" java.lang.AssertionError
>      [java]   encode...
>      [java]     at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
>      [java]     at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
>      [java]     at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
>      [java]     at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
>      [java]     at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
>
> And the diff of build.xml:
> ===================================================================
> --- build.xml (revision 1338023)
> +++ build.xml (working copy)
> @@ -28,19 +28,31 @@
>    <property name="maven.dist.dir" location="../../../dist/maven" />
>  
>    <!-- default configuration: uses mecab-ipadic -->
> +  <!--
>    <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" value="http://mecab.googlecode.com/files/${dict.src.file}"/>
> +  -->
>  
>    <!-- alternative configuration: uses mecab-naist-jdic
>    <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" value="http://sourceforge.jp/frs/redir.php?m=iij&f=/naist-jdic/53500/${dict.src.file}"/>
>    -->
> -
> +
> +  <!-- alternative configuration: uses UniDic -->
> +  <property name="ipadic.version" value="unidic-mecab1312src" />
> +  <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
> +  <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
> +
>    <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
> +  <!--
>    <property name="dict.encoding" value="euc-jp"/>
>    <property name="dict.format" value="ipadic"/>
> +  -->
> +  <property name="dict.encoding" value="utf-8"/>
> +  <property name="dict.format" value="unidic"/>
> +
>    <property name="dict.normalize" value="false"/>
>    <property name="dict.target.dir" location="./src/resources"/>
>  
> @@ -58,7 +70,8 @@
>  
>    <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
>    <target name="download-dict" unless="dict.available">
> -    <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
> +    <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
> +    <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
>      <gunzip src="${build.dir}/${dict.src.file}"/>
>      <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
>    </target>