[ 
https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826875#comment-16826875
 ] 

Tomoko Uchida commented on LUCENE-4056:
---------------------------------------

[~h.kazuaki]: Thanks. Do you have a patch for this? I think we can work 
together on it. Even if it isn't merged into Lucene master, having the patch 
available here would be valuable for users.

The mecab-ipadic dictionary will become obsolete, which sometimes affects 
search quality and is a recurring pain point for search engineers in Japan. 
UniDic and its extensions, by contrast, are still actively maintained to keep 
up with changes in the language. Of course, it would be much better if we 
could provide substantial evidence here.

 

> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
> ------------------------------------------------------------
>
>                 Key: LUCENE-4056
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4056
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6
>         Environment: Solr 3.6
> UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
>            Reporter: Kazuaki Hiraga
>            Priority: Major
>
> I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I 
> think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for 
> Lucene/Solr should support the UniDic dictionary as standalone Kuromoji does.
> The following is my procedure:
> I modified build.xml under the lucene/contrib/analyzers/kuromoji directory 
> and ran 'ant build-dict', which produced the error below.
> build-dict:
>      [java] dictionary builder
>      [java] 
>      [java] dictionary format: UNIDIC
>      [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
>      [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
>      [java] input encoding: utf-8
>      [java] normalize entries: false
>      [java] 
>      [java] building tokeninfo dict...
>      [java]   parse...
>      [java]   sort...
>      [java] Exception in thread "main" java.lang.AssertionError
>      [java]   encode...
>      [java]   at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
>      [java]   at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
>      [java]   at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
>      [java]   at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
>      [java]   at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> And the diff of build.xml:
> ===================================================================
> --- build.xml (revision 1338023)
> +++ build.xml (working copy)
> @@ -28,19 +28,31 @@
>    <property name="maven.dist.dir" location="../../../dist/maven" />
>  
>    <!-- default configuration: uses mecab-ipadic -->
> +  <!--
>    <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" value="http://mecab.googlecode.com/files/${dict.src.file}"/>
> +  -->
>  
>    <!-- alternative configuration: uses mecab-naist-jdic
>    <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" value="http://sourceforge.jp/frs/redir.php?m=iij&amp;f=/naist-jdic/53500/${dict.src.file}"/>
>    -->
> -  
> +
> +  <!-- alternative configuration: uses UniDic -->
> +  <property name="ipadic.version" value="unidic-mecab1312src" />
> +  <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
> +  <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
> +
>    <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
> +  <!--
>    <property name="dict.encoding" value="euc-jp"/>
>    <property name="dict.format" value="ipadic"/>
> +  -->
> +  <property name="dict.encoding" value="utf-8"/>
> +  <property name="dict.format" value="unidic"/>
> +
>    <property name="dict.normalize" value="false"/>
>    <property name="dict.target.dir" location="./src/resources"/>
>  
> @@ -58,7 +70,8 @@
>  
>    <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
>    <target name="download-dict" unless="dict.available">
> -     <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
> +     <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
> +     <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
>       <gunzip src="${build.dir}/${dict.src.file}"/>
>       <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
>    </target>
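
The AssertionError in the quoted trace carries no message, so the failing 
condition cannot be read from the output alone. One hypothesis, not verified 
against BinaryDictionaryWriter.java in 3.6, is that the writer assumes 
something about the context IDs of each entry (for example, that the left and 
right IDs are equal, or that they stay below some bound). MeCab dictionary CSV 
lines start with surface, left context ID, right context ID, and cost, so a 
small standalone check can at least report what UniDic contains. The class 
and method names below are hypothetical and not part of Lucene; this is a 
rough diagnostic sketch, not a fix.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): summarizes context IDs in MeCab-style CSV lines. */
public class ContextIdCheck {

    /** Splits one CSV line, honoring double-quoted fields (e.g. a surface form of ","). */
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean quoted = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (quoted && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    cur.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    quoted = !quoted; // opening or closing quote
                }
            } else if (c == ',' && !quoted) {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    /** Returns {max left ID, max right ID, count of entries whose left and right IDs differ}. */
    static int[] summarize(List<String> lines) {
        int maxLeft = -1, maxRight = -1, mismatches = 0;
        for (String line : lines) {
            List<String> f = splitCsv(line);
            if (f.size() < 4) {
                continue; // not a dictionary entry (too few columns)
            }
            int left = Integer.parseInt(f.get(1).trim());
            int right = Integer.parseInt(f.get(2).trim());
            maxLeft = Math.max(maxLeft, left);
            maxRight = Math.max(maxRight, right);
            if (left != right) {
                mismatches++;
            }
        }
        return new int[] {maxLeft, maxRight, mismatches};
    }

    public static void main(String[] args) throws Exception {
        // Arguments: paths to the dictionary's .csv files (assumed to be UTF-8, as in the build log).
        List<String> all = new ArrayList<>();
        for (String path : args) {
            all.addAll(Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8));
        }
        int[] s = summarize(all);
        System.out.println("max left id: " + s[0]
            + ", max right id: " + s[1]
            + ", entries with left != right: " + s[2]);
    }
}
```

Running it over the .csv files in the unpacked unidic-mecab1312src directory 
and comparing the numbers with the condition at BinaryDictionaryWriter.java:113 
should show whether the context IDs are what trips the assertion.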



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

