Building a Kuromoji dictionary is very slow and eventually fails if you use Java 5
----------------------------------------------------------------------------------

                 Key: LUCENE-3696
                 URL: https://issues.apache.org/jira/browse/LUCENE-3696
             Project: Lucene - Java
          Issue Type: Bug
    Affects Versions: 3.6
            Reporter: Robert Muir


Note: This only affects you if you use Java 5 on 3.x, and only if you want to 
download/rebuild the dictionary. 
The analyzer itself works fine on 3.x with Java 5.

With java 6, building a kuromoji dictionary is quite fast:
{noformat}
     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java]   encode...
     [java]   53645 nodes, 253185 arcs, 1954817 bytes...   done
     [java] done
     [java] building unknown word dict...done
     [java] building connection costs...done

BUILD SUCCESSFUL
Total time: 6 seconds
{noformat}

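However, with Java 5 the build runs for about two minutes and then runs out of 
memory in the CSV parsing phase. The stack trace below shows String.replaceAll 
compiling a regex pattern inside CSVUtil.unQuoteUnEscape, which is called from 
CSVUtil.parse. String.replaceAll compiles its pattern on every call.
So we might need to optimize the CSV parser (for example, by precompiling its 
patterns).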

{noformat}
     [java] building tokeninfo dict...
     [java]   parse...
     [java] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
     [java]     at java.util.regex.Pattern.newSlice(Pattern.java:2909)
     [java]     at java.util.regex.Pattern.atom(Pattern.java:1898)
     [java]     at java.util.regex.Pattern.sequence(Pattern.java:1794)
     [java]     at java.util.regex.Pattern.expr(Pattern.java:1687)
     [java]     at java.util.regex.Pattern.compile(Pattern.java:1397)
     [java]     at java.util.regex.Pattern.<init>(Pattern.java:1124)
     [java]     at java.util.regex.Pattern.compile(Pattern.java:817)
     [java]     at java.lang.String.replaceAll(String.java:2000)
     [java]     at org.apache.lucene.analysis.kuromoji.util.CSVUtil.unQuoteUnEscape(CSVUtil.java:84)
     [java]     at org.apache.lucene.analysis.kuromoji.util.CSVUtil.parse(CSVUtil.java:55)
     [java]     at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:96)
     [java]     at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
     [java]     at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
     [java]     at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.main(DictionaryBuilder.java:82)

BUILD FAILED
/home/rmuir/workspace/lucene-branch3x2/lucene/contrib/analyzers/kuromoji/build.xml:75: Java returned: 1

Total time: 2 minutes 4 seconds
{noformat}
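
To illustrate the "precompile its patterns" idea, here is a minimal sketch. It 
is not the actual CSVUtil source: the class name CsvUnquoteSketch, the two 
pattern constants, and the exact unquoting rules are assumptions made for 
illustration. It only shows the general technique of compiling each regex once 
and skipping the regex engine for fields that contain no quote character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT the real CSVUtil: names and unquoting rules are assumed.
final class CsvUnquoteSketch {

  // Compiled once at class-load time instead of once per String.replaceAll call.
  private static final Pattern QUOTED_FIELD =
      Pattern.compile("^\"(.*)\"$", Pattern.DOTALL);
  private static final Pattern ESCAPED_QUOTE = Pattern.compile("\"\"");

  private CsvUnquoteSketch() {}

  /** Strips one pair of surrounding quotes and turns "" into ". */
  static String unQuoteUnEscape(String field) {
    // Fast path: a field without any quote needs no regex work at all.
    if (field.indexOf('"') < 0) {
      return field;
    }
    String result = field;
    Matcher quoted = QUOTED_FIELD.matcher(result);
    if (quoted.matches()) {
      result = quoted.group(1); // drop the surrounding quotes
    }
    // Reuse the precompiled pattern; no Pattern.compile happens here.
    return ESCAPED_QUOTE.matcher(result).replaceAll("\"");
  }

  public static void main(String[] args) {
    System.out.println(unQuoteUnEscape("\"a\"\"b\""));
    System.out.println(unQuoteUnEscape("plain"));
  }
}
```

Whether this alone is enough for Java 5 would still need to be measured against 
the real dictionary build; the sketch only removes the per-call pattern 
compilation that the stack trace points at.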
