[
https://issues.apache.org/jira/browse/LUCENE-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186282#comment-13186282
]
Robert Muir commented on LUCENE-3696:
-------------------------------------
With the patch:
{noformat}
[java] building tokeninfo dict...
[java] parse...
[java] sort...
[java] encode...
[java] 53645 nodes, 253185 arcs, 1954817 bytes... done
[java] done
[java] building unknown word dict...done
[java] building connection costs...done
BUILD SUCCESSFUL
Total time: 10 seconds
{noformat}
> building a kuromoji dictionary is very slow and eventually fails if you use java 5
> ----------------------------------------------------------------------------------
>
> Key: LUCENE-3696
> URL: https://issues.apache.org/jira/browse/LUCENE-3696
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.6
> Reporter: Robert Muir
> Attachments: LUCENE-3696.patch
>
>
> Note: This only affects you if you use Java 5 on 3.x and want to
> download/rebuild the dictionary.
> The analyzer itself works fine on 3.x with Java 5.
> With Java 6, building a kuromoji dictionary is quite fast:
> {noformat}
> [java] building tokeninfo dict...
> [java] parse...
> [java] sort...
> [java] encode...
> [java] 53645 nodes, 253185 arcs, 1954817 bytes... done
> [java] done
> [java] building unknown word dict...done
> [java] building connection costs...done
> BUILD SUCCESSFUL
> Total time: 6 seconds
> {noformat}
> However, with Java 5 the build runs for about two minutes and then runs out
> of memory in the CSV parsing phase.
> So we might need to optimize the CSV parser (for example, precompile its patterns).
> {noformat}
> [java] building tokeninfo dict...
> [java] parse...
> [java] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> [java] at java.util.regex.Pattern.newSlice(Pattern.java:2909)
> [java] at java.util.regex.Pattern.atom(Pattern.java:1898)
> [java] at java.util.regex.Pattern.sequence(Pattern.java:1794)
> [java] at java.util.regex.Pattern.expr(Pattern.java:1687)
> [java] at java.util.regex.Pattern.compile(Pattern.java:1397)
> [java] at java.util.regex.Pattern.<init>(Pattern.java:1124)
> [java] at java.util.regex.Pattern.compile(Pattern.java:817)
> [java] at java.lang.String.replaceAll(String.java:2000)
> [java] at org.apache.lucene.analysis.kuromoji.util.CSVUtil.unQuoteUnEscape(CSVUtil.java:84)
> [java] at org.apache.lucene.analysis.kuromoji.util.CSVUtil.parse(CSVUtil.java:55)
> [java] at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:96)
> [java] at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
> [java] at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
> [java] at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> BUILD FAILED
> /home/rmuir/workspace/lucene-branch3x2/lucene/contrib/analyzers/kuromoji/build.xml:75: Java returned: 1
> Total time: 2 minutes 4 seconds
> {noformat}
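The trace above shows String.replaceAll compiling a new Pattern on every call from CSVUtil.unQuoteUnEscape. As a rough illustration of the "precompile its patterns" idea, here is a minimal sketch. It is not the code from LUCENE-3696.patch; the class name, the unquote helper, and the doubled-quote escaping rule are all assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual CSVUtil code or the attached patch.
final class PrecompiledUnquote {

    // Compiled once when the class loads, instead of once per call as
    // String.replaceAll(regex, replacement) does internally.
    private static final Pattern DOUBLED_QUOTE = Pattern.compile("\"\"");

    // Assumed behavior: strip one pair of surrounding quotes, then collapse
    // each doubled quote ("") into a single quote (").
    static String unquote(String field) {
        if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
            field = field.substring(1, field.length() - 1);
        }
        // Fast path: most fields contain no quotes, so skip the regex entirely.
        if (field.indexOf('"') < 0) {
            return field;
        }
        return DOUBLED_QUOTE.matcher(field).replaceAll("\"");
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"say \"\"hi\"\"\""));
    }
}
```

Because the Pattern is a static final field, parsing every dictionary line reuses the same compiled object, and fields without quotes never touch the regex engine at all.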