[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799841#comment-17799841 ]
ASF GitHub Bot commented on OPENNLP-421:
----------------------------------------

kinow commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867709096

   I'm working until noon today, then I should have time to test it again :tada: (sorry for the delay)

   Trying the branch again after rebuilding it, but I keep getting

   ```java
   # Run progress: 0.74% complete, ETA 00:01:51
   # Fork: 1 of 1
   <failure>

   java.lang.NoClassDefFoundError: opennlp/tools/jmh/ExecutionPlan
       at java.base/java.lang.Class.forName0(Native Method)
       at java.base/java.lang.Class.forName(Class.java:375)
       at org.openjdk.jmh.util.ClassUtils.loadClass(ClassUtils.java:73)
   ...
   ```

> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> ------------------------------------------------------------------------------
>
>                 Key: OPENNLP-421
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-421
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>    Affects Versions: tools-1.5.2-incubating
>        Environment: RedHat 5, JDK 1.6.0_29
>           Reporter: Jay Hacker
>           Assignee: Martin Wiesner
>           Priority: Major
>             Labels: performance
>            Fix For: 2.3.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
> calls intern() on every String. Presumably this is an attempt to reduce
> memory usage for duplicate tokens. Interned Strings are stored in the JVM's
> permanent generation, which has a small fixed size (it seems to be about
> 83 MB on modern 64-bit JVMs:
> http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html).
> Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen space.
> The size of the PermGen can be increased with the -XX:MaxPermSize= option
> to the JVM.
> However, this option is non-standard and not well known, and it would be
> nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however,
> there is a huge amount of room for performance improvement. Currently,
> DictionaryNameFinder.find works something like this:
>
>     for every token in every tokenlist in the dictionary:
>         copy it into a "meta dictionary" of single tokens
>     for every possible subsequence of tokens in the sentence:  // of which there are O(N^2)
>         copy the sequence into a new array
>         if the last token is in the "meta dictionary":
>             make a StringList from the tokens
>             look it up in the dictionary
>
> Dictionary itself is very heavyweight: it's a Set<StringListWrapper>, which
> wraps StringList, which wraps String[]. Every entry in the dictionary
> requires at least four allocated objects (in addition to the Strings): the
> array, StringList, StringListWrapper, and HashMap.Entry. Even contains and
> remove allocate new objects!
> From this comment in DictionaryNameFinder:
>
>     // TODO: improve performance here
>
> it seems like improvements would be welcome. :) Removing some of the object
> overhead would more than make up for not interning strings. Should I create
> a new Jira ticket to propose a more efficient design?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
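The lookup pattern the report describes can be sketched in plain Java. This is a hedged reconstruction, not OpenNLP's actual code: it uses ordinary collections in place of Dictionary/StringList/StringListWrapper, and all class and method names here are illustrative. It shows both the "meta dictionary" of last tokens and the O(N^2) subsequence scan that copies tokens into a fresh list on every candidate lookup, which is where the per-lookup allocation overhead comes from.

```java
import java.util.*;

// Illustrative sketch of the find() loop described in the ticket,
// NOT OpenNLP code. Dictionary entries are plain List<String> values
// held in a HashSet instead of the StringListWrapper layers.
public class NaiveDictionaryFind {

    // Returns [start, end) spans of token subsequences found in the dictionary.
    static List<int[]> find(String[] sentence, Set<List<String>> dictionary) {
        // "meta dictionary": the last token of every dictionary entry,
        // used as a cheap pre-filter before the full lookup.
        Set<String> lastTokens = new HashSet<>();
        for (List<String> entry : dictionary) {
            lastTokens.add(entry.get(entry.size() - 1));
        }

        List<int[]> spans = new ArrayList<>();
        // All O(N^2) subsequences; each surviving candidate copies its
        // tokens into a brand-new list just to perform one hash lookup.
        for (int start = 0; start < sentence.length; start++) {
            for (int end = start + 1; end <= sentence.length; end++) {
                if (!lastTokens.contains(sentence[end - 1])) {
                    continue;
                }
                List<String> candidate =
                        new ArrayList<>(Arrays.asList(sentence).subList(start, end));
                if (dictionary.contains(candidate)) {
                    spans.add(new int[]{start, end});
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<List<String>> dict = new HashSet<>();
        dict.add(Arrays.asList("New", "York"));
        dict.add(Arrays.asList("Boston"));

        String[] sentence = {"I", "flew", "from", "Boston", "to", "New", "York"};
        for (int[] s : find(sentence, dict)) {
            System.out.println(s[0] + ".." + s[1]);
        }
        // Matches "Boston" at [3, 4) and "New York" at [5, 7).
    }
}
```

Even in this stripped-down form, every candidate lookup allocates a fresh ArrayList; the real code additionally wraps each candidate in a StringList and StringListWrapper, which is the overhead the reporter proposes to eliminate.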