[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798057#comment-17798057 ]
ASF GitHub Bot commented on OPENNLP-421: ---------------------------------------- rzo1 commented on code in PR #568: URL: https://github.com/apache/opennlp/pull/568#discussion_r1429734990 ########## opennlp-tools/src/main/java/opennlp/tools/util/jvm/StringInterners.java: ########## @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.util.jvm; + +import java.lang.reflect.Constructor; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Provides string interning utility methods. Interning mechanism can be configured via the + * system property {@code opennlp.interner.class} by specifying an implementation via its + * full qualified classname. It needs to implement {@link StringInterner}. + * <p> + * If not specified by the user, the default interner is {@link CHMStringInterner}. + */ +public class StringInterners { + + private static final Logger LOGGER = LoggerFactory.getLogger(StringInterners.class); + private static final StringInterner INTERNER; + + static { + + final String clazzName = System.getProperty("opennlp.interner.class", Review Comment: ```java final String clazzName = System.getProperty("opennlp.interner.class", CHMStringInterner.class.getCanonicalName()); ``` uses ```java /** * Gets the system property indicated by the specified key. * * First, if there is a security manager, its * {@code checkPropertyAccess} method is called with the * {@code key} as its argument. * <p> * If there is no current set of system properties, a set of system * properties is first created and initialized in the same manner as * for the {@code getProperties} method. * * @param key the name of the system property. * @param def a default value. * @return the string value of the system property, * or the default value if there is no property with that key. * * @throws SecurityException if a security manager exists and its * {@code checkPropertyAccess} method doesn't allow * access to the specified system property. * @throws NullPointerException if {@code key} is {@code null}. * @throws IllegalArgumentException if {@code key} is empty. * @see #setProperty * @see java.lang.SecurityManager#checkPropertyAccess(java.lang.String) * @see java.lang.System#getProperties() */ public static String getProperty(String key, String def) ``` which will return `CHMStringInterner.class.getCanonicalName()` if the property `opennlp.interner.class` is not defined. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > ------------------------------------------------------------------------------ > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder > Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 > Reporter: Jay Hacker > Assignee: Richard Zowalla > Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence: // of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set<StringListWrapper>, which > wraps StringList, which wraps Array<String>. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)