[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17797377#comment-17797377 ]
ASF GitHub Bot commented on OPENNLP-421: ---------------------------------------- kinow commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858767471 > I think, we c/should consider using a `ConcurrentHashMap`-based deduplicator approach to replace "intern()" calls, as investigated and explained by Shipilev in his blog post here: > > https://shipilev.net/jvm/anatomy-quarks/10-string-intern/ > > The removal of the String deduplication, as found in the code segments of this PR, could otherwise lead to more required RAM at runtime - as @kinow already expressed. > > Wdyt? @jzonthemtn @rzo1 > > (A) Removal (PR as is now), or (B) Replacement with concurrent-capable custom deduplicator? (B) looks interesting, +1 for trying that out, and for @rzo1 suggestion to add that on top of this one. >`CHMInterner` First time I've seen an intern made in the code (intentionally like that). Sounds like a good idea! You do the intern without using the heap. Not sure if it will perform as fast as String.intern, but I think it might work! Thanks @mawiesne > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > ------------------------------------------------------------------------------ > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder > Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 > Reporter: Jay Hacker > Assignee: Richard Zowalla > Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence: // of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set<StringListWrapper>, which > wraps StringList, which wraps Array<String>. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)