[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800347#comment-17800347 ] Richard Zowalla commented on OPENNLP-421: - Next release will contain related fixes and should improve performance/ result in lower object creation rates. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Major > Labels: performance > Fix For: 2.3.2 > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800346#comment-17800346 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 merged PR #572: URL: https://github.com/apache/opennlp/pull/572 > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Martin Wiesner >Priority: Major > Labels: performance > Fix For: 2.3.2 > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799904#comment-17799904 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 opened a new pull request, #572: URL: https://github.com/apache/opennlp/pull/572 Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically main)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [ ] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [ ] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: This PR is a follow-up for OPENNLP-421 to remove unnecessary object creation by removing `StringListWrapper` used in `Dictionary`. The sole purpose of the `StringListWrapper` impl was to provide a way to ignore cases for dictionary entries stored in a `StringList` by overriding the related `hashCode()` and `equals(...)` method. This PR delegates the behaviour to the actual `StringList` and adds methods to "convert" the behaviour from a case-sensitive `StringList` to its insensitive form. By doing so, we can reduce the creation of objects for `contains(...)`, `add(...)`, etc. operation. We only create new `StringLists` if their case does not match the case of the dictionary. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Martin Wiesner >Priority: Major > Labels: performance > Fix For: 2.3.2 > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry.
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799901#comment-17799901 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867971510 Thx to all reviewers / comments with feedback & ideas! Think we solved a good thing here for 2.3.2 ! Other PR for OPENNLP-421 incoming in a few minutes. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Martin Wiesner >Priority: Major > Labels: performance > Fix For: 2.3.2 > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799900#comment-17799900 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 merged PR #568: URL: https://github.com/apache/opennlp/pull/568 > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Martin Wiesner >Priority: Major > Labels: performance > Fix For: 2.3.2 > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799899#comment-17799899 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on code in PR #568: URL: https://github.com/apache/opennlp/pull/568#discussion_r1435274503 ## opennlp-tools/src/main/java/opennlp/tools/util/jvm/Interner.java: ## @@ -0,0 +1,23 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.util.jvm; + +public interface Interner { + + String intern(String s); Review Comment: Ah ic. Old comments. So no issue :) > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Martin Wiesner >Priority: Major > Labels: performance > Fix For: 2.3.2 > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799890#comment-17799890 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867907835 @kinow thx for trying. Both JMH benchmark took around 8h together on my old tower. I will address your comments, squash it up and merge it. Stay tuned for the wrapper removal after > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Martin Wiesner >Priority: Major > Labels: performance > Fix For: 2.3.2 > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799862#comment-17799862 ] ASF GitHub Bot commented on OPENNLP-421: kinow commented on code in PR #568: URL: https://github.com/apache/opennlp/pull/568#discussion_r1429217272 ## opennlp-tools/src/main/java/opennlp/tools/util/jvm/DefaultStringInterner.java: ## @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.util.jvm; + +/** + * An interner implementation based on String.intern() Review Comment: Missing period? ## opennlp-tools/src/main/java/opennlp/tools/util/jvm/HMInterner.java: ## @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.util.jvm; + +import java.util.HashMap; +import java.util.Map; + +/** + * A user String interner implementation by Aleksey Shipilëv. + * + * Origin: + * https://shipilev.net/jvm/anatomy-quarks/10-string-intern/;> + * https://shipilev.net/jvm/anatomy-quarks/10-string-intern/ + * + */ +public class HMInterner implements Interner { +private final Map map; + +public HMInterner() { + map = new HashMap<>(); +} + + @Override +public String intern(String s) { +final String exist = map.putIfAbsent(s, s); + return (exist == null) ? s : exist; +} + } Review Comment: Missing newline at end. ## opennlp-tools/src/main/java/opennlp/tools/util/jvm/Interner.java: ## @@ -0,0 +1,23 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.util.jvm; + +public interface Interner { Review Comment: Add a simple javadoc, like "String interner"? ## opennlp-tools/src/main/java/opennlp/tools/util/jvm/Interner.java: ## @@ -0,0 +1,23 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799849#comment-17799849 ] ASF GitHub Bot commented on OPENNLP-421: kinow commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867746846 I did, like the previous time. I invalidated the cache and restart IntelliJ, deleted the Run configuration… that's really odd — although it's Friday afternoon, so my :brain: is probably giving up on me already. Will try again later. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Martin Wiesner >Priority: Major > Labels: performance > Fix For: 2.3.2 > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799843#comment-17799843 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867718995 Did you activate the JMH profile, so it adds the related dependencies? > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Martin Wiesner >Priority: Major > Labels: performance > Fix For: 2.3.2 > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799841#comment-17799841 ] ASF GitHub Bot commented on OPENNLP-421: kinow commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867709096 I'm working until noon today, then I should have time to test it again :tada: (sorry the delay) Trying the branch again after rebuilding it, but I keep getting ```java # Run progress: 0.74% complete, ETA 00:01:51 # Fork: 1 of 1 java.lang.NoClassDefFoundError: opennlp/tools/jmh/ExecutionPlan at java.base/java.lang.Class.forName0(Native Method) at java.base/java.lang.Class.forName(Class.java:375) at org.openjdk.jmh.util.ClassUtils.loadClass(ClassUtils.java:73) ... ``` > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Martin Wiesner >Priority: Major > Labels: performance > Fix For: 2.3.2 > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799691#comment-17799691 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867340084 If this code is integrated, there is another branch based on this PR ( https://github.com/apache/opennlp/commit/7d55599a8ff35adc462a05b882bb3260c7943e35 ) which will remove the unnecessary creation of `StringListWrapper` objects. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798258#comment-17798258 ] ASF GitHub Bot commented on OPENNLP-421: mawiesne commented on code in PR #568: URL: https://github.com/apache/opennlp/pull/568#discussion_r1430325906 ## opennlp-tools/src/main/java/opennlp/tools/util/StringList.java: ## @@ -33,19 +35,19 @@ public class StringList implements Iterable { * Initializes a {@link StringList} instance. * * Note: - * Token String will be replaced by identical internal String object. + * Token String will be interened via {@link StringInterners}. * * @param singleToken One single token */ public StringList(String singleToken) { -tokens = new String[]{singleToken.intern()}; +tokens = new String[]{StringInterners.intern(singleToken)}; } /** * Initializes a {@link StringList} instance. * * Note: - * Token Strings will be replaced by identical internal String object. + * Token Strings will be interened via {@link StringInterners}. Review Comment: Should read `interned`, minus the extra `e`. ## opennlp-tools/src/main/java/opennlp/tools/util/StringList.java: ## @@ -33,19 +35,19 @@ public class StringList implements Iterable { * Initializes a {@link StringList} instance. * * Note: - * Token String will be replaced by identical internal String object. + * Token String will be interened via {@link StringInterners}. Review Comment: Should read `interned`, minus the extra `e`. ## opennlp-tools/src/main/java/opennlp/tools/util/jvm/CHMStringDeduplicator.java: ## @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.util.jvm; + + +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.ThreadLocalRandom; + +import opennlp.tools.commons.Internal; +import opennlp.tools.commons.ThreadSafe; + +/** + * A {@link StringInterner} implementation by Aleksey Shipilëv with relaxed canonical requirements. + * It is a probabilistic deduplication implementation with a default probability of {@code 0.5}. + * Users may implement a custom class with a different probability value. + * + * Origin: + * https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf;> + * https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf + * Review Comment: No extra `` required here. ## opennlp-tools/src/main/java/opennlp/tools/util/jvm/CHMStringDeduplicator.java: ## @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.util.jvm; + + +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.ThreadLocalRandom; + +import opennlp.tools.commons.Internal; +import opennlp.tools.commons.ThreadSafe; + +/** + * A {@link StringInterner} implementation by Aleksey Shipilëv with relaxed canonical requirements. + * It is a probabilistic deduplication implementation with a default probability of {@code 0.5}. + * Users may implement a custom class with a different probability value. + * + * Origin: + * https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf;> + * https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf + * + */ +@Internal +@ThreadSafe +class CHMStringDeduplicator implements
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798227#comment-17798227 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1860649753 Basically, it is now ready for review. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798226#comment-17798226 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1860649042 Here is the other benchmark: ```bash Benchmark (internerClazz) (size) Mode Cnt Score Error Units StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.CHMStringDeduplicator1 thrpt 25 45717169,968 ± 4894532,156 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.CHMStringDeduplicator 100 thrpt 25 544627,570 ± 9900,671 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.CHMStringDeduplicator1 thrpt 25 3876,808 ± 211,760 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.CHMStringDeduplicator 100 thrpt 25 11,190 ±1,009 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.CHMStringInterner1 thrpt 25 43774046,611 ± 11369890,960 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.CHMStringInterner 100 thrpt 25 557563,692 ± 12776,109 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.CHMStringInterner1 thrpt 25 2951,642 ± 189,985 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.CHMStringInterner 100 thrpt 25 8,232 ± 0,847 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.HMStringInterner1 thrpt 25 63103971,334 ± 1958932,641 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.HMStringInterner 100 thrpt 25 854268,631 ± 22261,811 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.HMStringInterner1 thrpt 25 3191,138 ± 241,943 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.HMStringInterner 100 thrpt 25 8,891 ± 1,040 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.JvmStringInterner1 thrpt 258881121,516 ± 108767,585 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.JvmStringInterner 100 thrpt 25 81073,526 ± 342,867 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.JvmStringInterner1 thrpt 25729,416 ± 6,782 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.JvmStringInterner 100 thrpt 25 2,346 ± 0,076 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.NoOpStringInterner1 thrpt 25 132530907,915 ± 368546,226 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.NoOpStringInterner 100 thrpt 256446247,778 ± 14485,272 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.NoOpStringInterner1 thrpt 25 53441,292 ± 107,629 ops/s StringListBenchmark.newWithArrayConstructor opennlp.tools.util.jvm.NoOpStringInterner 100 thrpt 25124,158 ± 1,921 ops/s ``` https://gist.github.com/rzo1/1e85f65ff5d152e168c58e72b2de5375 > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798214#comment-17798214 ] ASF GitHub Bot commented on OPENNLP-421: jzonthemtn commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1860559806 Love it. Curious to see the impact on things like training. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798123#comment-17798123 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1860086040 Here is the JMH result for the interner impls only: ```bash Benchmark (size) Mode Cnt Score Error Units StringDeduplicationBenchmark.chm 1 thrpt 25 36070236,872 ± 1204122,063 ops/s StringDeduplicationBenchmark.chm 100 thrpt 25 354445,896 ± 2893,432 ops/s StringDeduplicationBenchmark.chm 1 thrpt 25 2323,076 ± 25,596 ops/s StringDeduplicationBenchmark.chm 100 thrpt 25 13,137 ± 0,199 ops/s StringDeduplicationBenchmark.chmd051 thrpt 25 45289035,126 ± 143794,468 ops/s StringDeduplicationBenchmark.chmd05 100 thrpt 25 433279,497 ± 3165,815 ops/s StringDeduplicationBenchmark.chmd051 thrpt 25 2692,779 ± 8,173 ops/s StringDeduplicationBenchmark.chmd05 100 thrpt 25 13,413 ± 0,155 ops/s StringDeduplicationBenchmark.hm1 thrpt 25 35123958,776 ± 472485,649 ops/s StringDeduplicationBenchmark.hm 100 thrpt 25 371997,311 ± 6780,622 ops/s StringDeduplicationBenchmark.hm1 thrpt 25 2311,588 ± 115,117 ops/s StringDeduplicationBenchmark.hm 100 thrpt 25 14,073 ± 0,068 ops/s StringDeduplicationBenchmark.intern1 thrpt 25 10040026,472 ± 77470,327 ops/s StringDeduplicationBenchmark.intern 100 thrpt 25 87644,048 ± 844,053 ops/s StringDeduplicationBenchmark.intern1 thrpt 25764,752 ± 34,300 ops/s StringDeduplicationBenchmark.intern 100 thrpt 25 2,956 ± 0,024 ops/s StringDeduplicationBenchmark.noop 1 thrpt 25 148719353,102 ± 743493,703 ops/s StringDeduplicationBenchmark.noop100 thrpt 251491614,947 ± 1587,406 ops/s StringDeduplicationBenchmark.noop 1 thrpt 25 9848,732 ± 11,076 ops/s StringDeduplicationBenchmark.noop100 thrpt 25 78,726 ± 0,064 ops/s ``` https://gist.github.com/rzo1/754b15381041b28718cf44073f4986c5 > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798058#comment-17798058 ] ASF GitHub Bot commented on OPENNLP-421: mawiesne commented on code in PR #568: URL: https://github.com/apache/opennlp/pull/568#discussion_r1429738970 ## opennlp-tools/src/main/java/opennlp/tools/util/jvm/StringInterners.java: ## @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.util.jvm; + +import java.lang.reflect.Constructor; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Provides string interning utility methods. Interning mechanism can be configured via the + * system property {@code opennlp.interner.class} by specifying an implementation via its + * full qualified classname. It needs to implement {@link StringInterner}. + * + * If not specified by the user, the default interner is {@link CHMStringInterner}. + */ +public class StringInterners { + + private static final Logger LOGGER = LoggerFactory.getLogger(StringInterners.class); + private static final StringInterner INTERNER; + + static { + +final String clazzName = System.getProperty("opennlp.interner.class", Review Comment: Ah, ic. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798056#comment-17798056 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on code in PR #568: URL: https://github.com/apache/opennlp/pull/568#discussion_r1429734990 ## opennlp-tools/src/main/java/opennlp/tools/util/jvm/StringInterners.java: ## @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.util.jvm; + +import java.lang.reflect.Constructor; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Provides string interning utility methods. Interning mechanism can be configured via the + * system property {@code opennlp.interner.class} by specifying an implementation via its + * full qualified classname. It needs to implement {@link StringInterner}. + * + * If not specified by the user, the default interner is {@link CHMStringInterner}. + */ +public class StringInterners { + + private static final Logger LOGGER = LoggerFactory.getLogger(StringInterners.class); + private static final StringInterner INTERNER; + + static { + +final String clazzName = System.getProperty("opennlp.interner.class", Review Comment: ```java final String clazzName = System.getProperty("opennlp.interner.class", CHMStringInterner.class.getCanonicalName()); ``` uses ```java /** * Gets the system property indicated by the specified key. * * First, if there is a security manager, its * {@code checkPropertyAccess} method is called with the * {@code key} as its argument. * * If there is no current set of system properties, a set of system * properties is first created and initialized in the same manner as * for the {@code getProperties} method. * * @param key the name of the system property. * @param def a default value. * @return the string value of the system property, * or the default value if there is no property with that key. * * @throws SecurityException if a security manager exists and its * {@code checkPropertyAccess} method doesn't allow * access to the specified system property. * @throws NullPointerException if {@code key} is {@code null}. * @throws IllegalArgumentException if {@code key} is empty. * @see#setProperty * @see java.lang.SecurityManager#checkPropertyAccess(java.lang.String) * @seejava.lang.System#getProperties() */ public static String getProperty(String key, String def) ``` will return `CHMStringInterner.class.getCanonicalName()` if the property is not defined. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798057#comment-17798057 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on code in PR #568: URL: https://github.com/apache/opennlp/pull/568#discussion_r1429734990 ## opennlp-tools/src/main/java/opennlp/tools/util/jvm/StringInterners.java: ## @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.util.jvm; + +import java.lang.reflect.Constructor; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Provides string interning utility methods. Interning mechanism can be configured via the + * system property {@code opennlp.interner.class} by specifying an implementation via its + * full qualified classname. It needs to implement {@link StringInterner}. + * + * If not specified by the user, the default interner is {@link CHMStringInterner}. + */ +public class StringInterners { + + private static final Logger LOGGER = LoggerFactory.getLogger(StringInterners.class); + private static final StringInterner INTERNER; + + static { + +final String clazzName = System.getProperty("opennlp.interner.class", Review Comment: ```java final String clazzName = System.getProperty("opennlp.interner.class", CHMStringInterner.class.getCanonicalName()); ``` uses ```java /** * Gets the system property indicated by the specified key. * * First, if there is a security manager, its * {@code checkPropertyAccess} method is called with the * {@code key} as its argument. * * If there is no current set of system properties, a set of system * properties is first created and initialized in the same manner as * for the {@code getProperties} method. * * @param key the name of the system property. * @param def a default value. * @return the string value of the system property, * or the default value if there is no property with that key. * * @throws SecurityException if a security manager exists and its * {@code checkPropertyAccess} method doesn't allow * access to the specified system property. * @throws NullPointerException if {@code key} is {@code null}. * @throws IllegalArgumentException if {@code key} is empty. * @see#setProperty * @see java.lang.SecurityManager#checkPropertyAccess(java.lang.String) * @seejava.lang.System#getProperties() */ public static String getProperty(String key, String def) ``` which will return `CHMStringInterner.class.getCanonicalName()` if the property `opennlp.interner.class` is not defined. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen >
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798055#comment-17798055 ] ASF GitHub Bot commented on OPENNLP-421: mawiesne commented on code in PR #568: URL: https://github.com/apache/opennlp/pull/568#discussion_r1429730195 ## opennlp-tools/src/main/java/opennlp/tools/util/jvm/StringInterners.java: ## @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.util.jvm; + +import java.lang.reflect.Constructor; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Provides string interning utility methods. Interning mechanism can be configured via the + * system property {@code opennlp.interner.class} by specifying an implementation via its + * full qualified classname. It needs to implement {@link StringInterner}. + * + * If not specified by the user, the default interner is {@link CHMStringInterner}. + */ +public class StringInterners { + + private static final Logger LOGGER = LoggerFactory.getLogger(StringInterners.class); + private static final StringInterner INTERNER; + + static { + +final String clazzName = System.getProperty("opennlp.interner.class", Review Comment: What is the result if the property is not set / found at runtime? How to ensure that the default interner is set "active"? Am I missing something? > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798043#comment-17798043 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1859746321 Hi all, I added related interner / dedup implementations inspired / based on the code provided by @mawiesne from Aleksey. I added the following interner impls: - CHMStringInterner (as provided by Aleksey), thread-safe - CHMStringDeduplicator (as provided by Aleksey in his talk), thread-safe -> relaxes the canonical requirements on interning. It is more or less a probabilistic deduplication - HMStringInterner (as provided by Aleksey), not thread-safe - JvmStringInterner -> relies on `String.intern()` - can be used to get the previous OpenNLP behaviour ;) - NoOpStringInterner -> doesn't actually intern/dedup Strings. The implementation of the static `StringInterners` is based on how Hadoop is doing it [1]. The default interner used in OpenNLP is now: `CHMStringInterner` I added a system property `opennlp.interner.class`, which can be used to specify the interner implementation which will be used at runtime: - If people want the old behaviour back: they can. - If people do not want interning at all: they can. - If people want probabilistic dedup: they can. Currently, an updated benchmark of the different impls is running as well as a full eval build with the default. Will update this PR with the JMH results in a few hours. - [1] https://github.com/c9n/hadoop/blob/master/hadoop-common-project%2Fhadoop-common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fhadoop%2Futil%2FStringInterner.java > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797421#comment-17797421 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858815410 Ack. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797419#comment-17797419 ] ASF GitHub Bot commented on OPENNLP-421: mawiesne commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858813771 > Do you want to take over here (in which case I would re-assign the Jira) or shall I add it in the next days? Feel free to proceed after I've pushed the basic string dedup benchmark, leave Jira as is. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797418#comment-17797418 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858809818 Looks great, @mawiesne that it also applies to Java 17. Think we should give it a try and add our own `CHMInterner`. Do you want to take over here (in which case I would re-assign the Jira) or shall I add it in the next days? Just want to make sure no duplicate work is done here ;-) > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797387#comment-17797387 ] ASF GitHub Bot commented on OPENNLP-421: mawiesne commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858779780 I reproduced the basic String deduplication benchmark by Shipilev (Java 8) with Java 17 (MacOS); the results are consistent with his findings: ``` Benchmark (size) Mode Cnt Score Error Units StringDeduplicationBenchmark.chm 1 thrpt 25 71093243,405 ± 1437503,007 ops/s StringDeduplicationBenchmark.chm 100 thrpt 25698295,053 ± 10420,119 ops/s StringDeduplicationBenchmark.chm 1 thrpt 25 3389,917 ± 56,170 ops/s StringDeduplicationBenchmark.chm 100 thrpt 2522,715 ± 0,120 ops/s StringDeduplicationBenchmark.hm1 thrpt 25 61916588,259 ± 1271732,213 ops/s StringDeduplicationBenchmark.hm 100 thrpt 25714613,401 ± 10472,821 ops/s StringDeduplicationBenchmark.hm1 thrpt 25 3201,984 ± 18,741 ops/s StringDeduplicationBenchmark.hm 100 thrpt 2523,081 ± 0,122 ops/s StringDeduplicationBenchmark.intern1 thrpt 25 11033749,712 ± 35746,520 ops/s StringDeduplicationBenchmark.intern 100 thrpt 25109538,438 ± 193,911 ops/s StringDeduplicationBenchmark.intern1 thrpt 25 997,533 ± 3,149 ops/s StringDeduplicationBenchmark.intern 100 thrpt 25 6,024 ± 0,026 ops/s ``` > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797377#comment-17797377 ] ASF GitHub Bot commented on OPENNLP-421: kinow commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858767471 > I think, we c/should consider using a `ConcurrentHashMap`-based deduplicator approach to replace "intern()" calls, as investigated and explained by Shipilev in his blog post here: > > https://shipilev.net/jvm/anatomy-quarks/10-string-intern/ > > The removal of the String deduplication, as found in the code segments of this PR, could otherwise lead to more required RAM at runtime - as @kinow already expressed. > > Wdyt? @jzonthemtn @rzo1 > > (A) Removal (PR as is now), or (B) Replacement with concurrent-capable custom deduplicator? (B) looks interesting, +1 for trying that out, and for @rzo1 suggestion to add that on top of this one. >`CHMInterner` First time I've seen an intern made in the code (intentionally like that). Sounds like a good idea! You do the intern without using the heap. Not sure if it will perform as fast as String.intern, but I think it might work! Thanks @mawiesne > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797371#comment-17797371 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858754762 Let's make it a draft and add B. Saving memory is always a good thing. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797369#comment-17797369 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858753109 I am OK with A or B. Feel free to add B on-top of this PR. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797367#comment-17797367 ] ASF GitHub Bot commented on OPENNLP-421: mawiesne commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858750219 I think, we c/should consider using a `ConcurrentHashMap`-based deduplicator approach as a replacement for "intern()" calls, as investigated and explained by Shipilev in his blog post here: https://shipilev.net/jvm/anatomy-quarks/10-string-intern/ The removal of the String deduplication, as found in the code segments of this PR, could otherwise lead to more required RAM at runtime - as @kinow already expressed. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797137#comment-17797137 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1857681741 @mawiesne and myself just started an eval run from the office. Let's see if we see anything ;-) > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797054#comment-17797054 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1857359451 @kinow thanks for the details and the flight recording :-) I think, that the current use of String interning was an attempt to optimize string comparisons by using the string constant pool (until it explodes). I guess, that the idea was also to save memory. There is a really nice blog article which summarizes the pros/cons: https://www.codecentric.de/wissens-hub/blog/save-memory-by-using-string-intern-in-java The state, that > Only use String.intern() on Strings you know are occurring multiple times, and only do it to save memory I think, that we can remove a lot of the current object creation related to the use of `StringList`, `StringListWrapper` or the creation of `StringList` objects just for the sake of running a `contains(...)`, which should make up for the overhead due to the removal of String interning. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796940#comment-17796940 ] ASF GitHub Bot commented on OPENNLP-421: kinow commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1856892659 Thanks for the GC numbers. The new code seems a bit heavier on the GC and mem allocations, but I think that was expected. A bit late here, but I decided to give it a try. Excuse if I say anything silly, I can check these results again tomorrow after having some :coffee: . I opened your branch on IntelliJ, then made the jmh profile active by default (how do you tell IntelliJ to include the deps of a profile? anywho), invalidated cache, etc.. Executed the tests just to make sure they were running, as you already provided the results. Then executed with the profiler, to generate the `.jfr` file. Saved the `jmh` source folder, checked out `main`, then patched the `pom.xml` files, and executed the same test again, to get another `.jfr` for the `main` branch (so earlier file is for this branch, older is for main). Opened both in IntelliJ and selected the `.jfr` file from this branch to compare with the one from `main`. Not sure if they show a fair comparison as the difference in throughput might exacerbate some number, but in case you find it useful, @rzo1 . ## CPU samples ![flamegraph](https://github.com/apache/opennlp/assets/304786/5d715064-5ea0-4dbf-8d89-f8c277ad0a1f) I think it just confirms the change of calls to intern functions. ![image](https://github.com/apache/opennlp/assets/304786/26235942-e711-4ed3-b242-18098b9ddca1) Also the increase in GC calls. ![image](https://github.com/apache/opennlp/assets/304786/759928d2-1674-426a-a80a-24a39bf90ff8) And I think the Random calls are from JMH for having more/less samples due to higher throughput. ![image](https://github.com/apache/opennlp/assets/304786/a0eb0c07-8feb-421f-bd1d-01f56135479e) ## Memory allocations ![flamegraph](https://github.com/apache/opennlp/assets/304786/7e20ff7b-3b5d-45be-9984-67229c9441c6) With a lot more Strings, as expected, as well as bytes due to the array copy calls. ![image](https://github.com/apache/opennlp/assets/304786/d3f203ff-1bbc-4d33-8c70-c2ed93861443) ![image](https://github.com/apache/opennlp/assets/304786/43ba112d-2402-4c68-966f-f2e335334f74) Unfortunately I don't have any better or more practical test to compare memory (maybe a single run of a large example would be better for comparing memory, instead of multiple jmh runs…). But I don't think this should really be a problem. I wonder if that was some premature optimization or if at some point someone had memory issues. But a change that's easy to revert if needed, and that way users won't have to modify heap settings. So I am inclined to approve it! :tada: Will wait until tomorrow so you, Martin, Jeff, others had some time to review :sleeping_bed: Thanks! Bruno > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796894#comment-17796894 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1856512985 Added GC Profiling output to JMH. Here it is # Old Code ``` BenchmarkMode Cnt Score Error Units StringListBenchmark.newWithArrayConstructor thrpt 10 1,104 ± 0,311 ops/s StringListBenchmark.newWithArrayConstructor:gc.alloc.rate thrpt 10 73,739 ±14,797 MB/sec StringListBenchmark.newWithArrayConstructor:gc.alloc.rate.norm thrpt 10 92007982,187 ± 37715,232B/op StringListBenchmark.newWithArrayConstructor:gc.countthrpt 10 68,000 counts StringListBenchmark.newWithArrayConstructor:gc.time thrpt 10 4396,000 ms ``` # New Code ``` BenchmarkMode Cnt Score Error Units StringListBenchmark.newWithArrayConstructor thrpt 10 471,103 ± 170,077 ops/s StringListBenchmark.newWithArrayConstructor:gc.alloc.rate thrpt 10 443,527 ± 37,567 MB/sec StringListBenchmark.newWithArrayConstructor:gc.alloc.rate.norm thrpt 10 9242,655 ± 0,953B/op StringListBenchmark.newWithArrayConstructor:gc.countthrpt 10 277,000counts StringListBenchmark.newWithArrayConstructor:gc.time thrpt 10 14432,000ms `` > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796893#comment-17796893 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on code in PR #568: URL: https://github.com/apache/opennlp/pull/568#discussion_r1427220452 ## opennlp-tools/src/jmh/java/opennlp/tools/jmh/ExecutionPlan.java: ## @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.jmh; + +import java.util.Random; + +import org.openjdk.jmh.annotations.Level; +import org.openjdk.jmh.annotations.Scope; +import org.openjdk.jmh.annotations.Setup; +import org.openjdk.jmh.annotations.State; + +@State(Scope.Benchmark) +public class ExecutionPlan { + + private static final Random RANDOM = new Random(42); + + public final String[] strings = new String[1_000_000]; + + @Setup(Level.Invocation) + public void setUp() { +for (int i = 0; i < 1_000_000; i++) { + strings[i] = generateRandomString(15); +} + } + + private static String generateRandomString(int length) { +final String characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"; Review Comment: did > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796886#comment-17796886 ] ASF GitHub Bot commented on OPENNLP-421: mawiesne commented on code in PR #568: URL: https://github.com/apache/opennlp/pull/568#discussion_r1427176318 ## opennlp-tools/src/jmh/java/opennlp/tools/jmh/ExecutionPlan.java: ## @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.jmh; + +import java.util.Random; + +import org.openjdk.jmh.annotations.Level; +import org.openjdk.jmh.annotations.Scope; +import org.openjdk.jmh.annotations.Setup; +import org.openjdk.jmh.annotations.State; + +@State(Scope.Benchmark) +public class ExecutionPlan { + + private static final Random RANDOM = new Random(42); + + public final String[] strings = new String[1_000_000]; + + @Setup(Level.Invocation) + public void setUp() { +for (int i = 0; i < 1_000_000; i++) { + strings[i] = generateRandomString(15); +} + } + + private static String generateRandomString(int length) { +final String characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"; Review Comment: Can we make this a constant? By nature, it's pretty "stable", yet recreated every time the private helper method is called. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796745#comment-17796745 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1855883632 > If I understood the issue and read the benchmark gist results correctly, with the new code JMH benchmark starts faster, then the old code catchs up and plateaus around 1.2K ops/s. But in overall the non interned version still process more ops/s, with the advantage of not having the JVM max memory issue. It is `,`, so we have (if you look at the mean values): - 1,211 ± 0,149 ops/s (old code) vs. - 468,404 ± 140,482 ops/s (new code) Systemarray.copy(...) is the reason for the speed improvement from merely 1 ops/s to 328 op/s (worst case) > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object > overhead would more than make up for interning strings. Should I create a > new Jira ticket to propose a more efficient design? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796739#comment-17796739 ] ASF GitHub Bot commented on OPENNLP-421: kinow commented on code in PR #568: URL: https://github.com/apache/opennlp/pull/568#discussion_r1426703504 ## opennlp-tools/src/jmh/java/opennlp/tools/jmh/ExecutionPlan.java: ## @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.jmh; + +import java.util.Random; + +import org.openjdk.jmh.annotations.Level; +import org.openjdk.jmh.annotations.Scope; +import org.openjdk.jmh.annotations.Setup; +import org.openjdk.jmh.annotations.State; + +@State(Scope.Benchmark) +public class ExecutionPlan { + + private static final Random RANDOM = new Random(42); + + public final String[] strings = new String[1_000_000]; + + @Setup(Level.Invocation) + public void setUp() { +for (int i = 0; i < 1_000_000; i++) { + strings[i] = generateRandomString(15); +} + } + + private static String generateRandomString(int length) { +final String characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"; +final StringBuilder randomString = new StringBuilder(); + +for (int i = 0; i < length; i++) { + int index = RANDOM.nextInt(characters.length()); + randomString.append(characters.charAt(index)); +} + +return randomString.toString(); + } +} Review Comment: Missing newline, but not important. > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there are O(N^2) > copy the sequence into a new array > if the last token is in the "meta dictionary": > make a StringList from the tokens > look it up in the dictionary > Dictionary itself is very heavyweight: it's a Set, which > wraps StringList, which wraps Array. Every entry in the dictionary > requires at least four allocated objects (in addition to the Strings): Array, > StringList, StringListWrapper, and HashMap.Entry. Even contains and remove > allocate new objects! > From this comment in DictionaryNameFinder: > // TODO: improve performance here > It seems like improvements would be welcome. :) Removing some of the object
[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796698#comment-17796698 ] ASF GitHub Bot commented on OPENNLP-421: rzo1 opened a new pull request, #568: URL: https://github.com/apache/opennlp/pull/568 ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically main)? - [ ] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [ ] Have you written or updated unit tests to verify your changes? - [x] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [ ] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: I found [OPENNLP-421](https://issues.apache.org/jira/projects/OPENNLP/issues/OPENNLP-421) in our issue tracker and added a simple JMH benchmark to look into the impact of interning strings and our multiple copy-operations in `StringList`. - Here is a JMH Benchmark for the vanilla version of OpenNLP: https://gist.github.com/rzo1/d2ba3e48c6bc190977baf9ee42388823 - Here is a JMH Benchmark with String interning removed from `StringList` and usage of `System.arraycopy(...)`: https://gist.github.com/rzo1/327c84ebac18b62baf927f1a87ec7480 As stated by the original issue opener, String interning was most likely used because of: > Presumably this is an attempt to reduce memory usage for duplicate tokens. Interned Strings are stored in the JVM's permanent generation, which has a small fixed size (seems to be about 83 MB on modern 64-bit JVMs: https://www.oracle.com/java/technologies/javase/vmoptions-jsp.html) So if we have huge `StringLists` or `Dictionaries`, users need to increase `-XX:MaxPermSize=` option to avoid an OutOfMemoryException. There might also be room for improvement in the `Dictionary` implemention as we mess around with `StringLists`, `StringListWrappers`, etc. - but that would be something to look into next. Just want to have some thoughts on the interning removal here ;-) > Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning > -- > > Key: OPENNLP-421 > URL: https://issues.apache.org/jira/browse/OPENNLP-421 > Project: OpenNLP > Issue Type: Bug > Components: Name Finder >Affects Versions: tools-1.5.2-incubating > Environment: RedHat 5, JDK 1.6.0_29 >Reporter: Jay Hacker >Assignee: Richard Zowalla >Priority: Minor > Labels: performance > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of StringList: > https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup > > calls intern() on every String. Presumably this is an attempt to reduce > memory usage for duplicate tokens. Interned Strings are stored in the JVM's > permanent generation, which has a small fixed size (seems to be about 83 MB > on modern 64-bit JVMs: > [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). > Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen > space. > The size of the PermGen can be increased with the -XX:MaxPermSize= option to > the JVM. However, this option is non-standard and not well known, and it > would be nice if OpenNLP worked out of the box without deep JVM tuning. > This immediate problem could be fixed by simply not interning Strings. > Looking at the Dictionary and DictionaryNameFinder code as a whole, however, > there is a huge amount of room for performance improvement. Currently, > DictionaryNameFinder.find works something like this: > for every token in every tokenlist in the dictionary: > copy it into a "meta dictionary" of single tokens > for every possible subsequence of tokens in the sentence:// of which > there