[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-25 Thread Richard Zowalla (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800347#comment-17800347
 ] 

Richard Zowalla commented on OPENNLP-421:
-

Next release will contain related fixes and should improve performance/ result 
in lower object creation rates.

> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Major
>  Labels: performance
> Fix For: 2.3.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800346#comment-17800346
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 merged PR #572:
URL: https://github.com/apache/opennlp/pull/572




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: performance
> Fix For: 2.3.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799904#comment-17799904
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 opened a new pull request, #572:
URL: https://github.com/apache/opennlp/pull/572

   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP- where  is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   
   This PR is a follow-up for OPENNLP-421 to remove unnecessary object creation 
by removing `StringListWrapper` used in `Dictionary`.
   
   The sole purpose of the `StringListWrapper` impl was to provide a way to 
ignore cases for dictionary entries stored in a `StringList` by overriding the 
related `hashCode()` and `equals(...)` method.
   
   This PR delegates the behaviour to the actual `StringList` and adds methods 
to "convert" the behaviour from a case-sensitive `StringList` to its 
insensitive form.
   
   By doing so, we can reduce the creation of objects for `contains(...)`, 
`add(...)`, etc. operation. We only create new `StringLists` if their case does 
not match the case of the dictionary.
   




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: performance
> Fix For: 2.3.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  

[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799901#comment-17799901
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867971510

   Thx to all reviewers / comments with feedback & ideas! Think we solved a 
good thing here for 2.3.2 ! Other PR for OPENNLP-421 incoming in a few minutes.




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: performance
> Fix For: 2.3.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799900#comment-17799900
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 merged PR #568:
URL: https://github.com/apache/opennlp/pull/568




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: performance
> Fix For: 2.3.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799899#comment-17799899
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on code in PR #568:
URL: https://github.com/apache/opennlp/pull/568#discussion_r1435274503


##
opennlp-tools/src/main/java/opennlp/tools/util/jvm/Interner.java:
##
@@ -0,0 +1,23 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.util.jvm;
+
+public interface Interner {
+
+  String intern(String s);

Review Comment:
   Ah ic. Old comments. So no issue :)





> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: performance
> Fix For: 2.3.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799890#comment-17799890
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867907835

   @kinow thx for trying. Both JMH benchmark took around 8h together on my old 
tower. I will address your comments, squash it up and merge it. Stay tuned for 
the wrapper removal after 




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: performance
> Fix For: 2.3.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799862#comment-17799862
 ] 

ASF GitHub Bot commented on OPENNLP-421:


kinow commented on code in PR #568:
URL: https://github.com/apache/opennlp/pull/568#discussion_r1429217272


##
opennlp-tools/src/main/java/opennlp/tools/util/jvm/DefaultStringInterner.java:
##
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.jvm;
+
+/**
+ * An interner implementation based on String.intern()

Review Comment:
   Missing period?



##
opennlp-tools/src/main/java/opennlp/tools/util/jvm/HMInterner.java:
##
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.jvm;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * A user String interner implementation by Aleksey Shipilëv.
+ * 
+ * Origin:
+ * https://shipilev.net/jvm/anatomy-quarks/10-string-intern/;>
+ *   https://shipilev.net/jvm/anatomy-quarks/10-string-intern/
+ * 
+ */
+public class HMInterner implements Interner {
+private final Map map;
+
+public HMInterner() {
+  map = new HashMap<>();
+}
+
+  @Override
+public String intern(String s) {
+final String exist = map.putIfAbsent(s, s);
+  return (exist == null) ? s : exist;
+}
+  }

Review Comment:
   Missing newline at end.



##
opennlp-tools/src/main/java/opennlp/tools/util/jvm/Interner.java:
##
@@ -0,0 +1,23 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.util.jvm;
+
+public interface Interner {

Review Comment:
   Add a simple javadoc, like "String interner"?



##
opennlp-tools/src/main/java/opennlp/tools/util/jvm/Interner.java:
##
@@ -0,0 +1,23 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package 

[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799849#comment-17799849
 ] 

ASF GitHub Bot commented on OPENNLP-421:


kinow commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867746846

   I did, like the previous time. I invalidated the cache and restart IntelliJ, 
deleted the Run configuration… that's really odd — although it's Friday 
afternoon, so my :brain: is probably giving up on me already. Will try again 
later.




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: performance
> Fix For: 2.3.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799843#comment-17799843
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867718995

   Did you activate the JMH profile, so it adds the related dependencies?




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: performance
> Fix For: 2.3.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799841#comment-17799841
 ] 

ASF GitHub Bot commented on OPENNLP-421:


kinow commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867709096

   I'm working until noon today, then I should have time to test it again 
:tada: (sorry the delay)
   
   Trying the branch again after rebuilding it, but I keep getting
   
   ```java
   # Run progress: 0.74% complete, ETA 00:01:51
   # Fork: 1 of 1
   
   
   java.lang.NoClassDefFoundError: opennlp/tools/jmh/ExecutionPlan
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:375)
at org.openjdk.jmh.util.ClassUtils.loadClass(ClassUtils.java:73)
   ...
   ```




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Martin Wiesner
>Priority: Major
>  Labels: performance
> Fix For: 2.3.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799691#comment-17799691
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1867340084

   If this code is integrated, there is another branch based on this PR ( 
https://github.com/apache/opennlp/commit/7d55599a8ff35adc462a05b882bb3260c7943e35
 ) which will remove the unnecessary creation of `StringListWrapper` objects. 




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798258#comment-17798258
 ] 

ASF GitHub Bot commented on OPENNLP-421:


mawiesne commented on code in PR #568:
URL: https://github.com/apache/opennlp/pull/568#discussion_r1430325906


##
opennlp-tools/src/main/java/opennlp/tools/util/StringList.java:
##
@@ -33,19 +35,19 @@ public class StringList implements Iterable {
* Initializes a {@link StringList} instance.
* 
* Note: 
-   * Token String will be replaced by identical internal String object.
+   * Token String will be interened via {@link StringInterners}.
*
* @param singleToken One single token
*/
   public StringList(String singleToken) {
-tokens = new String[]{singleToken.intern()};
+tokens = new String[]{StringInterners.intern(singleToken)};
   }
 
   /**
* Initializes a {@link StringList} instance.
* 
* Note: 
-   * Token Strings will be replaced by identical internal String object.
+   * Token Strings will be interened via {@link StringInterners}.

Review Comment:
   Should read `interned`, minus the extra `e`.



##
opennlp-tools/src/main/java/opennlp/tools/util/StringList.java:
##
@@ -33,19 +35,19 @@ public class StringList implements Iterable {
* Initializes a {@link StringList} instance.
* 
* Note: 
-   * Token String will be replaced by identical internal String object.
+   * Token String will be interened via {@link StringInterners}.

Review Comment:
   Should read `interned`, minus the extra `e`.



##
opennlp-tools/src/main/java/opennlp/tools/util/jvm/CHMStringDeduplicator.java:
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.jvm;
+
+
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ThreadLocalRandom;
+
+import opennlp.tools.commons.Internal;
+import opennlp.tools.commons.ThreadSafe;
+
+/**
+ * A {@link StringInterner} implementation by Aleksey Shipilëv with relaxed 
canonical requirements.
+ * It is a probabilistic deduplication implementation with a default 
probability of {@code 0.5}.
+ * Users may implement a custom class with a different probability value.
+ * 
+ * Origin:
+ * https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf;>
+ * https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf
+ * 

Review Comment:
   No extra `` required here.



##
opennlp-tools/src/main/java/opennlp/tools/util/jvm/CHMStringDeduplicator.java:
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.jvm;
+
+
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ThreadLocalRandom;
+
+import opennlp.tools.commons.Internal;
+import opennlp.tools.commons.ThreadSafe;
+
+/**
+ * A {@link StringInterner} implementation by Aleksey Shipilëv with relaxed 
canonical requirements.
+ * It is a probabilistic deduplication implementation with a default 
probability of {@code 0.5}.
+ * Users may implement a custom class with a different probability value.
+ * 
+ * Origin:
+ * https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf;>
+ * https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf
+ * 
+ */
+@Internal
+@ThreadSafe
+class CHMStringDeduplicator implements 

[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798227#comment-17798227
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1860649753

   Basically, it is now ready for review.




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798226#comment-17798226
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1860649042

   Here is the other benchmark:
   
   ```bash
   Benchmark 
(internerClazz)   (size)   Mode  Cnt  Score  Error  Units
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.CHMStringDeduplicator1  thrpt   25   
45717169,968 ±  4894532,156  ops/s
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.CHMStringDeduplicator  100  thrpt   25 
544627,570 ± 9900,671  ops/s
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.CHMStringDeduplicator1  thrpt   25   
3876,808 ±  211,760  ops/s
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.CHMStringDeduplicator  100  thrpt   25 
11,190 ±1,009  ops/s
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.CHMStringInterner1  thrpt   25   43774046,611 ± 
11369890,960  ops/s
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.CHMStringInterner  100  thrpt   25 557563,692 ±  
  12776,109  ops/s
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.CHMStringInterner1  thrpt   25   2951,642 ±  
189,985  ops/s
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.CHMStringInterner  100  thrpt   25  8,232 ±  
  0,847  ops/s
   StringListBenchmark.newWithArrayConstructor   
opennlp.tools.util.jvm.HMStringInterner1  thrpt   25   63103971,334 ±  
1958932,641  ops/s
   StringListBenchmark.newWithArrayConstructor   
opennlp.tools.util.jvm.HMStringInterner  100  thrpt   25 854268,631 ±   
 22261,811  ops/s
   StringListBenchmark.newWithArrayConstructor   
opennlp.tools.util.jvm.HMStringInterner1  thrpt   25   3191,138 ±   
   241,943  ops/s
   StringListBenchmark.newWithArrayConstructor   
opennlp.tools.util.jvm.HMStringInterner  100  thrpt   25  8,891 ±   
 1,040  ops/s
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.JvmStringInterner1  thrpt   258881121,516 ±  
 108767,585  ops/s
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.JvmStringInterner  100  thrpt   25  81073,526 ±  
342,867  ops/s
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.JvmStringInterner1  thrpt   25729,416 ±  
  6,782  ops/s
   StringListBenchmark.newWithArrayConstructor  
opennlp.tools.util.jvm.JvmStringInterner  100  thrpt   25  2,346 ±  
  0,076  ops/s
   StringListBenchmark.newWithArrayConstructor 
opennlp.tools.util.jvm.NoOpStringInterner1  thrpt   25  132530907,915 ± 
  368546,226  ops/s
   StringListBenchmark.newWithArrayConstructor 
opennlp.tools.util.jvm.NoOpStringInterner  100  thrpt   256446247,778 ± 
   14485,272  ops/s
   StringListBenchmark.newWithArrayConstructor 
opennlp.tools.util.jvm.NoOpStringInterner1  thrpt   25  53441,292 ± 
 107,629  ops/s
   StringListBenchmark.newWithArrayConstructor 
opennlp.tools.util.jvm.NoOpStringInterner  100  thrpt   25124,158 ± 
   1,921  ops/s
   ```
   https://gist.github.com/rzo1/1e85f65ff5d152e168c58e72b2de5375




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can 

[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798214#comment-17798214
 ] 

ASF GitHub Bot commented on OPENNLP-421:


jzonthemtn commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1860559806

   Love it. Curious to see the impact on things like training.




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798123#comment-17798123
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1860086040

   Here is the JMH result for the interner impls only:
   
   ```bash
   Benchmark (size)   Mode  Cnt  Score  
   Error  Units
   StringDeduplicationBenchmark.chm   1  thrpt   25   36070236,872 ± 
1204122,063  ops/s
   StringDeduplicationBenchmark.chm 100  thrpt   25 354445,896 ±
2893,432  ops/s
   StringDeduplicationBenchmark.chm   1  thrpt   25   2323,076 ±
  25,596  ops/s
   StringDeduplicationBenchmark.chm 100  thrpt   25 13,137 ±
   0,199  ops/s
   StringDeduplicationBenchmark.chmd051  thrpt   25   45289035,126 ±  
143794,468  ops/s
   StringDeduplicationBenchmark.chmd05  100  thrpt   25 433279,497 ±
3165,815  ops/s
   StringDeduplicationBenchmark.chmd051  thrpt   25   2692,779 ±
   8,173  ops/s
   StringDeduplicationBenchmark.chmd05  100  thrpt   25 13,413 ±
   0,155  ops/s
   StringDeduplicationBenchmark.hm1  thrpt   25   35123958,776 ±  
472485,649  ops/s
   StringDeduplicationBenchmark.hm  100  thrpt   25 371997,311 ±
6780,622  ops/s
   StringDeduplicationBenchmark.hm1  thrpt   25   2311,588 ±
 115,117  ops/s
   StringDeduplicationBenchmark.hm  100  thrpt   25 14,073 ±
   0,068  ops/s
   StringDeduplicationBenchmark.intern1  thrpt   25   10040026,472 ±   
77470,327  ops/s
   StringDeduplicationBenchmark.intern  100  thrpt   25  87644,048 ±
 844,053  ops/s
   StringDeduplicationBenchmark.intern1  thrpt   25764,752 ±
  34,300  ops/s
   StringDeduplicationBenchmark.intern  100  thrpt   25  2,956 ±
   0,024  ops/s
   StringDeduplicationBenchmark.noop  1  thrpt   25  148719353,102 ±  
743493,703  ops/s
   StringDeduplicationBenchmark.noop100  thrpt   251491614,947 ±
1587,406  ops/s
   StringDeduplicationBenchmark.noop  1  thrpt   25   9848,732 ±
  11,076  ops/s
   StringDeduplicationBenchmark.noop100  thrpt   25 78,726 ±
   0,064  ops/s
   ```
   
   https://gist.github.com/rzo1/754b15381041b28718cf44073f4986c5




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even 

[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798058#comment-17798058
 ] 

ASF GitHub Bot commented on OPENNLP-421:


mawiesne commented on code in PR #568:
URL: https://github.com/apache/opennlp/pull/568#discussion_r1429738970


##
opennlp-tools/src/main/java/opennlp/tools/util/jvm/StringInterners.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.util.jvm;
+
+import java.lang.reflect.Constructor;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Provides string interning utility methods. Interning mechanism can be 
configured via the
+ * system property {@code opennlp.interner.class} by specifying an 
implementation via its
+ * full qualified classname. It needs to implement {@link StringInterner}.
+ * 
+ * If not specified by the user, the default interner is {@link 
CHMStringInterner}.
+ */
+public class StringInterners {
+
+  private static final Logger LOGGER = 
LoggerFactory.getLogger(StringInterners.class);
+  private static final StringInterner INTERNER;
+
+  static {
+
+final String clazzName = System.getProperty("opennlp.interner.class",

Review Comment:
   Ah, ic.





> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798056#comment-17798056
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on code in PR #568:
URL: https://github.com/apache/opennlp/pull/568#discussion_r1429734990


##
opennlp-tools/src/main/java/opennlp/tools/util/jvm/StringInterners.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.util.jvm;
+
+import java.lang.reflect.Constructor;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Provides string interning utility methods. Interning mechanism can be 
configured via the
+ * system property {@code opennlp.interner.class} by specifying an 
implementation via its
+ * full qualified classname. It needs to implement {@link StringInterner}.
+ * 
+ * If not specified by the user, the default interner is {@link 
CHMStringInterner}.
+ */
+public class StringInterners {
+
+  private static final Logger LOGGER = 
LoggerFactory.getLogger(StringInterners.class);
+  private static final StringInterner INTERNER;
+
+  static {
+
+final String clazzName = System.getProperty("opennlp.interner.class",

Review Comment:
   ```java
   final String clazzName = System.getProperty("opennlp.interner.class",
   CHMStringInterner.class.getCanonicalName());
   ```
   
   uses
   
   ```java
   /**
* Gets the system property indicated by the specified key.
*
* First, if there is a security manager, its
* {@code checkPropertyAccess} method is called with the
* {@code key} as its argument.
* 
* If there is no current set of system properties, a set of system
* properties is first created and initialized in the same manner as
* for the {@code getProperties} method.
*
* @param  key   the name of the system property.
* @param  def   a default value.
* @return the string value of the system property,
* or the default value if there is no property with that 
key.
*
* @throws SecurityException  if a security manager exists and its
* {@code checkPropertyAccess} method doesn't allow
* access to the specified system property.
* @throws NullPointerException if {@code key} is {@code null}.
* @throws IllegalArgumentException if {@code key} is empty.
* @see#setProperty
* @see
java.lang.SecurityManager#checkPropertyAccess(java.lang.String)
* @seejava.lang.System#getProperties()
*/
   public static String getProperty(String key, String def) 
   ```
   will return `CHMStringInterner.class.getCanonicalName()` if the property is 
not defined.
   





> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the 

[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798057#comment-17798057
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on code in PR #568:
URL: https://github.com/apache/opennlp/pull/568#discussion_r1429734990


##
opennlp-tools/src/main/java/opennlp/tools/util/jvm/StringInterners.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.util.jvm;
+
+import java.lang.reflect.Constructor;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Provides string interning utility methods. Interning mechanism can be 
configured via the
+ * system property {@code opennlp.interner.class} by specifying an 
implementation via its
+ * full qualified classname. It needs to implement {@link StringInterner}.
+ * 
+ * If not specified by the user, the default interner is {@link 
CHMStringInterner}.
+ */
+public class StringInterners {
+
+  private static final Logger LOGGER = 
LoggerFactory.getLogger(StringInterners.class);
+  private static final StringInterner INTERNER;
+
+  static {
+
+final String clazzName = System.getProperty("opennlp.interner.class",

Review Comment:
   ```java
   final String clazzName = System.getProperty("opennlp.interner.class",
   CHMStringInterner.class.getCanonicalName());
   ```
   
   uses
   
   ```java
   /**
* Gets the system property indicated by the specified key.
*
* First, if there is a security manager, its
* {@code checkPropertyAccess} method is called with the
* {@code key} as its argument.
* 
* If there is no current set of system properties, a set of system
* properties is first created and initialized in the same manner as
* for the {@code getProperties} method.
*
* @param  key   the name of the system property.
* @param  def   a default value.
* @return the string value of the system property,
* or the default value if there is no property with that 
key.
*
* @throws SecurityException  if a security manager exists and its
* {@code checkPropertyAccess} method doesn't allow
* access to the specified system property.
* @throws NullPointerException if {@code key} is {@code null}.
* @throws IllegalArgumentException if {@code key} is empty.
* @see#setProperty
* @see
java.lang.SecurityManager#checkPropertyAccess(java.lang.String)
* @seejava.lang.System#getProperties()
*/
   public static String getProperty(String key, String def) 
   ```
   which will return `CHMStringInterner.class.getCanonicalName()` if the 
property `opennlp.interner.class` is not defined.
   





> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> 

[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798055#comment-17798055
 ] 

ASF GitHub Bot commented on OPENNLP-421:


mawiesne commented on code in PR #568:
URL: https://github.com/apache/opennlp/pull/568#discussion_r1429730195


##
opennlp-tools/src/main/java/opennlp/tools/util/jvm/StringInterners.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.util.jvm;
+
+import java.lang.reflect.Constructor;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Provides string interning utility methods. Interning mechanism can be 
configured via the
+ * system property {@code opennlp.interner.class} by specifying an 
implementation via its
+ * full qualified classname. It needs to implement {@link StringInterner}.
+ * 
+ * If not specified by the user, the default interner is {@link 
CHMStringInterner}.
+ */
+public class StringInterners {
+
+  private static final Logger LOGGER = 
LoggerFactory.getLogger(StringInterners.class);
+  private static final StringInterner INTERNER;
+
+  static {
+
+final String clazzName = System.getProperty("opennlp.interner.class",

Review Comment:
   What is the result if the property is not set / found at runtime? 
   How to ensure that the default interner is set "active"? 
   
   Am I missing something?





> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new 

[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798043#comment-17798043
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1859746321

   Hi all,
   
   I added related interner / dedup implementations inspired / based on the 
code provided by @mawiesne from Aleksey.
   
   I added the following interner impls:
   
   - CHMStringInterner (as provided by Aleksey), thread-safe
   - CHMStringDeduplicator (as provided by Aleksey in his talk), thread-safe -> 
relaxes the canonical requirements on interning. It is more or less a 
probabilistic deduplication
   - HMStringInterner (as provided by Aleksey), not thread-safe
   - JvmStringInterner -> relies on `String.intern()` - can be used to get the 
previous OpenNLP behaviour ;)
   - NoOpStringInterner -> doesn't actually intern/dedup Strings. 
   
   The implementation of the static `StringInterners` is based on how Hadoop is 
doing it [1]. The default interner used in OpenNLP is now: `CHMStringInterner`
   I added a system property `opennlp.interner.class`, which can be used to 
specify the interner implementation which will be used at runtime:
   
   - If people want the old behaviour back: they can. 
   - If people do not want interning at all: they can.
   - If people want probabilistic dedup: they can.
   
   Currently, an updated benchmark of the different impls is running as well as 
a full eval build with the default.
   
   Will update this PR with the JMH results in a few hours.
   
   
   - [1] 
https://github.com/c9n/hadoop/blob/master/hadoop-common-project%2Fhadoop-common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fhadoop%2Futil%2FStringInterner.java




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797421#comment-17797421
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858815410

   Ack.




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797419#comment-17797419
 ] 

ASF GitHub Bot commented on OPENNLP-421:


mawiesne commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858813771

   > Do you want to take over here (in which case I would re-assign the Jira) 
or shall I add it in the next days? 
   
   Feel free to proceed after I've pushed the basic string dedup benchmark, 
leave Jira as is.




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797418#comment-17797418
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858809818

   Looks great, @mawiesne that it also applies to Java 17. Think we should give 
it a try and add our own `CHMInterner`. Do you want to take over here (in which 
case I would re-assign the Jira) or shall I add it in the next days? Just want 
to make sure no duplicate work is done here ;-)




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797387#comment-17797387
 ] 

ASF GitHub Bot commented on OPENNLP-421:


mawiesne commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858779780

   I reproduced the basic String deduplication benchmark by Shipilev (Java 8) 
with Java 17 (MacOS); the results are consistent with his findings:
   
   ```
   Benchmark (size)   Mode  Cnt Score   
  Error  Units
   StringDeduplicationBenchmark.chm   1  thrpt   25  71093243,405 ± 
1437503,007  ops/s
   StringDeduplicationBenchmark.chm 100  thrpt   25698295,053 ±   
10420,119  ops/s
   StringDeduplicationBenchmark.chm   1  thrpt   25  3389,917 ± 
 56,170  ops/s
   StringDeduplicationBenchmark.chm 100  thrpt   2522,715 ± 
  0,120  ops/s
   StringDeduplicationBenchmark.hm1  thrpt   25  61916588,259 ± 
1271732,213  ops/s
   StringDeduplicationBenchmark.hm  100  thrpt   25714613,401 ±   
10472,821  ops/s
   StringDeduplicationBenchmark.hm1  thrpt   25  3201,984 ± 
 18,741  ops/s
   StringDeduplicationBenchmark.hm  100  thrpt   2523,081 ± 
  0,122  ops/s
   StringDeduplicationBenchmark.intern1  thrpt   25  11033749,712 ±   
35746,520  ops/s
   StringDeduplicationBenchmark.intern  100  thrpt   25109538,438 ± 
193,911  ops/s
   StringDeduplicationBenchmark.intern1  thrpt   25   997,533 ± 
  3,149  ops/s
   StringDeduplicationBenchmark.intern  100  thrpt   25 6,024 ± 
  0,026  ops/s
   ```




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797377#comment-17797377
 ] 

ASF GitHub Bot commented on OPENNLP-421:


kinow commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858767471

   > I think, we c/should consider using a `ConcurrentHashMap`-based 
deduplicator approach to replace "intern()" calls, as investigated and 
explained by Shipilev in his blog post here:
   > 
   > https://shipilev.net/jvm/anatomy-quarks/10-string-intern/
   > 
   > The removal of the String deduplication, as found in the code segments of 
this PR, could otherwise lead to more required RAM at runtime - as @kinow 
already expressed.
   > 
   > Wdyt? @jzonthemtn @rzo1
   > 
   > (A) Removal (PR as is now), or (B) Replacement with concurrent-capable 
custom deduplicator?
   
   (B) looks interesting, +1 for trying that out, and for @rzo1 suggestion to 
add that on top of this one.
   
   >`CHMInterner`
   
   First time I've seen an intern made in the code (intentionally like that). 
Sounds like a good idea! You do the intern without using the heap. Not sure if 
it will perform as fast as String.intern, but I think it might work! Thanks 
@mawiesne 




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797371#comment-17797371
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858754762

   Let's make it a draft and add B. Saving memory is always a good thing.




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797369#comment-17797369
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858753109

   I am OK with A or B. Feel free to add B on-top of this PR.




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797367#comment-17797367
 ] 

ASF GitHub Bot commented on OPENNLP-421:


mawiesne commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1858750219

   I think, we c/should consider using a `ConcurrentHashMap`-based deduplicator 
approach as a replacement for "intern()" calls, as investigated and explained 
by Shipilev in his blog post here:
   
   https://shipilev.net/jvm/anatomy-quarks/10-string-intern/
   
   The removal of the String deduplication, as found in the code segments of 
this PR, could otherwise lead to more required RAM at runtime - as @kinow 
already expressed.




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797137#comment-17797137
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1857681741

   @mawiesne and myself just started an eval run from the office. Let's see if 
we see anything ;-)




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797054#comment-17797054
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1857359451

   @kinow thanks for the details and the flight recording :-) 
   
   I think, that the current use of String interning was an attempt to optimize 
string comparisons by using the string constant pool (until it explodes). I 
guess, that the idea was also to save memory.
   
   There is a really nice blog article which summarizes the pros/cons: 
https://www.codecentric.de/wissens-hub/blog/save-memory-by-using-string-intern-in-java
   
   The state, that 
   
   > Only use String.intern() on Strings you know are occurring multiple times, 
and only do it to save memory
   
   I think, that we can remove a lot of the current object creation related to 
the use of `StringList`, `StringListWrapper` or the creation of `StringList` 
objects just for the sake of running a `contains(...)`, which should make up 
for the overhead due to the removal of String interning. 
   




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796940#comment-17796940
 ] 

ASF GitHub Bot commented on OPENNLP-421:


kinow commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1856892659

   Thanks for the GC numbers. The new code seems a bit heavier on the GC and 
mem allocations, but I think that was expected.
   
   A bit late here, but I decided to give it a try. Excuse if I say anything 
silly, I can check these results again tomorrow after having some :coffee: . I 
opened your branch on IntelliJ, then made the jmh profile active by default 
(how do you tell IntelliJ to include the deps of a profile? anywho), 
invalidated cache, etc..
   
   Executed the tests just to make sure they were running, as you already 
provided the results. Then executed with the profiler, to generate the `.jfr` 
file. Saved the `jmh` source folder, checked out `main`, then patched the 
`pom.xml` files, and executed the same test again, to get another `.jfr` for 
the `main` branch (so earlier file is for this branch, older is for main).
   
   Opened both in IntelliJ and selected the `.jfr` file from this branch to 
compare with the one from `main`. Not sure if they show a fair comparison as 
the difference in throughput might exacerbate some number, but in case you find 
it useful, @rzo1 . 
   
   ## CPU samples
   
   
![flamegraph](https://github.com/apache/opennlp/assets/304786/5d715064-5ea0-4dbf-8d89-f8c277ad0a1f)
   
   I think it just confirms the change of calls to intern functions.
   
   
![image](https://github.com/apache/opennlp/assets/304786/26235942-e711-4ed3-b242-18098b9ddca1)
   
   Also the increase in GC calls.
   
   
![image](https://github.com/apache/opennlp/assets/304786/759928d2-1674-426a-a80a-24a39bf90ff8)
   
   And I think the Random calls are from JMH for having more/less samples due 
to higher throughput.
   
   
![image](https://github.com/apache/opennlp/assets/304786/a0eb0c07-8feb-421f-bd1d-01f56135479e)
   
   ## Memory allocations
   
   
![flamegraph](https://github.com/apache/opennlp/assets/304786/7e20ff7b-3b5d-45be-9984-67229c9441c6)
   
   With a lot more Strings, as expected, as well as bytes due to the array copy 
calls.
   
   
![image](https://github.com/apache/opennlp/assets/304786/d3f203ff-1bbc-4d33-8c70-c2ed93861443)
   
   
![image](https://github.com/apache/opennlp/assets/304786/43ba112d-2402-4c68-966f-f2e335334f74)
   
   Unfortunately I don't have any better or more practical test to compare 
memory (maybe a single run of a large example would be better for comparing 
memory, instead of multiple jmh runs…). But I don't think this should really be 
a problem. I wonder if that was some premature optimization or if at some point 
someone had memory issues. But a change that's easy to revert if needed, and 
that way users won't have to modify heap settings. So I am inclined to approve 
it! :tada: 
   
   Will wait until tomorrow so you, Martin, Jeff, others had some time to 
review :sleeping_bed: 
   
   Thanks!
   
   Bruno




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every 

[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796894#comment-17796894
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1856512985

   Added GC Profiling output to JMH. Here it is
   
   # Old Code
   
   ```
   BenchmarkMode  Cnt   
  Score   Error   Units
   StringListBenchmark.newWithArrayConstructor thrpt   10   
  1,104 ± 0,311   ops/s
   StringListBenchmark.newWithArrayConstructor:gc.alloc.rate   thrpt   10   
 73,739 ±14,797  MB/sec
   StringListBenchmark.newWithArrayConstructor:gc.alloc.rate.norm  thrpt   10  
92007982,187 ± 37715,232B/op
   StringListBenchmark.newWithArrayConstructor:gc.countthrpt   10   
 68,000  counts
   StringListBenchmark.newWithArrayConstructor:gc.time thrpt   10   
   4396,000  ms
   ```
   
   
   # New Code 
   
   ```
   BenchmarkMode  Cnt   
  Score Error   Units
   StringListBenchmark.newWithArrayConstructor thrpt   10   
471,103 ± 170,077   ops/s
   StringListBenchmark.newWithArrayConstructor:gc.alloc.rate   thrpt   10   
443,527 ±  37,567  MB/sec
   StringListBenchmark.newWithArrayConstructor:gc.alloc.rate.norm  thrpt   10  
9242,655 ±   0,953B/op
   StringListBenchmark.newWithArrayConstructor:gc.countthrpt   10   
277,000counts
   StringListBenchmark.newWithArrayConstructor:gc.time thrpt   10   
  14432,000ms
   ``




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796893#comment-17796893
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on code in PR #568:
URL: https://github.com/apache/opennlp/pull/568#discussion_r1427220452


##
opennlp-tools/src/jmh/java/opennlp/tools/jmh/ExecutionPlan.java:
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.jmh;
+
+import java.util.Random;
+
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+
+@State(Scope.Benchmark)
+public class ExecutionPlan {
+
+  private static final Random RANDOM = new Random(42);
+
+  public final String[] strings = new String[1_000_000];
+
+  @Setup(Level.Invocation)
+  public void setUp() {
+for (int i = 0; i < 1_000_000; i++) {
+  strings[i] = generateRandomString(15);
+}
+  }
+
+  private static String generateRandomString(int length) {
+final String characters = 
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

Review Comment:
   did





> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796886#comment-17796886
 ] 

ASF GitHub Bot commented on OPENNLP-421:


mawiesne commented on code in PR #568:
URL: https://github.com/apache/opennlp/pull/568#discussion_r1427176318


##
opennlp-tools/src/jmh/java/opennlp/tools/jmh/ExecutionPlan.java:
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.jmh;
+
+import java.util.Random;
+
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+
+@State(Scope.Benchmark)
+public class ExecutionPlan {
+
+  private static final Random RANDOM = new Random(42);
+
+  public final String[] strings = new String[1_000_000];
+
+  @Setup(Level.Invocation)
+  public void setUp() {
+for (int i = 0; i < 1_000_000; i++) {
+  strings[i] = generateRandomString(15);
+}
+  }
+
+  private static String generateRandomString(int length) {
+final String characters = 
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

Review Comment:
   Can we make this a constant? By nature, it's pretty "stable", yet recreated 
every time the private helper method is called.





> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian 

[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796745#comment-17796745
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1855883632

   > If I understood the issue and read the benchmark gist results correctly, 
with the new code JMH benchmark starts faster, then the old code catchs up and 
plateaus around 1.2K ops/s. But in overall the non interned version still 
process more ops/s, with the advantage of not having the JVM max memory issue.
   
   It is `,`, so we have (if you look at the mean values):
   
   - 1,211 ± 0,149  ops/s (old code)
   
   vs.
   
   - 468,404 ± 140,482  ops/s (new code)
   
   Systemarray.copy(...) is the reason for the speed improvement from merely 1 
ops/s to 328 op/s (worst case)




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 
> overhead would more than make up for interning strings.  Should I create a 
> new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796739#comment-17796739
 ] 

ASF GitHub Bot commented on OPENNLP-421:


kinow commented on code in PR #568:
URL: https://github.com/apache/opennlp/pull/568#discussion_r1426703504


##
opennlp-tools/src/jmh/java/opennlp/tools/jmh/ExecutionPlan.java:
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.jmh;
+
+import java.util.Random;
+
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+
+@State(Scope.Benchmark)
+public class ExecutionPlan {
+
+  private static final Random RANDOM = new Random(42);
+
+  public final String[] strings = new String[1_000_000];
+
+  @Setup(Level.Invocation)
+  public void setUp() {
+for (int i = 0; i < 1_000_000; i++) {
+  strings[i] = generateRandomString(15);
+}
+  }
+
+  private static String generateRandomString(int length) {
+final String characters = 
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
+final StringBuilder randomString = new StringBuilder();
+
+for (int i = 0; i < length; i++) {
+  int index = RANDOM.nextInt(characters.length());
+  randomString.append(characters.charAt(index));
+}
+
+return randomString.toString();
+  }
+}

Review Comment:
   Missing newline, but not important.





> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set, which 
> wraps StringList, which wraps Array.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome.  :)  Removing some of the object 

[jira] [Commented] (OPENNLP-421) Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796698#comment-17796698
 ] 

ASF GitHub Bot commented on OPENNLP-421:


rzo1 opened a new pull request, #568:
URL: https://github.com/apache/opennlp/pull/568

   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP- where  is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [x] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   
   I found 
[OPENNLP-421](https://issues.apache.org/jira/projects/OPENNLP/issues/OPENNLP-421)
 in our issue tracker and added a simple JMH benchmark to look into the impact 
of interning strings and our multiple copy-operations in `StringList`.
   
   - Here is a JMH Benchmark for the vanilla version of OpenNLP: 
https://gist.github.com/rzo1/d2ba3e48c6bc190977baf9ee42388823
   
   - Here is a JMH Benchmark with String interning removed from `StringList` 
and usage of `System.arraycopy(...)`: 
https://gist.github.com/rzo1/327c84ebac18b62baf927f1a87ec7480
   
   
   As stated by the original issue opener, String interning was most likely 
used because of:
   
   > Presumably this is an attempt to reduce memory usage for duplicate tokens. 
Interned Strings are stored in the JVM's permanent generation, which has a 
small fixed size (seems to be about 83 MB on modern 64-bit JVMs: 
https://www.oracle.com/java/technologies/javase/vmoptions-jsp.html)
   
   So if we have huge `StringLists` or `Dictionaries`, users need to increase 
`-XX:MaxPermSize=` option to avoid an OutOfMemoryException. 
   
   There might also be room for improvement in the `Dictionary` implemention as 
we mess around with `StringLists`, `StringListWrappers`, etc. - but that would 
be something to look into next.
   
   Just want to have some thoughts on the interning removal here ;-)
   




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> --
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Name Finder
>Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
>Reporter: Jay Hacker
>Assignee: Richard Zowalla
>Priority: Minor
>  Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String.  Presumably this is an attempt to reduce 
> memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
>   Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen 
> space.  
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM.  However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:// of which 
> there