[ https://issues.apache.org/jira/browse/OPENNLP-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746440#comment-17746440 ]
ASF GitHub Bot commented on OPENNLP-1505:
-----------------------------------------

rzo1 commented on code in PR #543:
URL: https://github.com/apache/opennlp/pull/543#discussion_r1272217185


##########
opennlp-tools/src/main/java/opennlp/tools/util/StringUtil.java:
##########
@@ -79,6 +81,30 @@ public static String toLowerCase(CharSequence string) {
     return new String(cp, 0, cp.length);
   }
 
+  public static CharBuffer toLowerCaseCharBuffer(CharSequence sequence) {
+    CharBuffer result = CharBuffer.allocate(sequence.length());
+    for (int cp : sequence.codePoints().toArray()) {
+      for (char c : Character.toChars(Character.toLowerCase(cp))) {
+        result.append(c);
+      }
+    }
+    result.clear();
+    return result;
+  }
+
+  /*
+  public static CharBuffer toLowerCaseCharBuffer(CharSequence string) {

Review Comment:
   Old code? Should be removed before merging this.



##########
opennlp-tools/src/main/java/opennlp/tools/util/StringUtil.java:
##########
@@ -79,6 +81,30 @@ public static String toLowerCase(CharSequence string) {
     return new String(cp, 0, cp.length);
   }
 
+  public static CharBuffer toLowerCaseCharBuffer(CharSequence sequence) {
+    CharBuffer result = CharBuffer.allocate(sequence.length());
+    for (int cp : sequence.codePoints().toArray()) {
+      for (char c : Character.toChars(Character.toLowerCase(cp))) {
+        result.append(c);
+      }
+    }
+    result.clear();
+    return result;
+  }
+
+  /*
+  public static CharBuffer toLowerCaseCharBuffer(CharSequence string) {

Review Comment:
   Experimental code? Should be removed before merging this.


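For readers following the diff above, here is a minimal, self-contained sketch of the buffer-reuse idea the issue describes. It is not the code from PR #543: the class name LowerCaseBufferSketch and the method lowerCaseInto are hypothetical and exist only for illustration. The sketch assumes the caller supplies a CharBuffer large enough for the lower-cased text, and it uses flip() rather than clear() after writing, so that the limit tracks what was actually written when the buffer is larger than the input.

```java
import java.nio.CharBuffer;

public class LowerCaseBufferSketch {

  // Hypothetical helper (not part of OpenNLP): lower-cases `input` into the
  // caller-supplied `target` and returns it ready for reading.
  // Throws java.nio.BufferOverflowException if `target` is too small.
  public static CharBuffer lowerCaseInto(CharSequence input, CharBuffer target) {
    target.clear(); // position = 0, limit = capacity: ready to be refilled
    int i = 0;
    while (i < input.length()) {
      int cp = Character.codePointAt(input, i);
      i += Character.charCount(cp);
      int lower = Character.toLowerCase(cp);
      if (Character.isBmpCodePoint(lower)) {
        target.put((char) lower);
      } else {
        // Supplementary code points need a surrogate pair.
        target.put(Character.highSurrogate(lower));
        target.put(Character.lowSurrogate(lower));
      }
    }
    target.flip(); // limit = chars written, position = 0
    return target;
  }

  public static void main(String[] args) {
    // One allocation, reused for every token instead of a new String per call.
    CharBuffer reusable = CharBuffer.allocate(16);
    for (String token : new String[] {"OpenNLP", "NGram", "LangDetect"}) {
      System.out.println(lowerCaseInto(token, reusable));
    }
  }
}
```

Note that the returned buffer is overwritten by the next call, so a caller that needs to keep a result (for example as a map key) would have to copy it first.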
> Reduce object creation in NGramCharModel and StringUtil
> -------------------------------------------------------
>
>                 Key: OPENNLP-1505
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1505
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Language Detector
>    Affects Versions: 2.2.0
>            Reporter: Martin Wiesner
>            Assignee: Martin Wiesner
>            Priority: Major
>             Fix For: 2.2.1
>
>
> During a profiling session, I noticed that many tests in
> opennlp.tools.langdetect take quite some time to execute. Digging deeper
> into those tests, it quickly became obvious that StringUtil#toLowerCase() was
> creating a new String on every call (see NGramCharModel#add(...),
> lines 99 to 108).
> Because this method is called quite frequently in NGramCharModel, this
> resulted in the creation of millions of String objects while building
> ngrams for a given input.
> Aims:
> * Reduce object creation, and thus the creation of millions of String objects
> * Improve runtime of the langdetect tests (and potentially others)
> Idea:
> * Use (Heap)CharBuffer instead of String so that underlying char arrays can
> be re-used, instead of copying the chars over to a new string for each
> "toLowerCase"...
> Note:
> * A corresponding patch / PR should be tested with/against the Evaluation
> suite.
> Comments welcome.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)