[jira] [Created] (JEXL-402) parse failed with empty return value.
Xu Pengcheng created JEXL-402: - Summary: parse failed with empty return value. Key: JEXL-402 URL: https://issues.apache.org/jira/browse/JEXL-402 Project: Commons JEXL Issue Type: Bug Affects Versions: 3.3 Reporter: Xu Pengcheng Code: {code:java} function foo() { return; }{code} got exception: parsing error in '{' but {code:java} function foo() { return null; }{code} is ok -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [commons-text] garydgregory merged pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity
garydgregory merged PR #452: URL: https://github.com/apache/commons-text/pull/452 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [commons-text] garydgregory commented on pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity
garydgregory commented on PR #452: URL: https://github.com/apache/commons-text/pull/452#issuecomment-1687038018 https://issues.apache.org/jira/projects/TEXT/issues/TEXT-228 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines
[ https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757048#comment-17757048 ] Gary D. Gregory commented on TEXT-228: -- Ah, I see that AbstractStringBuilder throws OME in Java 11, so I now see it would be OK for us to do so in this context. > StringTokenizer performance degradation when parsing large lines > > > Key: TEXT-228 > URL: https://issues.apache.org/jira/browse/TEXT-228 > Project: Commons Text > Issue Type: Bug >Affects Versions: 1.9, 1.10.0 > Environment: Linux >Reporter: Zack Hable >Priority: Minor > > After recently upgrading from Apache Commons Text 1.9 to 1.10.0 we've noticed > our system "hangs" (or likely will take an excessively long time to process) > large lines (100MB+ in size) when splitting strings with StringTokenizer. > > Mitigation: Revert to Apache Commons Text 1.9 > > Scala version: > > {code:java} > > scala -version > Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and > Lightbend, Inc. > {code} > > Java version: > > {code:java} > > java -version > openjdk version "1.8.0_382" > OpenJDK Runtime Environment (build 1.8.0_382-b05) > OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode) > {code} > > > Reproduction Steps: > # Generate a sample large file > {code:java} > echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile > dd if=/dev/zero bs=100MB count=1 >> largefile > sed -ie "s/\x0/0/g" largefile > echo -n "\0" >> largefile > {code} > # Setup reproduce.scala > {code:java} > import org.apache.commons.text.StringTokenizer > val lines = scala.io.Source.fromFile("./largefile").getLines.toList > val st: StringTokenizer = new StringTokenizer(lines(0)) > val res = st.getTokenArray() > {code} > # Download Apache Commons Jars > {code:java} > wget > https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar > wget > https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar > {code} > # Run program with a 10 second timeout (1.10 seems to hang for >1 minute) > {code:java} > > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala > timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala 2.60s > user 0.83s system 121% cpu 2.818 total > > > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala > timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala 0.02s > user 0.00s system 0% cpu 10.002 total > {code} > As you notice above 1.9 takes ~3 seconds whereas 1.10 times out after 10 > seconds. I haven't come across a definite amount of time 1.10 takes, but it > seems to run for >1 minute -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines
[ https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756957#comment-17756957 ] Alex Herbert commented on TEXT-228: --- Note: Running the 3 new memory tests locally requires 10G for the surefire forked JVM: {code:java} mvn test -Dtest=TextStringBuilderTest -DargLine=-Xmx10G mvn test -Dtest=TextStringBuilderTest -DargLine=-Xmx9G // Skips 1 mvn test -Dtest=TextStringBuilderTest -DargLine=-Xmx4G // Skips 2 mvn test -Dtest=TextStringBuilderTest -DargLine=-Xmx3G // Skips 3 {code} > StringTokenizer performance degradation when parsing large lines > > > Key: TEXT-228 > URL: https://issues.apache.org/jira/browse/TEXT-228 > Project: Commons Text > Issue Type: Bug >Affects Versions: 1.9, 1.10.0 > Environment: Linux >Reporter: Zack Hable >Priority: Minor > > After recently upgrading from Apache Commons Text 1.9 to 1.10.0 we've noticed > our system "hangs" (or likely will take an excessively long time to process) > large lines (100MB+ in size) when splitting strings with StringTokenizer. > > Mitigation: Revert to Apache Commons Text 1.9 > > Scala version: > > {code:java} > > scala -version > Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and > Lightbend, Inc. > {code} > > Java version: > > {code:java} > > java -version > openjdk version "1.8.0_382" > OpenJDK Runtime Environment (build 1.8.0_382-b05) > OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode) > {code} > > > Reproduction Steps: > # Generate a sample large file > {code:java} > echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile > dd if=/dev/zero bs=100MB count=1 >> largefile > sed -ie "s/\x0/0/g" largefile > echo -n "\0" >> largefile > {code} > # Setup reproduce.scala > {code:java} > import org.apache.commons.text.StringTokenizer > val lines = scala.io.Source.fromFile("./largefile").getLines.toList > val st: StringTokenizer = new StringTokenizer(lines(0)) > val res = st.getTokenArray() > {code} > # Download Apache Commons Jars > {code:java} > wget > https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar > wget > https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar > {code} > # Run program with a 10 second timeout (1.10 seems to hang for >1 minute) > {code:java} > > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala > timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala 2.60s > user 0.83s system 121% cpu 2.818 total > > > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala > timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala 0.02s > user 0.00s system 0% cpu 10.002 total > {code} > As you notice above 1.9 takes ~3 seconds whereas 1.10 times out after 10 > seconds. I haven't come across a definite amount of time 1.10 takes, but it > seems to run for >1 minute -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [commons-text] aherbert commented on a diff in pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity
aherbert commented on code in PR #452: URL: https://github.com/apache/commons-text/pull/452#discussion_r1300134930 ## src/main/java/org/apache/commons/text/TextStringBuilder.java: ## @@ -1820,17 +1832,84 @@ public boolean endsWith(final String str) { /** * Tests the capacity and ensures that it is at least the size specified. * + * Note: This method can be used to minimise memory reallocations during + * repeated addition of values by pre-allocating the character buffer. + * The method ignores a negative {@code capacity} argument. + * * @param capacity the capacity to ensure * @return this, to enable chaining + * @throws OutOfMemoryError if the capacity cannot be allocated */ public TextStringBuilder ensureCapacity(final int capacity) { -// checks for overflow -if (capacity > 0 && capacity - buffer.length > 0) { -reallocate(capacity); +if (capacity > 0) { +ensureCapacityInternal(capacity); } return this; } +/** + * Ensure that the buffer is at least the size specified. The {@code capacity} argument + * is treated as an unsigned integer. + * + * This method will raise an {@link OutOfMemoryError} if the capacity is too large + * for an array, or cannot be allocated. + * + * @param capacity the capacity to ensure + * @throws OutOfMemoryError if the capacity cannot be allocated + */ +private void ensureCapacityInternal(final int capacity) { +// Check for overflow of the current buffer. +// Assumes capacity is an unsigned integer up to Integer.MAX_VALUE * 2 +// (the largest possible addition of two maximum length arrays). +if (capacity - buffer.length > 0) { +resizeBuffer(capacity); +} +} + +/** + * Resizes the buffer to at least the size specified. + * + * @param minCapacity the minimum required capacity + * @throws OutOfMemoryError if the {@code minCapacity} is negative + */ +private void resizeBuffer(final int minCapacity) { +// Overflow-conscious code treats the min and new capacity as unsigned. +final int oldCapacity = buffer.length; +int newCapacity = oldCapacity * 2; +if (Integer.compareUnsigned(newCapacity, minCapacity) < 0) { +newCapacity = minCapacity; +} +if (Integer.compareUnsigned(newCapacity, MAX_BUFFER_SIZE) > 0) { +newCapacity = createPositiveCapacity(minCapacity); +} +reallocate(newCapacity); +} + +/** + * Create a positive capacity at least as large the minimum required capacity. + * If the minimum capacity is negative then this throws an OutOfMemoryError as no array + * can be allocated. + * + * @param minCapacity the minimum capacity + * @return the capacity + * @throws OutOfMemoryError if the {@code minCapacity} is negative + */ +private static int createPositiveCapacity(final int minCapacity) { +if (minCapacity < 0) { +// overflow +throw new OutOfMemoryError("Unable to allocate array size: " + Integer.toUnsignedString(minCapacity)); Review Comment: We throw an OOM in the code I adapted from Commons Codec. Disclaimer: I wrote that code so I am only copying my own precedent. However this is the behaviour of JDK's StringBuilder. If you try to add to an array such that it would not fit in a single array you will get an OutOfMemoryError. Since we are billing the TextStringBuilder as a replacement for that then it seems appropriate to copy the behaviour. Regarding recovery by the user. This would be a situation where you are adding to the StringBuilder and then you get an out-of-memory error. So you have a situation where a builder cannot create a single String to hold all the data. The recovery is to either try to process smaller chunks of the full chars; or accept that you have to truncate the single String. If we start throwing e.g. IAE here then a user will not be able to use the same try-catch with our object that they use with a JDK StringBuilder. The catch must be modified and we lose the drop-in-replacement pattern. Note: I just updated my extra tests to use JDK's StringBuilder. They pass. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines
[ https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756923#comment-17756923 ] Alex Herbert commented on TEXT-228: --- Throwing an OutOfMemoryError is reasonable practice when writing objects that allocate memory. In this case the error will arise when new chars are to added the char buffer managed by the TextStringBuilder. Whether the error arises inside Arrays.copyOf(buffer, newLength) or we throw one because we know the newLength is not possible to allocate the behaviour to the downstream user is identical. They tried to increase the size of the TextStringBuilder and ran out of memory. If we change to throwing a different runtime exception I do not see that information being useful for the user. They did something that was not possible within the memory limits of the JVM. PR for review here: [PR 452|https://github.com/apache/commons-text/pull/452] I adapted code from Commons Codec where we do a similar buffer management in the BaseNCodec for a byte[] buffer. I expect the tests that push the memory to the limit to be skipped in the CI. They do run locally if you have enough memory. The smallest memory footprint test still requires at least 2GiB. This may be executed. > StringTokenizer performance degradation when parsing large lines > > > Key: TEXT-228 > URL: https://issues.apache.org/jira/browse/TEXT-228 > Project: Commons Text > Issue Type: Bug >Affects Versions: 1.9, 1.10.0 > Environment: Linux >Reporter: Zack Hable >Priority: Minor > > After recently upgrading from Apache Commons Text 1.9 to 1.10.0 we've noticed > our system "hangs" (or likely will take an excessively long time to process) > large lines (100MB+ in size) when splitting strings with StringTokenizer. > > Mitigation: Revert to Apache Commons Text 1.9 > > Scala version: > > {code:java} > > scala -version > Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and > Lightbend, Inc. > {code} > > Java version: > > {code:java} > > java -version > openjdk version "1.8.0_382" > OpenJDK Runtime Environment (build 1.8.0_382-b05) > OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode) > {code} > > > Reproduction Steps: > # Generate a sample large file > {code:java} > echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile > dd if=/dev/zero bs=100MB count=1 >> largefile > sed -ie "s/\x0/0/g" largefile > echo -n "\0" >> largefile > {code} > # Setup reproduce.scala > {code:java} > import org.apache.commons.text.StringTokenizer > val lines = scala.io.Source.fromFile("./largefile").getLines.toList > val st: StringTokenizer = new StringTokenizer(lines(0)) > val res = st.getTokenArray() > {code} > # Download Apache Commons Jars > {code:java} > wget > https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar > wget > https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar > {code} > # Run program with a 10 second timeout (1.10 seems to hang for >1 minute) > {code:java} > > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala > timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala 2.60s > user 0.83s system 121% cpu 2.818 total > > > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala > timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala 0.02s > user 0.00s system 0% cpu 10.002 total > {code} > As you notice above 1.9 takes ~3 seconds whereas 1.10 times out after 10 > seconds. I haven't come across a definite amount of time 1.10 takes, but it > seems to run for >1 minute -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [commons-text] garydgregory commented on a diff in pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity
garydgregory commented on code in PR #452: URL: https://github.com/apache/commons-text/pull/452#discussion_r1300107300 ## src/main/java/org/apache/commons/text/TextStringBuilder.java: ## @@ -1820,17 +1832,84 @@ public boolean endsWith(final String str) { /** * Tests the capacity and ensures that it is at least the size specified. * + * Note: This method can be used to minimise memory reallocations during + * repeated addition of values by pre-allocating the character buffer. + * The method ignores a negative {@code capacity} argument. + * * @param capacity the capacity to ensure * @return this, to enable chaining + * @throws OutOfMemoryError if the capacity cannot be allocated */ public TextStringBuilder ensureCapacity(final int capacity) { -// checks for overflow -if (capacity > 0 && capacity - buffer.length > 0) { -reallocate(capacity); +if (capacity > 0) { +ensureCapacityInternal(capacity); } return this; } +/** + * Ensure that the buffer is at least the size specified. The {@code capacity} argument + * is treated as an unsigned integer. + * + * This method will raise an {@link OutOfMemoryError} if the capacity is too large + * for an array, or cannot be allocated. + * + * @param capacity the capacity to ensure + * @throws OutOfMemoryError if the capacity cannot be allocated + */ +private void ensureCapacityInternal(final int capacity) { +// Check for overflow of the current buffer. +// Assumes capacity is an unsigned integer up to Integer.MAX_VALUE * 2 +// (the largest possible addition of two maximum length arrays). +if (capacity - buffer.length > 0) { +resizeBuffer(capacity); +} +} + +/** + * Resizes the buffer to at least the size specified. + * + * @param minCapacity the minimum required capacity + * @throws OutOfMemoryError if the {@code minCapacity} is negative + */ +private void resizeBuffer(final int minCapacity) { +// Overflow-conscious code treats the min and new capacity as unsigned. +final int oldCapacity = buffer.length; +int newCapacity = oldCapacity * 2; +if (Integer.compareUnsigned(newCapacity, minCapacity) < 0) { +newCapacity = minCapacity; +} +if (Integer.compareUnsigned(newCapacity, MAX_BUFFER_SIZE) > 0) { +newCapacity = createPositiveCapacity(minCapacity); +} +reallocate(newCapacity); +} + +/** + * Create a positive capacity at least as large the minimum required capacity. + * If the minimum capacity is negative then this throws an OutOfMemoryError as no array + * can be allocated. + * + * @param minCapacity the minimum capacity + * @return the capacity + * @throws OutOfMemoryError if the {@code minCapacity} is negative + */ +private static int createPositiveCapacity(final int minCapacity) { +if (minCapacity < 0) { +// overflow +throw new OutOfMemoryError("Unable to allocate array size: " + Integer.toUnsignedString(minCapacity)); Review Comment: Hi @aherbert Thank you for this PR :-) I think this should be an `IllegalArgumentException` because the JVM has NOT in fact run out of memory and a call site up the stack may be able to recover from the situation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [commons-text] aherbert opened a new pull request, #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity
aherbert opened a new pull request, #452: URL: https://github.com/apache/commons-text/pull/452 This fixes a performance regression where allocation was only using the exact size necessary for the additional chars (resulting in buffer reallocation on each append). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)
theshoeshiner commented on code in PR #450: URL: https://github.com/apache/commons-text/pull/450#discussion_r133105 ## src/main/java/org/apache/commons/text/cases/CamelCase.java: ## @@ -0,0 +1,128 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.commons.text.cases; + +import java.util.LinkedList; +import java.util.List; + +import org.apache.commons.lang3.CharUtils; +import org.apache.commons.lang3.StringUtils; + +/** + * Case implementation that parses and formats strings of the form 'myCamelCase' + * + * This case separates tokens on uppercase ascii alpha characters, with the exception + * that the first token begin with a lowercase ascii alpha character. + * + */ +public class CamelCase implements Case { + +/** constant reuseable instance of this case. */ +public static final CamelCase INSTANCE = new CamelCase(); + +/** + * Constructs new CamelCase instance. + */ +public CamelCase() { +super(); +} + +/** + * Parses string tokens from a Camel Case formatted string. + * + * Parses each character of the string parameter and creates new tokens when uppercase ascii + * letters are encountered. The upppercase letter is considered part of the new token. The very + * first character of the string is an exception to this rule and must be a lowercase ascii Review Comment: It enforces the rule that the input be a camel cased string so that `format(parse()) == ` holds true. Not enforcing that rule would be a form of "loose" parsing, as opposed to the current strict logic which expects the user to know the case of the parsed string. Not enforcing this rule essentially makes this a pascal case parser, which already exists, and the API should let the user know that they're not using the correct ~parser~ Case for parsing. If we want to implement separate strict vs loose parsing then I'll give some more thought about how to expose that in a way that makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)
theshoeshiner commented on code in PR #450: URL: https://github.com/apache/commons-text/pull/450#discussion_r133105 ## src/main/java/org/apache/commons/text/cases/CamelCase.java: ## @@ -0,0 +1,128 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.commons.text.cases; + +import java.util.LinkedList; +import java.util.List; + +import org.apache.commons.lang3.CharUtils; +import org.apache.commons.lang3.StringUtils; + +/** + * Case implementation that parses and formats strings of the form 'myCamelCase' + * + * This case separates tokens on uppercase ascii alpha characters, with the exception + * that the first token begin with a lowercase ascii alpha character. + * + */ +public class CamelCase implements Case { + +/** constant reuseable instance of this case. */ +public static final CamelCase INSTANCE = new CamelCase(); + +/** + * Constructs new CamelCase instance. + */ +public CamelCase() { +super(); +} + +/** + * Parses string tokens from a Camel Case formatted string. + * + * Parses each character of the string parameter and creates new tokens when uppercase ascii + * letters are encountered. The upppercase letter is considered part of the new token. The very + * first character of the string is an exception to this rule and must be a lowercase ascii Review Comment: It enforces the rule that the input be a camel cased string so that `format(parse()) == ` holds true. Not enforcing that rule would be a form of "loose" parsing, as opposed to the current strict logic which expects the user to know the case of the parsed string. Not enforcing this rule essentially makes this a pascal case parser, which already exists, and the API should let the user know that they're not using the correct Case. If we want to implement separate strict vs loose parsing then I'll give some more thought about how to expose that in a way that makes sense. ## src/main/java/org/apache/commons/text/cases/CamelCase.java: ## @@ -0,0 +1,128 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.commons.text.cases; + +import java.util.LinkedList; +import java.util.List; + +import org.apache.commons.lang3.CharUtils; +import org.apache.commons.lang3.StringUtils; + +/** + * Case implementation that parses and formats strings of the form 'myCamelCase' + * + * This case separates tokens on uppercase ascii alpha characters, with the exception + * that the first token begin with a lowercase ascii alpha character. + * + */ +public class CamelCase implements Case { + +/** constant reuseable instance of this case. */ +public static final CamelCase INSTANCE = new CamelCase(); + +/** + * Constructs new CamelCase instance. + */ +public CamelCase() { +super(); +} + +/** + * Parses string tokens from a Camel Case formatted string. + * + * Parses each character of the string parameter and creates new tokens when uppercase ascii + * letters are encountered. The upppercase letter is considered part of the new token. The very + * first character of the string is an exception to this rule and must be a lowercase ascii Review Comment: It enforces the rule that the input be a camel cased string so that `format(parse()) == ` holds true. Not enforcing that rule would be a form of "loose" parsing, as opposed to the cur
[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)
theshoeshiner commented on code in PR #450: URL: https://github.com/apache/commons-text/pull/450#discussion_r133105 ## src/main/java/org/apache/commons/text/cases/CamelCase.java: ## @@ -0,0 +1,128 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.commons.text.cases; + +import java.util.LinkedList; +import java.util.List; + +import org.apache.commons.lang3.CharUtils; +import org.apache.commons.lang3.StringUtils; + +/** + * Case implementation that parses and formats strings of the form 'myCamelCase' + * + * This case separates tokens on uppercase ascii alpha characters, with the exception + * that the first token begin with a lowercase ascii alpha character. + * + */ +public class CamelCase implements Case { + +/** constant reuseable instance of this case. */ +public static final CamelCase INSTANCE = new CamelCase(); + +/** + * Constructs new CamelCase instance. + */ +public CamelCase() { +super(); +} + +/** + * Parses string tokens from a Camel Case formatted string. + * + * Parses each character of the string parameter and creates new tokens when uppercase ascii + * letters are encountered. The upppercase letter is considered part of the new token. The very + * first character of the string is an exception to this rule and must be a lowercase ascii Review Comment: It enforces the rule that the input be a camel cased string so that `format(parse()) == ` holds true. Not enforcing that rule would be a form of "loose" parsing, as opposed to the current strict logic which expects the user to know the case of the parsed string. Not enforcing this rule essentially makes this a pascal case parser, which already exists, and the API should let the user know that they're not using the correct parser. If we want to implement separate strict vs loose parsing then I'll give some more thought about how to expose that in a way that makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)
theshoeshiner commented on code in PR #450: URL: https://github.com/apache/commons-text/pull/450#discussion_r133105 ## src/main/java/org/apache/commons/text/cases/CamelCase.java: ## @@ -0,0 +1,128 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.commons.text.cases; + +import java.util.LinkedList; +import java.util.List; + +import org.apache.commons.lang3.CharUtils; +import org.apache.commons.lang3.StringUtils; + +/** + * Case implementation that parses and formats strings of the form 'myCamelCase' + * + * This case separates tokens on uppercase ascii alpha characters, with the exception + * that the first token begin with a lowercase ascii alpha character. + * + */ +public class CamelCase implements Case { + +/** constant reuseable instance of this case. */ +public static final CamelCase INSTANCE = new CamelCase(); + +/** + * Constructs new CamelCase instance. + */ +public CamelCase() { +super(); +} + +/** + * Parses string tokens from a Camel Case formatted string. + * + * Parses each character of the string parameter and creates new tokens when uppercase ascii + * letters are encountered. The upppercase letter is considered part of the new token. The very + * first character of the string is an exception to this rule and must be a lowercase ascii Review Comment: It enforces the rule that the input be a camel cased string (so that `format(parse()) == ` holds true). Not enforcing that rule would be a form of "loose" parsing, as opposed to the current strict logic which expects the user to know the case of the parsed string. Not enforcing this rule essentially makes this a pascal case parser, which already exists, and the API should let the user know that they're not using the correct parser. If we want to implement separate strict vs loose parsing then I'll give some more thought about how to expose that in a way that makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines
[ https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756856#comment-17756856 ] Gary D. Gregory commented on TEXT-228: -- Hi [~aherbert] I'm not sure we or any library code should throw {{{}OutOfMemoryError{}}}, which is reserved for the JVM as documented in the error's Javadoc. Instead, an IAE or ISE seems more appropriate. > StringTokenizer performance degradation when parsing large lines > > > Key: TEXT-228 > URL: https://issues.apache.org/jira/browse/TEXT-228 > Project: Commons Text > Issue Type: Bug >Affects Versions: 1.9, 1.10.0 > Environment: Linux >Reporter: Zack Hable >Priority: Minor > > After recently upgrading from Apache Commons Text 1.9 to 1.10.0 we've noticed > our system "hangs" (or likely will take an excessively long time to process) > large lines (100MB+ in size) when splitting strings with StringTokenizer. > > Mitigation: Revert to Apache Commons Text 1.9 > > Scala version: > > {code:java} > > scala -version > Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and > Lightbend, Inc. > {code} > > Java version: > > {code:java} > > java -version > openjdk version "1.8.0_382" > OpenJDK Runtime Environment (build 1.8.0_382-b05) > OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode) > {code} > > > Reproduction Steps: > # Generate a sample large file > {code:java} > echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile > dd if=/dev/zero bs=100MB count=1 >> largefile > sed -ie "s/\x0/0/g" largefile > echo -n "\0" >> largefile > {code} > # Setup reproduce.scala > {code:java} > import org.apache.commons.text.StringTokenizer > val lines = scala.io.Source.fromFile("./largefile").getLines.toList > val st: StringTokenizer = new StringTokenizer(lines(0)) > val res = st.getTokenArray() > {code} > # Download Apache Commons Jars > {code:java} > wget > https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar > wget > https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar > {code} > # Run program with a 10 second timeout (1.10 seems to hang for >1 minute) > {code:java} > > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala > timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala 2.60s > user 0.83s system 121% cpu 2.818 total > > > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala > timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala 0.02s > user 0.00s system 0% cpu 10.002 total > {code} > As you notice above 1.9 takes ~3 seconds whereas 1.10 times out after 10 > seconds. I haven't come across a definite amount of time 1.10 takes, but it > seems to run for >1 minute -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [commons-text] garydgregory commented on pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)
garydgregory commented on PR #450: URL: https://github.com/apache/commons-text/pull/450#issuecomment-1686152071 IMO: You should follow the YAGNI principle and only make public and protected what you must, especially in a first cut. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [commons-build-plugin] garydgregory merged pull request #174: Bump org.apache.ant:ant-launcher from 1.10.12 to 1.10.14
garydgregory merged PR #174: URL: https://github.com/apache/commons-build-plugin/pull/174 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines
[ https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756840#comment-17756840 ] Alex Herbert commented on TEXT-228: --- The performance regression can be traced to commit 808ecca following a discussion in [PR 132|https://github.com/apache/commons-text/pull/132]. This changed the TextStringBuilder.ensureCapacity method from: {code:java} public TextStringBuilder ensureCapacity(final int capacity) { if (capacity > buffer.length) { reallocate(capacity * 2); } return this; } {code} To: {code:java} public TextStringBuilder ensureCapacity(final int capacity) { // checks for overflow if (capacity > 0 && capacity - buffer.length > 0) { reallocate(capacity); // <--- FYI: No capacity doubling here! } return this; } {code} This change *tries* to ensure that the check for capacity is overflow safe. However there are issues: h2. 1. Performance Previously the method would allocate double the required capacity. This is a bit excessive. However it does reduce the number of memory reallocations significantly. The new method will only allocate the memory required. There is {*}no extra buffer allocated{*}. The result is that when appending single characters (as done by the StringTokenizer) every new character requires a copy of the entire character buffer. Note: The overflow check is not the reason for the slowdown. On my machine I can use the original method and the updated method with approximately the same runtime if the updated method is changed to use reallocate(capacity * 2). Also note that this is a specific use case for appending single characters to a buffer. In this case the use of ensureCapacity(size + 1) seems overkill. I changed the append(char) method from: {code:java} public TextStringBuilder append(final char ch) { final int len = length(); ensureCapacity(len + 1); buffer[size++] = ch; return this; } {code} To: {code:java} public TextStringBuilder append(final char ch) { if (size == buffer.length) { reallocate(size * 2); } buffer[size++] = ch; return this; } {code} This made no notable speed difference to the provided StringTokenizer code. However that program has a lot of additional logic performed per character processed from the input by the tokenizer. A JMH test appending single characters to the TextStringBuilder should be used to examine more closely this potential optimisation. h2. 2. No feedback on error The method (still) does not feed back to the user if the requested change to the capacity was successful. Current unit tests actually enforce this behaviour. E.g. calling ensureCapacity(-1) is allowed. This behaviour may appear strange but is the same as the JDK's ArrayList.ensureCapacity. We should add some javadoc to indicate that only positive arguments are recognised by the ensureCapacity method. Negative arguments are currently ignored. This does create one issue in that the method is also used extensively internally when adding new chars. In this case it should raise as error if adding failed. If you try and append to a TextStringBuilder something that is too long for an array buffer, it will not fail with an OutOfMemoryError. It performs a no-op when trying to ensureCapacity and then will fail with an IndexOutOfBoundsException when actually copying the chars. E.g. this test does not pass: {code:java} @Test public void testOutOfMemoryError() { char[] chars = new char[3 * (Integer.MAX_VALUE / 8)]; final TextStringBuilder sb = new TextStringBuilder(); sb.append(chars); sb.append(chars); assertThrows(OutOfMemoryError.class, () -> sb.append(chars)); } {code} h2. 3. Overflow safe The overflow safe code is redundant. In this check: {code:java} if (capacity > 0 && capacity - buffer.length > 0) { {code} The capacity is used as a signed integer, then as an unsigned integer in the overflow safe subtraction to determine if the buffer is long enough. So it is mixing the interpretation of capacity and effectively only performing: {code:java} if (capacity > buffer.length) { {code} So the change in commit 808ecca only solved the issue raised by PR 132 due to the change from: {code:java} reallocate(capacity * 2) {code} to: {code:java} reallocate(capacity) {code} But this introduced a performance regression. h2. Solution What is required: # The existing ensureCapacity behaviour is unchanged. It should no-op when the capacity is negative. # The checks to ensure new chars can be added should be overflow safe. If overflow occurs in the capacity, then an OutOfMemoryError should be raised # Increases in the capacity should ensure that the new capacity can store at a minimum the requested capacity. This should allow extra buffer to reduce reallocations. The solution is to use an internal method to ensure the capacity when a source of k
[GitHub] [commons-text] theshoeshiner commented on pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)
theshoeshiner commented on PR #450: URL: https://github.com/apache/commons-text/pull/450#issuecomment-1686121778 @elharo In regards to a couple of the above changes, is there a reason we want to prevent these classes from being extended? I can imagine many scenarios where one would want to add additional logic to them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org