date:20230821

[jira] [Created] (JEXL-402) parse failed with empty return value.

2023-08-21 Thread Xu Pengcheng (Jira)

Xu Pengcheng created JEXL-402:
-

 Summary: parse failed with empty return value.
 Key: JEXL-402
 URL: https://issues.apache.org/jira/browse/JEXL-402
 Project: Commons JEXL
  Issue Type: Bug
Affects Versions: 3.3
Reporter: Xu Pengcheng


Code:
{code:java}
function foo() {
  return;
}{code}
got exception: parsing error in '{'

 

but
{code:java}
function foo() {
  return null;
}{code}
is ok



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[GitHub] [commons-text] garydgregory merged pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity

2023-08-21 Thread via GitHub



garydgregory merged PR #452:
URL: https://github.com/apache/commons-text/pull/452


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [commons-text] garydgregory commented on pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity

2023-08-21 Thread via GitHub



garydgregory commented on PR #452:
URL: https://github.com/apache/commons-text/pull/452#issuecomment-1687038018

   https://issues.apache.org/jira/projects/TEXT/issues/TEXT-228


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

2023-08-21 Thread Gary D. Gregory (Jira)



[ 
https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757048#comment-17757048
 ] 

Gary D. Gregory commented on TEXT-228:
--

Ah, I see that AbstractStringBuilder throws OME in Java 11, so I now see it 
would be OK for us to do so in this context.

> StringTokenizer performance degradation when parsing large lines
> 
>
> Key: TEXT-228
> URL: https://issues.apache.org/jira/browse/TEXT-228
> Project: Commons Text
>  Issue Type: Bug
>Affects Versions: 1.9, 1.10.0
> Environment: Linux
>Reporter: Zack Hable
>Priority: Minor
>
> After recently upgrading from Apache Commons Text 1.9 to 1.10.0 we've noticed 
> our system "hangs" (or likely will take an excessively long time to process) 
> large lines (100MB+ in size) when splitting strings with StringTokenizer.
>  
> Mitigation: Revert to Apache Commons Text 1.9
>  
> Scala version:
>  
> {code:java}
> > scala -version
> Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and 
> Lightbend, Inc.
> {code}
>  
> Java version:
>  
> {code:java}
> > java -version 
> openjdk version "1.8.0_382"
> OpenJDK Runtime Environment (build 1.8.0_382-b05)
> OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
> {code}
>  
>  
> Reproduction Steps:
>  # Generate a sample large file
> {code:java}
> echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
> dd if=/dev/zero bs=100MB count=1 >> largefile 
> sed -ie "s/\x0/0/g" largefile
> echo -n "\0" >> largefile
> {code}
>  # Setup reproduce.scala
> {code:java}
> import org.apache.commons.text.StringTokenizer
> val lines = scala.io.Source.fromFile("./largefile").getLines.toList
> val st: StringTokenizer = new StringTokenizer(lines(0))
> val res = st.getTokenArray()
> {code}
>  # Download Apache Commons Jars
> {code:java}
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
> {code}
>  # Run program with a 10 second timeout (1.10 seems to hang for >1 minute)
> {code:java}
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala  2.60s 
> user 0.83s system 121% cpu 2.818 total
>  
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala  0.02s 
> user 0.00s system 0% cpu 10.002 total
> {code}
> As you notice above 1.9 takes ~3 seconds whereas 1.10 times out after 10 
> seconds.  I haven't come across a definite amount of time 1.10 takes, but it 
> seems to run for >1 minute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

2023-08-21 Thread Alex Herbert (Jira)



[ 
https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756957#comment-17756957
 ] 

Alex Herbert commented on TEXT-228:
---

Note: Running the 3 new memory tests locally requires 10G for the surefire 
forked JVM:
{code:java}
mvn test -Dtest=TextStringBuilderTest -DargLine=-Xmx10G
mvn test -Dtest=TextStringBuilderTest -DargLine=-Xmx9G   // Skips 1
mvn test -Dtest=TextStringBuilderTest -DargLine=-Xmx4G   // Skips 2
mvn test -Dtest=TextStringBuilderTest -DargLine=-Xmx3G   // Skips 3
{code}

> StringTokenizer performance degradation when parsing large lines
> 
>
> Key: TEXT-228
> URL: https://issues.apache.org/jira/browse/TEXT-228
> Project: Commons Text
>  Issue Type: Bug
>Affects Versions: 1.9, 1.10.0
> Environment: Linux
>Reporter: Zack Hable
>Priority: Minor
>
> After recently upgrading from Apache Commons Text 1.9 to 1.10.0 we've noticed 
> our system "hangs" (or likely will take an excessively long time to process) 
> large lines (100MB+ in size) when splitting strings with StringTokenizer.
>  
> Mitigation: Revert to Apache Commons Text 1.9
>  
> Scala version:
>  
> {code:java}
> > scala -version
> Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and 
> Lightbend, Inc.
> {code}
>  
> Java version:
>  
> {code:java}
> > java -version 
> openjdk version "1.8.0_382"
> OpenJDK Runtime Environment (build 1.8.0_382-b05)
> OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
> {code}
>  
>  
> Reproduction Steps:
>  # Generate a sample large file
> {code:java}
> echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
> dd if=/dev/zero bs=100MB count=1 >> largefile 
> sed -ie "s/\x0/0/g" largefile
> echo -n "\0" >> largefile
> {code}
>  # Setup reproduce.scala
> {code:java}
> import org.apache.commons.text.StringTokenizer
> val lines = scala.io.Source.fromFile("./largefile").getLines.toList
> val st: StringTokenizer = new StringTokenizer(lines(0))
> val res = st.getTokenArray()
> {code}
>  # Download Apache Commons Jars
> {code:java}
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
> {code}
>  # Run program with a 10 second timeout (1.10 seems to hang for >1 minute)
> {code:java}
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala  2.60s 
> user 0.83s system 121% cpu 2.818 total
>  
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala  0.02s 
> user 0.00s system 0% cpu 10.002 total
> {code}
> As you notice above 1.9 takes ~3 seconds whereas 1.10 times out after 10 
> seconds.  I haven't come across a definite amount of time 1.10 takes, but it 
> seems to run for >1 minute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[GitHub] [commons-text] aherbert commented on a diff in pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity

2023-08-21 Thread via GitHub



aherbert commented on code in PR #452:
URL: https://github.com/apache/commons-text/pull/452#discussion_r1300134930


##
src/main/java/org/apache/commons/text/TextStringBuilder.java:
##
@@ -1820,17 +1832,84 @@ public boolean endsWith(final String str) {
 /**
  * Tests the capacity and ensures that it is at least the size specified.
  *
+ * Note: This method can be used to minimise memory reallocations during
+ * repeated addition of values by pre-allocating the character buffer.
+ * The method ignores a negative {@code capacity} argument.
+ *
  * @param capacity the capacity to ensure
  * @return this, to enable chaining
+ * @throws OutOfMemoryError if the capacity cannot be allocated
  */
 public TextStringBuilder ensureCapacity(final int capacity) {
-// checks for overflow
-if (capacity > 0 && capacity - buffer.length > 0) {
-reallocate(capacity);
+if (capacity > 0) {
+ensureCapacityInternal(capacity);
 }
 return this;
 }
 
+/**
+ * Ensure that the buffer is at least the size specified. The {@code 
capacity} argument
+ * is treated as an unsigned integer.
+ *
+ * This method will raise an {@link OutOfMemoryError} if the capacity 
is too large
+ * for an array, or cannot be allocated.
+ *
+ * @param capacity the capacity to ensure
+ * @throws OutOfMemoryError if the capacity cannot be allocated
+ */
+private void ensureCapacityInternal(final int capacity) {
+// Check for overflow of the current buffer.
+// Assumes capacity is an unsigned integer up to Integer.MAX_VALUE * 2
+// (the largest possible addition of two maximum length arrays).
+if (capacity - buffer.length > 0) {
+resizeBuffer(capacity);
+}
+}
+
+/**
+ * Resizes the buffer to at least the size specified.
+ *
+ * @param minCapacity the minimum required capacity
+ * @throws OutOfMemoryError if the {@code minCapacity} is negative
+ */
+private void resizeBuffer(final int minCapacity) {
+// Overflow-conscious code treats the min and new capacity as unsigned.
+final int oldCapacity = buffer.length;
+int newCapacity = oldCapacity * 2;
+if (Integer.compareUnsigned(newCapacity, minCapacity) < 0) {
+newCapacity = minCapacity;
+}
+if (Integer.compareUnsigned(newCapacity, MAX_BUFFER_SIZE) > 0) {
+newCapacity = createPositiveCapacity(minCapacity);
+}
+reallocate(newCapacity);
+}
+
+/**
+ * Create a positive capacity at least as large the minimum required 
capacity.
+ * If the minimum capacity is negative then this throws an 
OutOfMemoryError as no array
+ * can be allocated.
+ *
+ * @param minCapacity the minimum capacity
+ * @return the capacity
+ * @throws OutOfMemoryError if the {@code minCapacity} is negative
+ */
+private static int createPositiveCapacity(final int minCapacity) {
+if (minCapacity < 0) {
+// overflow
+throw new OutOfMemoryError("Unable to allocate array size: " + 
Integer.toUnsignedString(minCapacity));

Review Comment:
   We throw an OOM in the code I adapted from Commons Codec. Disclaimer: I 
wrote that code so I am only copying my own precedent. However this is the 
behaviour of JDK's StringBuilder. If you try to add to an array such that it 
would not fit in a single array you will get an OutOfMemoryError. Since we are 
billing the TextStringBuilder as a replacement for that then it seems 
appropriate to copy the behaviour.
   
   Regarding recovery by the user. This would be a situation where you are 
adding to the StringBuilder and then you get an out-of-memory error. So you 
have a situation where a builder cannot create a single String to hold all the 
data. The recovery is to either try to process smaller chunks of the full 
chars; or accept that you have to truncate the single String. If we start 
throwing e.g. IAE here then a user will not be able to use the same try-catch 
with our object that they use with a JDK StringBuilder. The catch must be 
modified and we lose the drop-in-replacement pattern.
   
   Note: I just updated my extra tests to use JDK's StringBuilder. They pass.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

2023-08-21 Thread Alex Herbert (Jira)



[ 
https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756923#comment-17756923
 ] 

Alex Herbert commented on TEXT-228:
---

Throwing an OutOfMemoryError is reasonable practice when writing objects that 
allocate memory. In this case the error will arise when new chars are to added 
the char buffer managed by the TextStringBuilder. Whether the error arises 
inside Arrays.copyOf(buffer, newLength) or we throw one because we know the 
newLength is not possible to allocate the behaviour to the downstream user is 
identical. They tried to increase the size of the TextStringBuilder and ran out 
of memory.

If we change to throwing a different runtime exception I do not see that 
information being useful for the user. They did something that was not possible 
within the memory limits of the JVM.

PR for review here: [PR 452|https://github.com/apache/commons-text/pull/452]

I adapted code from Commons Codec where we do a similar buffer management in 
the BaseNCodec for a byte[] buffer. I expect the tests that push the memory to 
the limit to be skipped in the CI. They do run locally if you have enough 
memory. The smallest memory footprint test still requires at least 2GiB. This 
may be executed.

 

> StringTokenizer performance degradation when parsing large lines
> 
>
> Key: TEXT-228
> URL: https://issues.apache.org/jira/browse/TEXT-228
> Project: Commons Text
>  Issue Type: Bug
>Affects Versions: 1.9, 1.10.0
> Environment: Linux
>Reporter: Zack Hable
>Priority: Minor
>
> After recently upgrading from Apache Commons Text 1.9 to 1.10.0 we've noticed 
> our system "hangs" (or likely will take an excessively long time to process) 
> large lines (100MB+ in size) when splitting strings with StringTokenizer.
>  
> Mitigation: Revert to Apache Commons Text 1.9
>  
> Scala version:
>  
> {code:java}
> > scala -version
> Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and 
> Lightbend, Inc.
> {code}
>  
> Java version:
>  
> {code:java}
> > java -version 
> openjdk version "1.8.0_382"
> OpenJDK Runtime Environment (build 1.8.0_382-b05)
> OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
> {code}
>  
>  
> Reproduction Steps:
>  # Generate a sample large file
> {code:java}
> echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
> dd if=/dev/zero bs=100MB count=1 >> largefile 
> sed -ie "s/\x0/0/g" largefile
> echo -n "\0" >> largefile
> {code}
>  # Setup reproduce.scala
> {code:java}
> import org.apache.commons.text.StringTokenizer
> val lines = scala.io.Source.fromFile("./largefile").getLines.toList
> val st: StringTokenizer = new StringTokenizer(lines(0))
> val res = st.getTokenArray()
> {code}
>  # Download Apache Commons Jars
> {code:java}
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
> {code}
>  # Run program with a 10 second timeout (1.10 seems to hang for >1 minute)
> {code:java}
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala  2.60s 
> user 0.83s system 121% cpu 2.818 total
>  
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala  0.02s 
> user 0.00s system 0% cpu 10.002 total
> {code}
> As you notice above 1.9 takes ~3 seconds whereas 1.10 times out after 10 
> seconds.  I haven't come across a definite amount of time 1.10 takes, but it 
> seems to run for >1 minute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[GitHub] [commons-text] garydgregory commented on a diff in pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity

2023-08-21 Thread via GitHub



garydgregory commented on code in PR #452:
URL: https://github.com/apache/commons-text/pull/452#discussion_r1300107300


##
src/main/java/org/apache/commons/text/TextStringBuilder.java:
##
@@ -1820,17 +1832,84 @@ public boolean endsWith(final String str) {
 /**
  * Tests the capacity and ensures that it is at least the size specified.
  *
+ * Note: This method can be used to minimise memory reallocations during
+ * repeated addition of values by pre-allocating the character buffer.
+ * The method ignores a negative {@code capacity} argument.
+ *
  * @param capacity the capacity to ensure
  * @return this, to enable chaining
+ * @throws OutOfMemoryError if the capacity cannot be allocated
  */
 public TextStringBuilder ensureCapacity(final int capacity) {
-// checks for overflow
-if (capacity > 0 && capacity - buffer.length > 0) {
-reallocate(capacity);
+if (capacity > 0) {
+ensureCapacityInternal(capacity);
 }
 return this;
 }
 
+/**
+ * Ensure that the buffer is at least the size specified. The {@code 
capacity} argument
+ * is treated as an unsigned integer.
+ *
+ * This method will raise an {@link OutOfMemoryError} if the capacity 
is too large
+ * for an array, or cannot be allocated.
+ *
+ * @param capacity the capacity to ensure
+ * @throws OutOfMemoryError if the capacity cannot be allocated
+ */
+private void ensureCapacityInternal(final int capacity) {
+// Check for overflow of the current buffer.
+// Assumes capacity is an unsigned integer up to Integer.MAX_VALUE * 2
+// (the largest possible addition of two maximum length arrays).
+if (capacity - buffer.length > 0) {
+resizeBuffer(capacity);
+}
+}
+
+/**
+ * Resizes the buffer to at least the size specified.
+ *
+ * @param minCapacity the minimum required capacity
+ * @throws OutOfMemoryError if the {@code minCapacity} is negative
+ */
+private void resizeBuffer(final int minCapacity) {
+// Overflow-conscious code treats the min and new capacity as unsigned.
+final int oldCapacity = buffer.length;
+int newCapacity = oldCapacity * 2;
+if (Integer.compareUnsigned(newCapacity, minCapacity) < 0) {
+newCapacity = minCapacity;
+}
+if (Integer.compareUnsigned(newCapacity, MAX_BUFFER_SIZE) > 0) {
+newCapacity = createPositiveCapacity(minCapacity);
+}
+reallocate(newCapacity);
+}
+
+/**
+ * Create a positive capacity at least as large the minimum required 
capacity.
+ * If the minimum capacity is negative then this throws an 
OutOfMemoryError as no array
+ * can be allocated.
+ *
+ * @param minCapacity the minimum capacity
+ * @return the capacity
+ * @throws OutOfMemoryError if the {@code minCapacity} is negative
+ */
+private static int createPositiveCapacity(final int minCapacity) {
+if (minCapacity < 0) {
+// overflow
+throw new OutOfMemoryError("Unable to allocate array size: " + 
Integer.toUnsignedString(minCapacity));

Review Comment:
   Hi @aherbert 
   Thank you for this PR :-)
   I think this should be an `IllegalArgumentException` because the JVM has NOT 
in fact run out of memory and a call site up the stack may be able to recover 
from the situation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [commons-text] aherbert opened a new pull request, #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity

2023-08-21 Thread via GitHub



aherbert opened a new pull request, #452:
URL: https://github.com/apache/commons-text/pull/452

   This fixes a performance regression where allocation was only using the 
exact size necessary for the additional chars (resulting in buffer reallocation 
on each append).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

2023-08-21 Thread via GitHub



theshoeshiner commented on code in PR #450:
URL: https://github.com/apache/commons-text/pull/450#discussion_r133105


##
src/main/java/org/apache/commons/text/cases/CamelCase.java:
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.text.cases;
+
+import java.util.LinkedList;
+import java.util.List;
+
+import org.apache.commons.lang3.CharUtils;
+import org.apache.commons.lang3.StringUtils;
+
+/**
+ * Case implementation that parses and formats strings of the form 
'myCamelCase'
+ * 
+ * This case separates tokens on uppercase ascii alpha characters, with the 
exception
+ * that the first token begin with a lowercase ascii alpha character.
+ * 
+ */
+public class CamelCase implements Case {
+
+/** constant reuseable instance of this case. */
+public static final CamelCase INSTANCE = new CamelCase();
+
+/**
+ * Constructs new CamelCase instance.
+ */
+public CamelCase() {
+super();
+}
+
+/**
+ * Parses string tokens from a Camel Case formatted string.
+ * 
+ * Parses each character of the string parameter and creates new tokens 
when uppercase ascii
+ * letters are encountered. The upppercase letter is considered part of 
the new token. The very
+ * first character of the string is an exception to this rule and must be 
a lowercase ascii

Review Comment:
   It enforces the rule that the input be a camel cased string so that 
`format(parse()) == ` holds true. Not enforcing that rule would 
be a form of "loose" parsing, as opposed to the current strict logic which 
expects the user to know the case of the parsed string. Not enforcing this rule 
essentially makes this a pascal case parser, which already exists, and the API 
should let the user know that they're not using the correct ~parser~ Case for 
parsing.
   
   If we want to implement separate strict vs loose parsing then I'll give some 
more thought about how to expose that in a way that makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

2023-08-21 Thread via GitHub



theshoeshiner commented on code in PR #450:
URL: https://github.com/apache/commons-text/pull/450#discussion_r133105


##
src/main/java/org/apache/commons/text/cases/CamelCase.java:
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.text.cases;
+
+import java.util.LinkedList;
+import java.util.List;
+
+import org.apache.commons.lang3.CharUtils;
+import org.apache.commons.lang3.StringUtils;
+
+/**
+ * Case implementation that parses and formats strings of the form 
'myCamelCase'
+ * 
+ * This case separates tokens on uppercase ascii alpha characters, with the 
exception
+ * that the first token begin with a lowercase ascii alpha character.
+ * 
+ */
+public class CamelCase implements Case {
+
+/** constant reuseable instance of this case. */
+public static final CamelCase INSTANCE = new CamelCase();
+
+/**
+ * Constructs new CamelCase instance.
+ */
+public CamelCase() {
+super();
+}
+
+/**
+ * Parses string tokens from a Camel Case formatted string.
+ * 
+ * Parses each character of the string parameter and creates new tokens 
when uppercase ascii
+ * letters are encountered. The upppercase letter is considered part of 
the new token. The very
+ * first character of the string is an exception to this rule and must be 
a lowercase ascii

Review Comment:
   It enforces the rule that the input be a camel cased string so that 
`format(parse()) == ` holds true. Not enforcing that rule would 
be a form of "loose" parsing, as opposed to the current strict logic which 
expects the user to know the case of the parsed string. Not enforcing this rule 
essentially makes this a pascal case parser, which already exists, and the API 
should let the user know that they're not using the correct Case.
   
   If we want to implement separate strict vs loose parsing then I'll give some 
more thought about how to expose that in a way that makes sense.



##
src/main/java/org/apache/commons/text/cases/CamelCase.java:
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.text.cases;
+
+import java.util.LinkedList;
+import java.util.List;
+
+import org.apache.commons.lang3.CharUtils;
+import org.apache.commons.lang3.StringUtils;
+
+/**
+ * Case implementation that parses and formats strings of the form 
'myCamelCase'
+ * 
+ * This case separates tokens on uppercase ascii alpha characters, with the 
exception
+ * that the first token begin with a lowercase ascii alpha character.
+ * 
+ */
+public class CamelCase implements Case {
+
+/** constant reuseable instance of this case. */
+public static final CamelCase INSTANCE = new CamelCase();
+
+/**
+ * Constructs new CamelCase instance.
+ */
+public CamelCase() {
+super();
+}
+
+/**
+ * Parses string tokens from a Camel Case formatted string.
+ * 
+ * Parses each character of the string parameter and creates new tokens 
when uppercase ascii
+ * letters are encountered. The upppercase letter is considered part of 
the new token. The very
+ * first character of the string is an exception to this rule and must be 
a lowercase ascii

Review Comment:
   It enforces the rule that the input be a camel cased string so that 
`format(parse()) == ` holds true. Not enforcing that rule would 
be a form of "loose" parsing, as opposed to the cur

[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

2023-08-21 Thread via GitHub



theshoeshiner commented on code in PR #450:
URL: https://github.com/apache/commons-text/pull/450#discussion_r133105


##
src/main/java/org/apache/commons/text/cases/CamelCase.java:
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.text.cases;
+
+import java.util.LinkedList;
+import java.util.List;
+
+import org.apache.commons.lang3.CharUtils;
+import org.apache.commons.lang3.StringUtils;
+
+/**
+ * Case implementation that parses and formats strings of the form 
'myCamelCase'
+ * 
+ * This case separates tokens on uppercase ascii alpha characters, with the 
exception
+ * that the first token begin with a lowercase ascii alpha character.
+ * 
+ */
+public class CamelCase implements Case {
+
+/** constant reuseable instance of this case. */
+public static final CamelCase INSTANCE = new CamelCase();
+
+/**
+ * Constructs new CamelCase instance.
+ */
+public CamelCase() {
+super();
+}
+
+/**
+ * Parses string tokens from a Camel Case formatted string.
+ * 
+ * Parses each character of the string parameter and creates new tokens 
when uppercase ascii
+ * letters are encountered. The upppercase letter is considered part of 
the new token. The very
+ * first character of the string is an exception to this rule and must be 
a lowercase ascii

Review Comment:
   It enforces the rule that the input be a camel cased string so that 
`format(parse()) == ` holds true. Not enforcing that rule would 
be a form of "loose" parsing, as opposed to the current strict logic which 
expects the user to know the case of the parsed string. Not enforcing this rule 
essentially makes this a pascal case parser, which already exists, and the API 
should let the user know that they're not using the correct parser.
   
   If we want to implement separate strict vs loose parsing then I'll give some 
more thought about how to expose that in a way that makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

2023-08-21 Thread via GitHub



theshoeshiner commented on code in PR #450:
URL: https://github.com/apache/commons-text/pull/450#discussion_r133105


##
src/main/java/org/apache/commons/text/cases/CamelCase.java:
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.text.cases;
+
+import java.util.LinkedList;
+import java.util.List;
+
+import org.apache.commons.lang3.CharUtils;
+import org.apache.commons.lang3.StringUtils;
+
+/**
+ * Case implementation that parses and formats strings of the form 
'myCamelCase'
+ * 
+ * This case separates tokens on uppercase ascii alpha characters, with the 
exception
+ * that the first token begin with a lowercase ascii alpha character.
+ * 
+ */
+public class CamelCase implements Case {
+
+/** constant reuseable instance of this case. */
+public static final CamelCase INSTANCE = new CamelCase();
+
+/**
+ * Constructs new CamelCase instance.
+ */
+public CamelCase() {
+super();
+}
+
+/**
+ * Parses string tokens from a Camel Case formatted string.
+ * 
+ * Parses each character of the string parameter and creates new tokens 
when uppercase ascii
+ * letters are encountered. The upppercase letter is considered part of 
the new token. The very
+ * first character of the string is an exception to this rule and must be 
a lowercase ascii

Review Comment:
   It enforces the rule that the input be a camel cased string (so that 
`format(parse()) == ` holds true). Not enforcing that rule would 
be a form of "loose" parsing, as opposed to the current strict logic which 
expects the user to know the case of the parsed string. Not enforcing this rule 
essentially makes this a pascal case parser, which already exists, and the API 
should let the user know that they're not using the correct parser.
   
   If we want to implement separate strict vs loose parsing then I'll give some 
more thought about how to expose that in a way that makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

2023-08-21 Thread Gary D. Gregory (Jira)



[ 
https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756856#comment-17756856
 ] 

Gary D. Gregory commented on TEXT-228:
--

Hi [~aherbert] 

I'm not sure we or any library code should throw {{{}OutOfMemoryError{}}}, 
which is reserved for the JVM as documented in the error's Javadoc. Instead, an 
IAE or ISE seems more appropriate.

> StringTokenizer performance degradation when parsing large lines
> 
>
> Key: TEXT-228
> URL: https://issues.apache.org/jira/browse/TEXT-228
> Project: Commons Text
>  Issue Type: Bug
>Affects Versions: 1.9, 1.10.0
> Environment: Linux
>Reporter: Zack Hable
>Priority: Minor
>
> After recently upgrading from Apache Commons Text 1.9 to 1.10.0 we've noticed 
> our system "hangs" (or likely will take an excessively long time to process) 
> large lines (100MB+ in size) when splitting strings with StringTokenizer.
>  
> Mitigation: Revert to Apache Commons Text 1.9
>  
> Scala version:
>  
> {code:java}
> > scala -version
> Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and 
> Lightbend, Inc.
> {code}
>  
> Java version:
>  
> {code:java}
> > java -version 
> openjdk version "1.8.0_382"
> OpenJDK Runtime Environment (build 1.8.0_382-b05)
> OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
> {code}
>  
>  
> Reproduction Steps:
>  # Generate a sample large file
> {code:java}
> echo -n '"SOME TEXT WITH SPACE" "SOME TEXT WITH SPACE" ' > largefile
> dd if=/dev/zero bs=100MB count=1 >> largefile 
> sed -ie "s/\x0/0/g" largefile
> echo -n "\0" >> largefile
> {code}
>  # Setup reproduce.scala
> {code:java}
> import org.apache.commons.text.StringTokenizer
> val lines = scala.io.Source.fromFile("./largefile").getLines.toList
> val st: StringTokenizer = new StringTokenizer(lines(0))
> val res = st.getTokenArray()
> {code}
>  # Download Apache Commons Jars
> {code:java}
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar
> wget 
> https://repo1.maven.org/maven2/org/apache/commons/commons-text/1.9/commons-text-1.9.jar
> {code}
>  # Run program with a 10 second timeout (1.10 seems to hang for >1 minute)
> {code:java}
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.9.jar reproduce.scala  2.60s 
> user 0.83s system 121% cpu 2.818 total
>  
> > time timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala
> timeout 10 scala -J-Xmx2g -cp commons-text-1.10.0.jar reproduce.scala  0.02s 
> user 0.00s system 0% cpu 10.002 total
> {code}
> As you notice above 1.9 takes ~3 seconds whereas 1.10 times out after 10 
> seconds.  I haven't come across a definite amount of time 1.10 takes, but it 
> seems to run for >1 minute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[GitHub] [commons-text] garydgregory commented on pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

2023-08-21 Thread via GitHub



garydgregory commented on PR #450:
URL: https://github.com/apache/commons-text/pull/450#issuecomment-1686152071

   IMO: You should follow the YAGNI principle and only make public and 
protected what you must, especially in a first cut.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [commons-build-plugin] garydgregory merged pull request #174: Bump org.apache.ant:ant-launcher from 1.10.12 to 1.10.14

2023-08-21 Thread via GitHub



garydgregory merged PR #174:
URL: https://github.com/apache/commons-build-plugin/pull/174


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

2023-08-21 Thread Alex Herbert (Jira)



[ 
https://issues.apache.org/jira/browse/TEXT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756840#comment-17756840
 ] 

Alex Herbert commented on TEXT-228:
---

The performance regression can be traced to commit 808ecca following a 
discussion in [PR 132|https://github.com/apache/commons-text/pull/132].

This changed the TextStringBuilder.ensureCapacity method from:
{code:java}
public TextStringBuilder ensureCapacity(final int capacity) {
if (capacity > buffer.length) {
reallocate(capacity * 2);
}
return this;
}
{code}
 
To:
{code:java}
public TextStringBuilder ensureCapacity(final int capacity) {
// checks for overflow
if (capacity > 0 && capacity - buffer.length > 0) {
reallocate(capacity);  //  <--- FYI: No capacity doubling here!
}
return this;
}
{code}
This change *tries* to ensure that the check for capacity is overflow safe. 
However there are issues:
h2. 1. Performance

Previously the method would allocate double the required capacity. This is a 
bit excessive. However it does reduce the number of memory reallocations 
significantly.

The new method will only allocate the memory required. There is {*}no extra 
buffer allocated{*}.

The result is that when appending single characters (as done by the 
StringTokenizer) every new character requires a copy of the entire character 
buffer.

Note: The overflow check is not the reason for the slowdown. On my machine I 
can use the original method and the updated method with approximately the same 
runtime if the updated method is changed to use reallocate(capacity * 2).

Also note that this is a specific use case for appending single characters to a 
buffer. In this case the use of ensureCapacity(size + 1) seems overkill. I 
changed the append(char) method from:
{code:java}
public TextStringBuilder append(final char ch) {
final int len = length();
ensureCapacity(len + 1);
buffer[size++] = ch;
return this;
}
{code}
To:
{code:java}
public TextStringBuilder append(final char ch) {
if (size == buffer.length) {
reallocate(size * 2);
}
buffer[size++] = ch;
return this;
}
{code}
This made no notable speed difference to the provided StringTokenizer code. 
However that program has a lot of additional logic performed per character 
processed from the input by the tokenizer. A JMH test appending single 
characters to the TextStringBuilder should be used to examine more closely this 
potential optimisation.
h2. 2. No feedback on error

The method (still) does not feed back to the user if the requested change to 
the capacity was successful.

Current unit tests actually enforce this behaviour. E.g. calling 
ensureCapacity(-1) is allowed.

This behaviour may appear strange but is the same as the JDK's 
ArrayList.ensureCapacity. We should add some javadoc to indicate that only 
positive arguments are recognised by the ensureCapacity method. Negative 
arguments are currently ignored.

This does create one issue in that the method is also used extensively 
internally when adding new chars. In this case it should raise as error if 
adding failed. If you try and append to a TextStringBuilder something that is 
too long for an array buffer, it will not fail with an OutOfMemoryError. It 
performs a no-op when trying to ensureCapacity and then will fail with an 
IndexOutOfBoundsException when actually copying the chars. E.g. this test does 
not pass:
{code:java}
@Test
public void testOutOfMemoryError() {
char[] chars = new char[3 * (Integer.MAX_VALUE / 8)];
final TextStringBuilder sb = new TextStringBuilder();
sb.append(chars);
sb.append(chars);
assertThrows(OutOfMemoryError.class, () -> sb.append(chars));
}
{code}
h2. 3. Overflow safe

The overflow safe code is redundant. In this check:
{code:java}
if (capacity > 0 && capacity - buffer.length > 0) { {code}
The capacity is used as a signed integer, then as an unsigned integer in the 
overflow safe subtraction to determine if the buffer is long enough. So it is 
mixing the interpretation of capacity and effectively only performing:
{code:java}
if (capacity > buffer.length) { {code}
So the change in commit 808ecca only solved the issue raised by PR 132 due to 
the change from:
{code:java}
reallocate(capacity * 2) {code}
to:
{code:java}
reallocate(capacity) {code}
But this introduced a performance regression.
h2. Solution

What is required:
 # The existing ensureCapacity behaviour is unchanged. It should no-op when the 
capacity is negative.
 # The checks to ensure new chars can be added should be overflow safe. If 
overflow occurs in the capacity, then an OutOfMemoryError should be raised
 # Increases in the capacity should ensure that the new capacity can store at a 
minimum the requested capacity. This should allow extra buffer to reduce 
reallocations.

The solution is to use an internal method to ensure the capacity when a source 
of k

[GitHub] [commons-text] theshoeshiner commented on pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

2023-08-21 Thread via GitHub



theshoeshiner commented on PR #450:
URL: https://github.com/apache/commons-text/pull/450#issuecomment-1686121778

   @elharo 
   In regards to a couple of the above changes, is there a reason we want to 
prevent these classes from being extended? I can imagine many scenarios where 
one would want to add additional logic to them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Created] (JEXL-402) parse failed with empty return value.

[GitHub] [commons-text] garydgregory merged pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity

[GitHub] [commons-text] garydgregory commented on pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity

[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

[GitHub] [commons-text] aherbert commented on a diff in pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity

[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

[GitHub] [commons-text] garydgregory commented on a diff in pull request #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity

[GitHub] [commons-text] aherbert opened a new pull request, #452: TEXT-228: Fix TextStringBuilder to over-allocate when ensuring capacity

[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

[GitHub] [commons-text] theshoeshiner commented on a diff in pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

[GitHub] [commons-text] garydgregory commented on pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

[GitHub] [commons-build-plugin] garydgregory merged pull request #174: Bump org.apache.ant:ant-launcher from 1.10.12 to 1.10.14

[jira] [Commented] (TEXT-228) StringTokenizer performance degradation when parsing large lines

[GitHub] [commons-text] theshoeshiner commented on pull request #450: Cases API + 4 implementations (Pascal, Camel, Kebab, Snake)

18 matches

Site Navigation

Mail list logo

Footer information