ppkarwasz commented on code in PR #1327:
URL: https://github.com/apache/commons-lang/pull/1327#discussion_r1872983036
##########
src/main/java/org/apache/commons/lang3/StringUtils.java:
##########
@@ -2888,17 +2876,20 @@ public static int indexOfAnyBut(final CharSequence seq,
final CharSequence searc
if (isEmpty(seq) || isEmpty(searchChars)) {
return INDEX_NOT_FOUND;
}
- final int strLen = seq.length();
- for (int i = 0; i < strLen; i++) {
- final char ch = seq.charAt(i);
- final boolean chFound = CharSequenceUtils.indexOf(searchChars, ch,
0) >= 0;
- if (i + 1 < strLen && Character.isHighSurrogate(ch)) {
- final char ch2 = seq.charAt(i + 1);
- if (chFound && CharSequenceUtils.indexOf(searchChars, ch2, 0)
< 0) {
- return i;
- }
- } else if (!chFound) {
- return i;
+ final Set<Integer> seqSetCodePoints =
seq.codePoints().boxed().collect(Collectors.toSet()); // JDK >=10:
Collectors::toUnmodifiableSet
+ final Set<Integer> searchSetCodePoints =
searchChars.codePoints().boxed()
+ .collect(Collectors.toSet()); // JDK >=10:
Collectors::toUnmodifiableSet
+ final Set<Integer> complSetCodePoints =
seqSetCodePoints.stream().filter(((Predicate<Integer>)
searchSetCodePoints::contains).negate()) // JDK >=11:
Predicate.not(searchSetCodePoints::contains)
+ .collect(Collectors.toSet()); // JDK >=10:
Collectors::toUnmodifiableSet
Review Comment:
This creates 3 temporary sets. Since `seq` is potentially a very long
sequence, we can end up with a very large memory footprint, while the previous
code was garbage-free.
The advantage of `Stream`s (or `for` loops :wink:) is that you can exit
early without consuming `seq.codePoints()`.
Using Java 9, we could implement this method using:
```java
Set<Integer> searchCodePoints =
searchChars.codePoints().boxed().collect(Collectors.toSet());
int idx = seq.codePoints()
.takeWhile(c -> !searchCodePoints.contains(c))
.reduce(0, (count, codePoint) -> count +
Character.charCount(codePoint));
return idx < seq.length() ? idx : INDEX_NOT_FOUND;
```
##########
src/main/java/org/apache/commons/lang3/StringUtils.java:
##########
@@ -2839,31 +2845,13 @@ public static int indexOfAnyBut(final CharSequence cs,
final char... searchChars
if (isEmpty(cs) || ArrayUtils.isEmpty(searchChars)) {
return INDEX_NOT_FOUND;
}
- final int csLen = cs.length();
- final int csLast = csLen - 1;
- final int searchLen = searchChars.length;
- final int searchLast = searchLen - 1;
- outer:
- for (int i = 0; i < csLen; i++) {
- final char ch = cs.charAt(i);
- for (int j = 0; j < searchLen; j++) {
- if (searchChars[j] == ch) {
- if (i >= csLast || j >= searchLast ||
!Character.isHighSurrogate(ch)) {
- continue outer;
- }
- if (searchChars[j + 1] == cs.charAt(i + 1)) {
- continue outer;
- }
- }
- }
- return i;
- }
- return INDEX_NOT_FOUND;
+ return indexOfAnyBut(cs, CharBuffer.wrap(searchChars));
Review Comment:
This looks like a valid improvement! :100:
##########
src/main/java/org/apache/commons/lang3/StringUtils.java:
##########
@@ -2888,17 +2876,20 @@ public static int indexOfAnyBut(final CharSequence seq,
final CharSequence searc
if (isEmpty(seq) || isEmpty(searchChars)) {
return INDEX_NOT_FOUND;
}
- final int strLen = seq.length();
- for (int i = 0; i < strLen; i++) {
- final char ch = seq.charAt(i);
- final boolean chFound = CharSequenceUtils.indexOf(searchChars, ch,
0) >= 0;
- if (i + 1 < strLen && Character.isHighSurrogate(ch)) {
- final char ch2 = seq.charAt(i + 1);
- if (chFound && CharSequenceUtils.indexOf(searchChars, ch2, 0)
< 0) {
- return i;
- }
- } else if (!chFound) {
- return i;
Review Comment:
I think we just need to replace this code with:
```java
int codePoint;
for (int i = 0; i < strLen; i += Character.charCount(codePoint)) {
codePoint = Character.codePointAt(seq, i);
if (CharSequenceUtils.indexOf(searchChars, codePoint, 0) == -1) {
return i;
}
}
```
This way, surrogate pairs `HL` in `seq` are only looked up once in
`searchChars`.
Not sure how to handle the case, when a surrogate character appears alone in
`seq`, but IMHO in this case the `seq` string is corrupted and undefined
behavior is justified.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]