Re: RFR: 8345668: ZoneOffset.ofTotalSeconds performance regression [v3]

2024-12-30 Thread Brett Okken
On Fri, 27 Dec 2024 23:24:46 GMT, Chen Liang  wrote:

>>> For sure we should use result of `putIfAbsent`
>> 
>> Drive-by comment...
>> 
>> From what I can infer, the performance regression being addressed here is 
>> caused in part by the fact that (for example) 
>> `ConcurrentHashMap.computeIfAbsent()` provides an atomicity guarantee, which 
>> is a stronger requirement than is necessary here, and therefore by splitting 
>> that call into two separate, non-atomic `get()` and `put()` calls we 
>> get (counter-intuitively) faster execution time, even though there are more 
>> lines of code. Note `putIfAbsent()` also guarantees atomicity, so the same 
>> problem of slowness caused by "unnecessary atomicity" might occur with it as 
>> well.
>
> Indeed, just noticed that both `computeIfAbsent` and `putIfAbsent` may 
> acquire the lock when the key is present, while `get` never acquires a lock 
> for read-only access.
> 
> Maybe the implementation was written back when locking was less costly (with 
> biased locking, etc.). Now we might want to avoid locking until we know for 
> sure that a plain `get` misses.

This scenario is discussed in Effective Java by Joshua Bloch. His observation 
then (Java 5/6 time frame?) was that optimistically calling `get` first, and 
only calling `putIfAbsent` or `computeIfAbsent` if the `get` returned `null`, 
was 250% faster, because calls to `putIfAbsent`/`computeIfAbsent` have 
contention. There have been changes made to those methods since then to try to 
avoid synchronization when the key is already present, but the observation 
seems to confirm that the optimistic `get` call first is still faster (though 
by a much smaller difference).

My comment was not to revert to the prior behavior of just calling 
`computeIfAbsent`, but rather to change the (expected rare) case where the 
first `get` returns `null`: replace the `putIfAbsent` and second `get` call 
with a single `computeIfAbsent` (or utilize the result of `putIfAbsent` to 
avoid the second call to `get`).
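The pattern being discussed can be sketched as follows. This is a minimal, self-contained illustration, not the actual ZoneOffset/DateTimeTextProvider code; the cache key, value type, and `"value-" + k` factory are stand-ins.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Optimistic read pattern: a lock-free get() on the fast path, falling back
// to computeIfAbsent() only on a cache miss. The atomic (and potentially
// contended) call is only paid when the value is not yet cached.
public class OptimisticCache {
    private static final ConcurrentMap<Integer, String> CACHE = new ConcurrentHashMap<>();

    static String get(int key) {
        String value = CACHE.get(key);      // lock-free fast path
        if (value == null) {
            // rare slow path; computeIfAbsent avoids a second get()
            value = CACHE.computeIfAbsent(key, k -> "value-" + k);
        }
        return value;
    }

    public static void main(String[] args) {
        String first = get(42);
        String second = get(42);
        if (first != second) throw new AssertionError("expected cached instance");
        System.out.println(first);
    }
}
```

Subsequent calls return the cached instance without ever acquiring a lock, which is the behavior the benchmark numbers above are measuring.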

-

PR Review Comment: https://git.openjdk.org/jdk/pull/22854#discussion_r1899699668


Re: RFR: 8345668: ZoneOffset.ofTotalSeconds performance regression [v3]

2024-12-26 Thread Brett Okken
On Fri, 20 Dec 2024 21:06:16 GMT, Naoto Sato  wrote:

>> The change made in 
>> [JDK-8288723](https://bugs.openjdk.org/browse/JDK-8288723) seems innocuous, 
>> but it caused this performance regression. Partially reverting the change 
>> (ones that involve `computeIfAbsent()`) to the original. Provided a 
>> benchmark that iterates the call to `ZoneOffset.ofTotalSeconds(0)` 1,000 
>> times, which improves the operation time from 3,946ns to 2,241ns.
>
> Naoto Sato has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixed compile error

src/java.base/share/classes/java/time/format/DateTimeTextProvider.java line 316:

> 314: store = createStore(field, locale);
> 315: CACHE.putIfAbsent(key, store);
> 316: store = CACHE.get(key);

Should this be 
`store = CACHE.computeIfAbsent(key, e -> createStore(e.getKey(), 
e.getValue()));`?

That still allows the optimistic/concurrent `get` call to succeed most of the 
time (when already cached) but reduces the interactions with the map when a 
value is created/set/accessed for the first time.

Alternatively, the result of `putIfAbsent` could be checked/used to avoid the 
second call to `get`.
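The `putIfAbsent` alternative mentioned above can be sketched like this. The map, key, and value types are illustrative; `"store-for-" + key` stands in for the `createStore(...)` call in the JDK code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Use putIfAbsent's return value (the prior mapping, or null) instead of
// issuing a second get() after the put.
public class PutIfAbsentDemo {
    private static final ConcurrentMap<String, String> CACHE = new ConcurrentHashMap<>();

    static String getOrCreate(String key) {
        String store = CACHE.get(key);            // optimistic fast path
        if (store == null) {
            store = "store-for-" + key;           // stands in for createStore(...)
            String prior = CACHE.putIfAbsent(key, store);
            if (prior != null) {
                store = prior;                    // another thread won the race
            }
        }
        return store;
    }

    public static void main(String[] args) {
        System.out.println(getOrCreate("en"));
        System.out.println(getOrCreate("en") == getOrCreate("en")); // same cached instance
    }
}
```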

-

PR Review Comment: https://git.openjdk.org/jdk/pull/22854#discussion_r1898012555


Re: Request for Comments: Adding bulk-read method "CharSequence.getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin)"

2024-10-26 Thread Brett Okken
+1 to option A.

Option C is too limiting on uses for the target. It could be added as a
default which calls through to option A, if there is meaningful demand.

Will there be a follow-up change/PR for the various places which would benefit
from using the new method?
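Option A's proposed default implementation (iterating `charAt`) can be sketched as a static helper, since we cannot patch the `CharSequence` interface here; the signature mirrors the proposal.

```java
// Fallback bulk read for any CharSequence, matching the proposed
// getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) signature.
// Concrete implementations like String would override this with an
// intrinsified array copy.
public class GetCharsDemo {
    static void getChars(CharSequence cs, int srcBegin, int srcEnd, char[] dst, int dstBegin) {
        for (int i = srcBegin; i < srcEnd; i++) {
            dst[dstBegin + i - srcBegin] = cs.charAt(i);
        }
    }

    public static void main(String[] args) {
        char[] buf = new char[5];
        getChars(new StringBuilder("hello world"), 6, 11, buf, 0);
        System.out.println(new String(buf)); // world
    }
}
```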

On Fri, Oct 25, 2024 at 1:34 PM Markus Karg  wrote:

> I hereby request for comments on the proposal to generalize the existing
> method "String.getChars()"'s signature to become a new default interface
> method "CharSequence.getChars()".
>
>
>
> Problem
>
>
>
> For performance reasons, many CharSequence implementations, in particular
> String, StringBuilder, StringBuffer and CharBuffer, provide a way to
> bulk-read a complete region of their characters content into a provided
> char array.
>
> Unfortunately, there is no _uniform_ way to perform this, and it is not
> guaranteed that bulk reading is implemented by _every_ CharSequence,
> in particular custom ones.
>
> While String, StringBuilder and StringBuffer all share the same getChars()
> method signature for this purpose, CharBuffer's way to perform the very
> same is the get() method.
>
> Other implementations have other method signatures, or do not have any
> solution to this problem at all.
>
> In particular, there is no method in their common interface, CharSequence,
> to perform such a bulk-optimized read, as CharSequence only allows reading
> one character after another in a sequential way, either by iterating over
> charAt() from 0 to length(), or by consuming the chars() Stream.
>
>
>
> As a result, code that wants to read from CharSequence in an
> implementation-agnostic, but still bulk-optimized way, needs to know _each_
> possible implementation's specific method!
>
> Effectively this results in code like this (real-world example taken from
> the implementation of Reader.of(CharSequence) in JDK 24):
>
>
>
> switch (cs) {
>case String s -> s.getChars(next, next + n, cbuf, off);
>case StringBuilder sb -> sb.getChars(next, next + n, cbuf, off);
>case StringBuffer sb -> sb.getChars(next, next + n, cbuf, off);
>case CharBuffer cb -> cb.get(next, cbuf, off, n);
>default -> {
>   for (int i = 0; i < n; i++)
>  cbuf[off + i] = cs.charAt(next + i);
>}
> }
>
>
>
> The problem with this code is that it is bound and limited to exactly that
> given set of CharSequence implementations.
>
> If a future CharSequence implementation shall get accessed in a
> bulk-optimized way, the switch expression has to be extended and
> recompiled _every time_.
>
> If some custom CharSequence implementation is used that this code is not
> aware of, sequential read is applied, even if that implementation _does_
> provide some bulk-read method!
>
>
>
> Solution
>
>
>
> There are several possible alternative solutions:
>
> * (A) CharSequence.getChars(int srcBegin, int srcEnd, char[] dst, int
> dstBegin) - As this signature is already supported by String, StringBuffer
> and StringBuilder, I hereby propose to add this signature to CharSequence
> and provide a default implementation that iterates over charAt(int) from 0
> to length().
>
> * (B) Alternatively the same default method could be implemented using the
> chars() Stream - I assume that might run slower, but correct me if I am
> wrong.
>
> * (C) Alternatively we could go with the signature get(char[] dst, int
> offset, int length) - Only CharBuffer implements that already, so more
> changes are needed and more duplicate methods will exist in the end.
>
> * (D) Alternatively we could come up with a totally different signature -
> That would be most fair to all existing implementations, but in the end it
> will imply the most changes and the most duplicate methods.
>
> * (E) We could give up the idea and live with the situation as-is. - I
> assume only few people really prefer that outcome.
>
>
>
> Please tell me if I missed a viable option!
>
>
>
> As a side benefit of CharSequence.getChars(), its existence might trigger
> implementors to provide bulk-reading if not done yet, at least for those
> cases where it is actually feasible.
>
> In the same way it might trigger callers of Reader to start making use of
> bulk reading, at least in those cases where it does make sense but
> application authors were reluctant to implement the switch-case shown above.
>
>
>
> Hence, if nobody vetoes, I will file Jira Issue, PR and CSR for
> "CharSequence.getChars()" (alternative A) in the next days.
>
>
>
> -Markus
>


Re: RFR: 8341566: Adding factory for non-synchronized CharSequence Reader [v3]

2024-10-06 Thread Brett Okken
On Sun, 6 Oct 2024 17:44:53 GMT, Markus KARG  wrote:

>> This Pull Requests proposes an implementation for 
>> [JDK-8341566](https://bugs.openjdk.org/browse/JDK-8341566): Adding the new 
>> method `public static Reader Reader.of(CharSequence)` will return an 
>> anonymous, non-synchronized implementation of a `Reader` for each kind of 
>> `CharSequence` implementation. It is optimized for `String`, 
>> `StringBuilder`, `StringBuffer` and `CharBuffer`.
>> 
>> In addition, this Pull Request proposes to replace the implementation of 
>> `StringReader` to become a simple synchronized wrapper around 
>> `Reader.of(CharSequence)` for the case of `String` sources. To ensure 
>> correctness, this PR...
>> * ...simply moved the **original code** of `StringReader` to become the 
>> de-facto implementation of `Reader.of()`, then stripped synchronized from it 
>> on the left hand, but kept just a synchronized wrapper on the right hand. 
>> Then added a `switch` for optimizations within the original code, at the 
>> exact location where previously just an optimization for `String` lived in.
>> * ...added tests for all methods (`Of.java`), and applied that test upon the 
>> modified `StringReader`.
>> 
>> Wherever new JavaDocs were added, existing phrases from other code locations 
>> have been copied and adapted, to best match the same wording.
>
> Markus KARG has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   fixup! Reader.of(String)
>   
>   Dropping non-public JavaDocs

src/java.base/share/classes/java/io/Reader.java line 203:

> 201: int n = Math.min(length - next, len);
> 202: switch (cs) {
> 203: case String s -> s.getChars(next, next + n, cbuf, 
> off);

There was some discussion on the mailing list of introducing a method to 
CharSequence for bulk getChars. Doing that would help both here and in 
Appendable/Writer implementations like StringWriter, PrintWriter, and 
OutputStreamWriter which currently convert to a String to then write.

https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/Writer.java#L367

https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/OutputStreamWriter.java#L253

-

PR Review Comment: https://git.openjdk.org/jdk/pull/21371#discussion_r1789217494


Re: RFR: 8339531: Improve performance of MemorySegment::mismatch [v9]

2024-09-13 Thread Brett Okken
On Wed, 11 Sep 2024 13:15:57 GMT, Per Minborg  wrote:

>> src/java.base/share/classes/jdk/internal/foreign/SegmentBulkOperations.java 
>> line 244:
>> 
>>> 242: return (Architecture.isLittleEndian()
>>> 243: ? Long.numberOfTrailingZeros(x)
>>> 244: : Long.numberOfLeadingZeros(x)) / 8;
>> 
>> Would there be value in having a constant LongToIntFunction lambda to 
>> capture this logic?
>> 
>> 
>> private static final LongToIntFunction LONG_LEADING_ZEROS = 
>> Architecture.isLittleEndian() ? Long::numberOfTrailingZeros : 
>> Long::numberOfLeadingZeros;
>
> Performance-wise, I suspect there would not be that much of a difference. The 
> compiler will see that Architecture.isLittleEndian provides a stable value 
> and can optimize and eliminate code (I think). I am unsure which is easier to 
> read.

Part of the value would be if it allowed removing this method entirely and 
moving the logic to the call site. This:


if (s != d) {
return start + offset + mismatch(s, d);
}


would become:


if (s != d) {
return start + offset + (LONG_LEADING_ZEROS.applyAsInt(s ^ d) / 8);
}
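For reference, a self-contained sketch of that constant and its use (`LongToIntFunction`'s method is `applyAsInt`). Endianness is hard-coded here for the demo; the JDK code would consult the internal `Architecture.isLittleEndian()`.

```java
import java.util.function.LongToIntFunction;

// Pick the zero-count direction once, based on endianness, then reuse it
// to turn a XOR of two words into the index of the first differing byte.
public class MismatchDemo {
    private static final boolean LITTLE_ENDIAN = true; // assumption for the demo

    private static final LongToIntFunction LEADING_ZEROS =
            LITTLE_ENDIAN ? Long::numberOfTrailingZeros : Long::numberOfLeadingZeros;

    static int firstMismatchedByte(long s, long d) {
        return LEADING_ZEROS.applyAsInt(s ^ d) / 8;
    }

    public static void main(String[] args) {
        // bytes first differ at byte index 2 (little-endian byte order)
        long a = 0x0000_0000_00AA_0000L;
        long b = 0x0000_0000_00BB_0000L;
        System.out.println(firstMismatchedByte(a, b)); // 2
    }
}
```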

-

PR Review Comment: https://git.openjdk.org/jdk/pull/20848#discussion_r1758828018


Re: RFR: 8339531: Improve performance of MemorySegment::mismatch [v9]

2024-09-06 Thread Brett Okken
On Thu, 5 Sep 2024 17:47:16 GMT, Per Minborg  wrote:

>> This PR proposes to improve the performance of `MemorySegment::mismatch` by 
>> using Java code rather than transitioning to native code.
>
> Per Minborg has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fix errors in a benchmark

src/java.base/share/classes/jdk/internal/foreign/SegmentBulkOperations.java 
line 244:

> 242: return (Architecture.isLittleEndian()
> 243: ? Long.numberOfTrailingZeros(x)
> 244: : Long.numberOfLeadingZeros(x)) / 8;

Would there be value in having a constant LongToIntFunction lambda to capture 
this logic?


private static final LongToIntFunction LEADING_ZEROS = 
Architecture.isLittleEndian() ? Long::numberOfTrailingZeros : 
Long::numberOfLeadingZeros;

-

PR Review Comment: https://git.openjdk.org/jdk/pull/20848#discussion_r1747281751


Re: RFR: 8336831: Optimize StringConcatHelper.simpleConcat [v5]

2024-07-20 Thread Brett Okken
On Sat, 20 Jul 2024 13:11:33 GMT, Brett Okken  wrote:

>> Shaojin Wen has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Update src/java.base/share/classes/java/lang/String.java
>>
>>Co-authored-by: Chen Liang 
>>  - add comments
>
> src/java.base/share/classes/java/lang/StringConcatHelper.java line 387:
> 
>> 385: byte[] buf = newArray(newLength);
>> 386: s1.getBytes(buf, 0, coder);
>> 387: s2.getBytes(buf, s1.length(), coder);
> 
> Does s1 length need to be shifted by coder for dstBegin?

Nope - I see the index is char based

https://github.com/openjdk/jdk/blob/69901157e4dae9018abd727956f60fd11b8fa252/src/java.base/share/classes/java/lang/String.java#L4821C15-L4821C23

-

PR Review Comment: https://git.openjdk.org/jdk/pull/20253#discussion_r1685432480


Re: RFR: 8336831: Optimize StringConcatHelper.simpleConcat [v5]

2024-07-20 Thread Brett Okken
On Fri, 19 Jul 2024 21:42:09 GMT, Shaojin Wen  wrote:

>> Currently simpleConcat is implemented using mix and prepend, but in this 
>> simple scenario, it can be implemented in a simpler way and can improve 
>> performance.
>
> Shaojin Wen has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Update src/java.base/share/classes/java/lang/String.java
>
>Co-authored-by: Chen Liang 
>  - add comments

src/java.base/share/classes/java/lang/StringConcatHelper.java line 387:

> 385: byte[] buf = newArray(newLength);
> 386: s1.getBytes(buf, 0, coder);
> 387: s2.getBytes(buf, s1.length(), coder);

Does s1 length need to be shifted by coder for dstBegin?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/20253#discussion_r1685431381


Re: RFR: 8336741: Optimize LocalTime.toString with StringBuilder.repeat

2024-07-18 Thread Brett Okken
On Thu, 18 Jul 2024 11:32:47 GMT, Shaojin Wen  wrote:

> class LocalTime {
> public String toString() {
> // ...
> if (nanoValue % 1000_000 == 0) {
> buf.append(Integer.toString((nanoValue / 1000_000) + 1000).substring(1));
> } else if (nanoValue % 1000 == 0) {
> buf.append(Integer.toString((nanoValue / 1000) + 1000_000).substring(1));
> } else {
> buf.append(Integer.toString(nanoValue + 1000_000_000).substring(1));
> }
> // ...
> }
> }
> 
> Currently, LocalTime.toString handles nanos by adding a value and then 
> subString(1) to fill it with zeros. Using StringBuilder.repeat is more 
> concise and has better performance.

src/java.base/share/classes/java/time/LocalTime.java line 1653:

> 1651: int digits;
> 1652: if (nanoValue % 1000_000 == 0) {
> 1653: digits = nanoValue / 1000_000;

Should it be `1_000_000`, breaking on every 3 digits?
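The `repeat`-based zero padding the PR proposes can be sketched as below. This is a simplified stand-in for the actual `LocalTime.toString` change (the digit-count selection is condensed), and `StringBuilder.repeat(int, int)` requires JDK 21 or later.

```java
// Pad fractional seconds with repeat() instead of the add-then-substring(1)
// trick: choose the precision, then left-pad the value with zeros.
public class NanoFormatDemo {
    static void appendNanos(StringBuilder buf, int nanoValue) {
        int digits;
        int value;
        if (nanoValue % 1_000_000 == 0) {       // millisecond precision
            digits = 3;
            value = nanoValue / 1_000_000;
        } else if (nanoValue % 1_000 == 0) {    // microsecond precision
            digits = 6;
            value = nanoValue / 1_000;
        } else {                                // full nanosecond precision
            digits = 9;
            value = nanoValue;
        }
        int len = Integer.toString(value).length();
        buf.append('.').repeat('0', digits - len).append(value);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("12:34:56");
        appendNanos(sb, 7_000_000); // 7 ms
        System.out.println(sb); // 12:34:56.007
    }
}
```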

-

PR Review Comment: https://git.openjdk.org/jdk/pull/20232#discussion_r1683016529


Re: RFR: 8333833: UUID toString removes the use of ByteArrayLittleEndian [v2]

2024-06-08 Thread Brett Okken
On Sat, 8 Jun 2024 06:40:39 GMT, Shaojin Wen  wrote:

>> After PR #16245, C2 optimizes stores into primitive arrays by combining 
>> values ​​into larger stores. In the UUID.toString method, 
>> ByteArrayLittleEndian can be removed, making the code more elegant and 
>> faster.
>
> Shaojin Wen has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   add comments

src/java.base/share/classes/jdk/internal/util/HexDigits.java line 122:

> 120:  * @param value to convert
> 121:  */
> 122: public static void putHex(byte[] buffer, int off, int i) {

Should there be 2 methods, for 2 and 4 bytes respectively?
Does C2 optimize 8-byte writes as well?
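To make the "4 hexadecimal digits" question concrete, here is a hypothetical sketch of such a method: four hex digits cover only the low 2 bytes of the `int`, which is what the comment asks the documentation to spell out. The method name and digit casing are illustrative, not the actual `HexDigits` implementation.

```java
// Write the low 16 bits of value as 4 hex digits, most significant
// nibble first. The upper 2 bytes of value are ignored.
public class HexDemo {
    static void putHex4(byte[] buffer, int off, int value) {
        for (int i = 3; i >= 0; i--) {
            int nibble = (value >> (i * 4)) & 0xF;
            buffer[off++] = (byte) Character.forDigit(nibble, 16);
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[4];
        putHex4(buf, 0, 0xBEEF);
        System.out.println(new String(buf)); // beef
    }
}
```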

-

PR Review Comment: https://git.openjdk.org/jdk/pull/19610#discussion_r1632068305


Re: RFR: 8333833: UUID toString removes the use of ByteArrayLittleEndian [v2]

2024-06-08 Thread Brett Okken
On Sat, 8 Jun 2024 06:40:39 GMT, Shaojin Wen  wrote:

>> After PR #16245, C2 optimizes stores into primitive arrays by combining 
>> values into larger stores. In the UUID.toString method, 
>> ByteArrayLittleEndian can be removed, making the code more elegant and 
>> faster.
>
> Shaojin Wen has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   add comments

src/java.base/share/classes/jdk/internal/util/HexDigits.java line 117:

> 115: 
> 116: /**
> 117:  * Insert the int into the buffer as 4 hexadecimal digits

4 hex digits is 2 bytes, right?
Should this take a short instead?
At a minimum, it seems the documentation should be clear as to which 2 bytes are used.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/19610#discussion_r1632064197


Re: large longs to string

2024-04-24 Thread Brett Okken
Claes,

> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 
> int divs

Exactly. I think there is also some benefit from unrolling the 4 int
digit pair operations.

> which might not be true on all platforms, nor stay true in the future

Agreed, but I am not sure how to reason about that.

> Long values will in practice likely be biased towards lower values, so it’s 
> important that any optimization to .. longer values doesn’t regress inputs in 
> the int range. Since it’s all guarded by a test that is already there, there 
> shouldn’t be much room for a difference, but adding code can cause 
> interesting issues so it’s always worth measuring to make sure. Have you run 
> any benchmark for inputs smaller than the threshold? And for a healthy mix of 
> values?

I had been focused on "longer" values, as I have uses which are
effectively guaranteed to have some bits from 53-63 set.
I added some tests for "int" values as well as a mix with 2 "longs"
and 3 "ints". This showed a pretty small regression for the int and
mixed case. That regression went away by changing back to a while loop
comparing to Integer.MIN_VALUE, and that did not give up much of the
gain.

private static final long[] ints = new long[] {Integer.MAX_VALUE,
12345, -5432, 654321, 5};
private static final long[] longs = new long[] {235789987235L,
235789987235326L, -98975433216803632L, Long.MAX_VALUE};
private static final long[] mixed = new long[] {5, Long.MIN_VALUE,
654321, 9876543210L, 543};


Benchmark                        (type)  Mode  Cnt   Score    Error  Units
LongToStringBenchmark.baseline      int  avgt    3  24.105 ± 11.751  ns/op
LongToStringBenchmark.baseline     long  avgt    3  51.223 ± 21.069  ns/op
LongToStringBenchmark.baseline      mix  avgt    3  34.849 ±  7.270  ns/op
LongToStringBenchmark.change        int  avgt    3  25.846 ±  2.012  ns/op
LongToStringBenchmark.change       long  avgt    3  43.053 ± 13.886  ns/op
LongToStringBenchmark.change        mix  avgt    3  35.826 ±  2.901  ns/op
LongToStringBenchmark.changeLoop    int  avgt    3  24.261 ±  7.325  ns/op
LongToStringBenchmark.changeLoop   long  avgt    3  44.254 ± 22.921  ns/op
LongToStringBenchmark.changeLoop    mix  avgt    3  29.501 ±  8.693  ns/op


"change" is what I had proposed originally and "changeLoop" is:

while (i <= Integer.MIN_VALUE) {
  long q = i / 100_000_000;
  charPos -= 8;
  write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
  i = q;
}

Brett


On Wed, Apr 24, 2024 at 2:39 PM Claes Redestad
 wrote:
>
> Hi,
>
> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 
> int divs, which might not be true on all platforms, nor stay true in the 
> future. Long values will in practice likely be biased towards lower values, 
> so it’s important that any optimization to .. longer values doesn’t regress 
> inputs in the int range. Since it’s all guarded by a test that is already 
> there, there shouldn’t be much room for a difference, but adding code can 
> cause interesting issues so it’s always worth measuring to make sure. Have 
> you run any benchmark for inputs smaller than the threshold? And for a 
> healthy mix of values?
>
> Thanks!
> /Claes
>
> 24 apr. 2024 kl. 21:08 skrev Brett Okken :
>
> Is there interest in optimizing StringLatin1.getChars(long, int, byte[]) for 
> large (larger than int) long values[1]?
> We can change this to work with 8 digits at a time, which reduces the amount 
> of 64 bit arithmetic required.
>
> if (i <= -1_000_000_000) {
> long q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> if (i <= -1_000_000_000) {
> q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> }
> }
>
>
> A simple implementation of write4DigitPairs would just call the existing 
> writeDigitPair method 4 times:
>
>
> private static void write4DigitPairs(byte[] buf, int idx, int value) {
> int v = value;
> int v2 = v / 100;
> writeDigitPair(buf, idx + 6, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 4, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 2, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx, v - (v2 * 100));
> }
>
> There is the option to OR the 4 short values together into a long and 
> leverage a ByteArrayLittleEndian.setLong call, but I see that the previous 
> usage of ByteArrayLittleEndian.setShort was removed[2].
>
> A small benchmark of longs which would qualify shows up to 20% improvement.
>
> Presumably a similar change could make sense for StringUTF16, but I have not 
> spent any time benchmarking it.
>
> Brett
>
> [1] - 
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168
> [2] - 
> https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9
>
>


large longs to string

2024-04-24 Thread Brett Okken
Is there interest in optimizing StringLatin1.getChars(long, int, byte[])
for large (larger than int) long values[1]?
We can change this to work with 8 digits at a time, which reduces the
amount of 64 bit arithmetic required.

if (i <= -1_000_000_000) {
long q = i / 100_000_000;
charPos -= 8;
write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
i = q;
if (i <= -1_000_000_000) {
q = i / 100_000_000;
charPos -= 8;
write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
i = q;
}
}


A simple implementation of write4DigitPairs would just call the existing
writeDigitPair method 4 times:


private static void write4DigitPairs(byte[] buf, int idx, int value) {
int v = value;
int v2 = v / 100;
writeDigitPair(buf, idx + 6, v - (v2 * 100));
v = v2;

v2 = v / 100;
writeDigitPair(buf, idx + 4, v - (v2 * 100));
v = v2;

v2 = v / 100;
writeDigitPair(buf, idx + 2, v - (v2 * 100));
v = v2;

v2 = v / 100;
writeDigitPair(buf, idx, v - (v2 * 100));
}

There is the option to OR the 4 short values together into a long and
leverage a ByteArrayLittleEndian.setLong call, but I see that the previous
usage of ByteArrayLittleEndian.setShort was removed[2].

A small benchmark of longs which would qualify shows up to 20% improvement.

Presumably a similar change could make sense for StringUTF16, but I have
not spent any time benchmarking it.

Brett

[1] -
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168
[2] -
https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9
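The arithmetic in the snippet above can be checked with a small runnable sketch. It assumes, as the JDK code does, that `i` holds the negated value (negative numbers have a larger representable range), so `(q * 100_000_000) - i` yields the low 8 digits as a non-negative int.

```java
// One step of the 8-digits-at-a-time split: q keeps the (negated) high
// digits, and the difference recovers the low 8 digits.
public class EightDigitDemo {
    static int low8Digits(long i, long q) {
        return (int) ((q * 100_000_000) - i);
    }

    public static void main(String[] args) {
        long i = -9_876_543_210L;     // negated 9876543210
        long q = i / 100_000_000;     // -98, the remaining high digits (negated)
        System.out.println(q + " " + low8Digits(i, q)); // -98 76543210
    }
}
```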


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]

2024-04-13 Thread Brett Okken
On Fri, 12 Apr 2024 10:19:27 GMT, Claes Redestad  wrote:

>> src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 
>> 1430:
>> 
>>> 1428: cb.new_(STRING_BUILDER);
>>> 1429: cb.dup();
>>> 1430: cb.invokespecial(STRING_BUILDER, "", 
>>> MethodTypeDesc.ofDescriptor("()V"));
>> 
>> Would there be value in initializing to a larger capacity? Given the number 
>> of append calls, seems the default cap of 16 is unlikely to be sufficient.
>
> Possibly. I tried a few simple variants that initialized the `StringBuilder` 
> capacity at various guesses, such as sum of constant sizes + some factor 
> based on args, but across a diverse set of micros that gives both some wins 
> and some regressions. Getting the estimation just right is pretty tricky, 
> especially when size of the arguments are arbitrary (like for any 
> String/Object arg).

What are the scenarios which had regressions? 
Given the conservative growth for StringBuilder, it surprises me a bit that any 
scenario would regress.
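The "conservative growth" referred to above can be observed directly; as a sketch (growth details are an implementation behavior of `AbstractStringBuilder`, roughly doubling from a default capacity of 16):

```java
// Print each capacity change of a default StringBuilder as it is filled
// one char at a time.
public class GrowthDemo {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder(); // default capacity is 16
        int last = sb.capacity();
        for (int i = 0; i < 100; i++) {
            sb.append('x');
            if (sb.capacity() != last) {
                System.out.println("length " + sb.length() + " -> capacity " + sb.capacity());
                last = sb.capacity();
            }
        }
    }
}
```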

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1564165946


Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases

2024-04-11 Thread Brett Okken
On Tue, 9 Apr 2024 12:01:49 GMT, Claes Redestad  wrote:

> This patch suggests a workaround to an issue with huge SCF MH expression 
> trees taking excessive JIT compilation resources by reviving (part of) the 
> simple bytecode-generating strategy that was originally available as an 
> all-or-nothing strategy choice. 
> 
> Instead of reintroducing a binary strategy choice I propose a threshold 
> parameter, controlled by 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions 
> below or at this threshold there's no change, for expressions with an arity 
> above it we use the `StringBuilder`-chain bytecode generator. 
> 
> There are a few trade-offs at play here which influence the choice of 
> threshold. The simple high arity strategy will for example not see any reuse 
> of LambdaForms but strictly always generate a class per indy callsite, which 
> means we might end up with a higher total number of classes generated and 
> loaded in applications if we set this value too low. It may also produce 
> worse performance on average. On the other hand there is the observed 
> increase in C2 resource usage as expressions grow unwieldy. On the other 
> other hand high arity expressions are likely rare to begin with, with less 
> opportunities for sharing than the more common low-arity expressions. 
> 
> I turned the submitted test case into a few JMH benchmarks and did some 
> experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`:
> 
> Baseline strategy:
> 13 args: 6.3M
> 23 args: 18M
> 123 args: 868M
> 
> `-Djava.lang.invoke.StringConcat.highArityThreshold=0`:
> 13 args: 2.11M
> 23 args: 3.67M
> 123 args: 4.75M
> 
> For 123 args the memory overhead of the baseline strategy is 180x, but for 23 
> args we're down to a 5x memory overhead, and down to a 3x overhead for 13 
> args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) 
> I've conservatively chosen a threshold at arity 20. This keeps C2 resource 
> pressure at a reasonable level (< 18M) while avoiding perturbing performance 
> at the vast majority of call sites.
> 
> I was asked to use the new class file API for mainline. There's a version of 
> this patch implemented using ASM in 7c52a9f which might be a reasonable basis 
> for a backport.

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1430:

> 1428: cb.new_(STRING_BUILDER);
> 1429: cb.dup();
> 1430: cb.invokespecial(STRING_BUILDER, "", 
> MethodTypeDesc.ofDescriptor("()V"));

Would there be value in initializing to a larger capacity? Given the number of 
append calls, seems the default cap of 16 is unlikely to be sufficient.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1561966067


Integrated: 8321470: ThreadLocal.nextHashCode can be static final

2023-12-07 Thread Brett Okken
On Wed, 6 Dec 2023 00:52:48 GMT, Brett Okken  wrote:

> The static AtomicInteger used for the nextHashCode should be final.

This pull request has now been integrated.

Changeset: c42535f1
Author:    Brett Okken 
Committer: Aleksey Shipilev 
URL:   
https://git.openjdk.org/jdk/commit/c42535f1110d60d1472080ad4fcadb8acbeb4c4b
Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod

8321470: ThreadLocal.nextHashCode can be static final

Reviewed-by: shade, jpai

-

PR: https://git.openjdk.org/jdk/pull/16987


Re: RFR: 8321470: ThreadLocal.nextHashCode can be static final [v2]

2023-12-06 Thread Brett Okken
> The static AtomicInteger used for the nextHashCode should be final.

Brett Okken has updated the pull request incrementally with one additional 
commit since the last revision:

  Update full name

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/16987/files
  - new: https://git.openjdk.org/jdk/pull/16987/files/9e276138..d05272a3

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16987&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16987&range=00-01

  Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/16987.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16987/head:pull/16987

PR: https://git.openjdk.org/jdk/pull/16987


RFR: 8321470: ThreadLocal.nextHashCode can be static final

2023-12-06 Thread Brett Okken
The static AtomicInteger used for the nextHashCode should be final.

-

Commit messages:
 - Merge remote-tracking branch 'upstream/master' into threadlocal_final
 - make ThreadLocal.nextHashCode final

Changes: https://git.openjdk.org/jdk/pull/16987/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16987&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8321470
  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/16987.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16987/head:pull/16987

PR: https://git.openjdk.org/jdk/pull/16987


ThreadLocal nextHashCode

2023-12-05 Thread Brett Okken
Should the private static AtomicInteger nextHashCode also be final?

https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ThreadLocal.java#L100


Re: RFR: 8321059: Unneeded array assignments in MergeCollation and CompactByteArray

2023-11-30 Thread Brett Okken
On Thu, 30 Nov 2023 20:46:12 GMT, Naoto Sato  wrote:

> Removing the unnecessary array assignments.

src/java.base/share/classes/sun/text/CompactByteArray.java line 83:

> 81: for (i = 0; i < UNICODECOUNT; ++i) {
> 82: values[i] = defaultValue;
> 83: }

Should this be `Arrays.fill(values, defaultValue);`?

https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/Arrays.html#fill(byte%5B%5D,byte)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16912#discussion_r1411279673


Re: RFR: 8309622: Re-examine the cache mechanism in BaseLocale [v3]

2023-06-16 Thread Brett Okken
On Fri, 16 Jun 2023 17:18:38 GMT, Naoto Sato  wrote:

>> src/java.base/share/classes/sun/util/locale/BaseLocale.java line 175:
>> 
>>> 173: 
>>> LocaleUtils.toLowerString(b.getLanguage()).intern(),
>>> 174: 
>>> LocaleUtils.toTitleString(b.getScript()).intern(),
>>> 175: 
>>> LocaleUtils.toUpperString(b.getRegion()).intern(),
>> 
>> don't lines 147 and 148 above already guarantee the language and region have 
>> correct case?
>
> 147,148 are in `getInstance()` and the BaseLocale key here is created with a 
> constructor, which simply copies the passed arguments. So normalization is 
> needed here

Perhaps I am missing something, but aren't `language` and `region` being 
reassigned to the case-adjusted values? So the values used to create the "key" 
BaseLocale instance reflect that case adjustment. (language is again 
potentially adjusted at line 162 to account for old ISO codes, but that never 
results in the wrong case.)


language = LocaleUtils.toLowerString(language);
region = LocaleUtils.toUpperString(region);

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14404#discussion_r1232880865


Re: RFR: 8309622: Re-examine the cache mechanism in BaseLocale [v3]

2023-06-16 Thread Brett Okken
On Fri, 16 Jun 2023 17:18:33 GMT, Naoto Sato  wrote:

>> src/java.base/share/classes/sun/util/locale/BaseLocale.java line 168:
>> 
>>> 166: // BaseLocale as the key. The returned "normalized" instance
>>> 167: // can subsequently be used by the Locale instance which
>>> 168: // guarantees the locale components are properly 
>>> cased/interned.
>> 
>> Is it truly important that the String values are interned, or is the desire 
>> simply that all refer to the same "canonical" String instance?
>> my thought is that managing the canonical String instances could avoid 
>> WeakReferences and synchronized map access and just return a new BaseLocale 
>> each time.
>> 
>> 
>> 
>> 
>> public static BaseLocale getInstance(String language, String script,
>>  String region, String variant) {
>> if (script == null || script.isEmpty()) {
>> script = "";
>> } else {
>> script = getCanonicalString(script, LocaleUtils::toTitleString);
>> }
>> 
>> if (region == null || region.isEmpty()) {
>> region = "";
>> } else {
>> region = getCanonicalString(region, LocaleUtils::toUpperString);
>> }
>> 
>> if (language == null || language.isEmpty()) {
>> language = "";
>> } else {
>> language = getCanonicalString(language, 
>> LocaleUtils::toLowerString);
>> }
>> 
>> if (variant == null || variant.isEmpty()) {
>> variant = "";
>> } else {
>> variant = getCanonicalString(variant, Function.identity());
>> }
>> ...
>> return new BaseLocale(language, script, region, variant);
>> }
>> 
>> 
>> 
>> private static final ConcurrentMap<String, String> CANONICALIZED_STRINGS 
>> = new ConcurrentHashMap<>();
>> 
>> static {
>> // prime the old iso codes
>> CANONICALIZED_STRINGS.put("he", "he");
>> CANONICALIZED_STRINGS.put("iw", "iw");
>> CANONICALIZED_STRINGS.put("id", "id");
>> CANONICALIZED_STRINGS.put("in", "in");
>> CANONICALIZED_STRINGS.put("ji", "ji");
>> CANONICALIZED_STRINGS.put("yi", "yi");
>> }
>> 
>> private static String getCanonicalString(String value, Function<String, 
>> String> conversion) {
>> String existing = CANONICALIZED_STRINGS.get(value);
>> if (existing != null) {
>> return existing;
>> }
>> 
>> String converted = conversion.apply(value);
>> return CANONICALIZED_STRINGS.merge(converted, converted, (k,v) -> 
>> v);...
>
> Interning components in a Locale instance (through BaseLocale) is a long 
> standing behavior. Changing it could break apps that might have relied on it 
> (using `==` for comparison)

as in doing `"en" == locale.getLanguage()`?

In that case, the interned values could still be used as the canonical values, 
and the CHM can account both for case adjustments and interning (and likely 
faster than actually calling intern). I am curious how this does with your jmh 
test:



private static final ConcurrentMap<String, String> 
CANONICALIZED_UPPER_STRINGS = new ConcurrentHashMap<>();
private static final ConcurrentMap<String, String> 
CANONICALIZED_LOWER_STRINGS = new ConcurrentHashMap<>();
private static final ConcurrentMap<String, String> 
CANONICALIZED_TITLE_STRINGS = new ConcurrentHashMap<>();
private static final ConcurrentMap<String, String> CANONICALIZED_STRINGS = 
new ConcurrentHashMap<>();

public static BaseLocale getInstance(String language, String script,
 String region, String variant) {
if (script == null || script.isEmpty()) {
script = "";
} else {
script = getCanonicalString(script, LocaleUtils::toTitleString, 
CANONICALIZED_TITLE_STRINGS);
}

if (region == null || region.isEmpty()) {
region = "";
} else {
region = getCanonicalString(region, LocaleUtils::toUpperString, 
CANONICALIZED_UPPER_STRINGS);
}

if (language == null || language.isEmpty()) {
language = "";
} else {
language = getCanonicalString(language, LocaleUtils::toLowerString, 
CANONICALIZED_LOWER_STRINGS);
}

if (variant == null || variant.isEmpty()) {
variant = "";
} else {
variant = getCanonicalString(variant, Function.identity(), 
CANONICALIZED_STRINGS);
}
...
return new BaseLocale(language, script, region, variant);
}

private static String getCanonicalString(String value, Function<String, String> conversion, ConcurrentMap<String, String> map) {
String existing = map.get(value);
if (existing != null) {
return existing;
}

String converted = conversion.apply(value).intern();
// put the interned value in, keyed by the interned value
map.putIfAbsent(converted, converted);

// also put the interned value in for 

Re: RFR: 8309622: Re-examine the cache mechanism in BaseLocale [v3]

2023-06-16 Thread Brett Okken
On Tue, 13 Jun 2023 17:56:57 GMT, Naoto Sato  wrote:

>> This is stemming from the PR: https://github.com/openjdk/jdk/pull/14211 
>> where aggressive GC can cause NPE in `BaseLocale$Key` class. I refactored 
>> the in-house cache with WeakHashMap, and removed the Key class as it is no 
>> longer needed (thus the original NPE will no longer be possible). Also with 
>> the new JMH test case, it gains some performance improvement:
>> 
>> (w/o fix)
>> 
>> Benchmark   Mode  Cnt  Score Error  Units
>> LocaleCache.testForLanguageTag  avgt   20   5781.275 ± 569.580  ns/op
>> LocaleCache.testLocaleOf        avgt   20  62564.079 ± 406.697  ns/op
>> 
>> (w/ fix)
>> Benchmark   Mode  Cnt  Score Error  Units
>> LocaleCache.testForLanguageTag  avgt   20   4801.175 ± 371.830  ns/op
>> LocaleCache.testLocaleOf        avgt   20  60394.652 ± 352.471  ns/op
>
> Naoto Sato has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Addressing review comments

src/java.base/share/classes/sun/util/locale/BaseLocale.java line 168:

> 166: // BaseLocale as the key. The returned "normalized" instance
> 167: // can subsequently be used by the Locale instance which
> 168: // guarantees the locale components are properly cased/interned.

Is it truly important that the String values are interned, or is the desire 
simply that all refer to the same "canonical" String instance?
my thought is that managing the canonical String instances could avoid 
WeakReferences and synchronized map access and just return a new BaseLocale 
each time.




public static BaseLocale getInstance(String language, String script,
 String region, String variant) {
if (script == null) {
script = "";
} else {
script = getCanonicalString(script, LocaleUtils::toTitleString);
}

if (region == null) {
region = "";
} else {
region = getCanonicalString(region, LocaleUtils::toUpperString);
}

if (language == null) {
language = "";
} else {
language = getCanonicalString(language, LocaleUtils::toLowerString);
}

if (variant == null) {
variant = "";
} else {
variant = getCanonicalString(variant, Function.identity());
}
...
return new BaseLocale(language, script, region, variant);
}



private static final ConcurrentMap<String, String> CANONICALIZED_STRINGS = 
new ConcurrentHashMap<>();

static {
// prime the old iso codes
CANONICALIZED_STRINGS.put("he", "he");
CANONICALIZED_STRINGS.put("iw", "iw");
CANONICALIZED_STRINGS.put("id", "id");
CANONICALIZED_STRINGS.put("in", "in");
CANONICALIZED_STRINGS.put("ji", "ji");
CANONICALIZED_STRINGS.put("yi", "yi");
}

private static String getCanonicalString(String value, Function<String, String> conversion) {
String existing = CANONICALIZED_STRINGS.get(value);
if (existing != null) {
return existing;
}

String converted = conversion.apply(value);
return CANONICALIZED_STRINGS.merge(converted, converted, (k,v) -> v);
}

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14404#discussion_r1232205527
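The pattern sketched above (a lock-free `get` first, falling back to `merge` only on a miss) can be distilled into a runnable form; the class name and `main` driver are illustrative, not the actual BaseLocale code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

public class CanonicalCache {
    private static final ConcurrentMap<String, String> CANONICAL = new ConcurrentHashMap<>();

    // Optimistic read: get() never locks; merge() is only reached on a miss.
    static String canonicalize(String value, Function<String, String> conversion) {
        String existing = CANONICAL.get(value);
        if (existing != null) {
            return existing;
        }
        String converted = conversion.apply(value);
        return CANONICAL.merge(converted, converted, (k, v) -> v);
    }

    public static void main(String[] args) {
        String a = canonicalize("EN", String::toLowerCase); // miss: converts and caches "en"
        String b = canonicalize("en", Function.identity()); // hit: returns the cached instance
        System.out.println(a.equals(b) && b == CANONICAL.get("en")); // true
    }
}
```

Note that, as in the snippet above, a miss caches only under the converted key, so a later lookup of the unconverted form still pays the conversion cost.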


Re: RFR: 8309622: Re-examine the cache mechanism in BaseLocale [v3]

2023-06-16 Thread Brett Okken
On Tue, 13 Jun 2023 17:56:57 GMT, Naoto Sato  wrote:

>> This is stemming from the PR: https://github.com/openjdk/jdk/pull/14211 
>> where aggressive GC can cause NPE in `BaseLocale$Key` class. I refactored 
>> the in-house cache with WeakHashMap, and removed the Key class as it is no 
>> longer needed (thus the original NPE will no longer be possible). Also with 
>> the new JMH test case, it gains some performance improvement:
>> 
>> (w/o fix)
>> 
>> Benchmark   Mode  Cnt  Score Error  Units
>> LocaleCache.testForLanguageTag  avgt   20   5781.275 ± 569.580  ns/op
>> LocaleCache.testLocaleOf        avgt   20  62564.079 ± 406.697  ns/op
>> 
>> (w/ fix)
>> Benchmark   Mode  Cnt  Score Error  Units
>> LocaleCache.testForLanguageTag  avgt   20   4801.175 ± 371.830  ns/op
>> LocaleCache.testLocaleOf        avgt   20  60394.652 ± 352.471  ns/op
>
> Naoto Sato has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Addressing review comments

src/java.base/share/classes/sun/util/locale/BaseLocale.java line 175:

> 173: LocaleUtils.toLowerString(b.getLanguage()).intern(),
> 174: LocaleUtils.toTitleString(b.getScript()).intern(),
> 175: LocaleUtils.toUpperString(b.getRegion()).intern(),

don't lines 147 and 148 above already guarantee the language and region have 
correct case?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14404#discussion_r1232167067


Re: RFR: 8309665: Simplify Arrays.copyOf/-Range methods

2023-06-08 Thread Brett Okken
On Thu, 8 Jun 2023 15:08:28 GMT, Claes Redestad  wrote:

> https://bugs.openjdk.org/browse/JDK-8301958 and later changes improved 
> Arrays.copyOf/-Range methods to improve peak performance in microbenchmarks 
> when copying the entire array, but it's resulted in a few lurking footprint 
> benchmark issues that come down to incurring slightly more JIT activity. As a 
> partial remedy I propose simplifying the implementation so that no new 
> methods are added compared to the JDK 20 baseline. This patch maintains the 
> microbenchmark speed-up from JDK-8301958 while reducing JIT compilations in 
> startup/footprint benchmarks.

src/java.base/share/classes/java/util/Arrays.java line 3845:

> 3843:  */
> 3844: public static byte[] copyOfRange(byte[] original, int from, int to) {
> 3845: // Tickle the JIT to fold special cases optimally

this comment has been inconsistently left vs removed in the various methods 
(even though the other changes appear the same). was this intentional?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14380#discussion_r1223276847


Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v7]

2023-06-01 Thread Brett Okken
On Thu, 1 Jun 2023 09:37:29 GMT, Aleksey Shipilev  wrote:

>> UUID is the very important class that is used to track identities of objects 
>> in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% 
>> of total CPU time, and is frequently a scalability bottleneck due to 
>> `SecureRandom` synchronization.
>> 
>> The major issue with UUID code itself is that it reads from the single 
>> `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` 
>> is bashed with very small requests. This also has a chilling effect on other 
>> users of `SecureRandom`, when there is a heavy UUID generation traffic.
>> 
>> We can improve this by doing the bulk reads from the backing SecureRandom 
>> and possibly striping the reads across many instances of it. 
>> 
>> 
>> Benchmark   Mode  Cnt  Score   Error   Units
>> 
>> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
>> UUIDRandomBench.max thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
>> 
>> # After
>> UUIDRandomBench.single  thrpt   15  4.823 ± 0.023  ops/us 
>> UUIDRandomBench.max thrpt   15  6.561 ± 0.054  ops/us ; positive 
>> scaling, ~1.5x
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
>> UUIDRandomBench.max thrpt   15  1.880 ± 0.029  ops/us  ; negative 
>> scaling 
>> 
>> # After
>> BenchmarkMode  Cnt  Score   Error   Units
>> UUIDRandomBench.single  thrpt   15  3.109 ± 0.026  ops/us
>> UUIDRandomBench.max thrpt   15  3.561 ± 0.071  ops/us  ; positive 
>> scaling, ~1.2x
>> 
>> 
>> Note that there is still a scalability bottleneck in current default random 
>> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 
>> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure 
>> out the synchronization story there. The scalability fix in current default 
>> `SecureRandom` would be much more intrusive and risky, since it would change 
>> a core crypto class with unknown bug fanout.
>> 
>> Using the bulk reads even when the underlying PRNG is heavily synchronized 
>> is still a win. A more scalable PRNG would benefit from this as well. This 
>> PR adds a system property to select the PRNG implementation, and there we 
>> can clearly see the benefit with more scalable PRNG sources:
>> 
>> 
>> Benchmark   Mode  Cnt   Score   Error   Units
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before, hacked `new SecureRandom()` to 
>> `SecureRandom.getInstance("SHA1PRNG")`
>> UUIDRandomBench.single  thrpt  ...
>
> Aleksey Shipilev has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 13 commits:
> 
>  - Revert test changes
>  - Merge branch 'master' into JDK-8308804-uuid-buffers
>  - Simplify clinit
>  - Remove the properties
>  - Swap lsb/msb
>  - Fine-tune exceptions
>  - Handle privileged properties
>  - Use ByteArray to convert. Do version/variant preparations straight on 
> locals. Move init out of optimistic lock section.
>  - More touchups
>  - Comment updates
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/4460429d...fd7eaa1a

src/java.base/share/classes/java/util/UUID.java line 234:

> 232: long msb = ByteArray.getLong(buf, 0);
> 233: long lsb = ByteArray.getLong(buf, 8);
> 234: return fromRandom(msb, lsb);

is there any value in moving this outside of the critical section?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1213117598


Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access

2023-05-25 Thread Brett Okken
On Wed, 24 May 2023 19:36:44 GMT, Aleksey Shipilev  wrote:

> UUID is the very important class that is used to track identities of objects 
> in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% 
> of total CPU time, and is frequently a scalability bottleneck due to 
> `SecureRandom` synchronization.
> 
> The major issue with UUID code itself is that it reads from the single 
> `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` 
> is bashed with very small requests. This also has a chilling effect on other 
> users of `SecureRandom`, when there is a heavy UUID generation traffic.
> 
> We can improve this by doing the bulk reads from the backing SecureRandom and 
> possibly striping the reads across many instances of it. 
> 
> 
> Benchmark   Mode  Cnt  Score   Error   Units
> 
> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
> 
> # Before
> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
> UUIDRandomBench.max thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
> 
> # After
> UUIDRandomBench.single  thrpt   15  4.421 ± 0.047  ops/us 
> UUIDRandomBench.max thrpt   15  6.658 ± 0.092  ops/us ; positive scaling, 
> ~1.5x
> 
> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
> 
> # Before
> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
> UUIDRandomBench.max thrpt   15  1.880 ± 0.029  ops/us  ; negative scaling 
> 
> # After
> BenchmarkMode  Cnt  Score   Error   Units
> UUIDRandomBench.single  thrpt   15  3.099 ± 0.022  ops/us
> UUIDRandomBench.max thrpt   15  3.555 ± 0.062  ops/us  ; positive 
> scaling, ~1.2x
> 
> 
> Note that there is still a scalability bottleneck in current default random 
> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 
> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure 
> out the synchronization story there. The scalability fix in current default 
> `SecureRandom` would be much more intrusive and risky, since it would change 
> a core crypto class with unknown bug fanout.
> 
> Using the bulk reads even when the underlying PRNG is heavily synchronized is 
> still a win. A more scalable PRNG would benefit from this as well. This PR 
> adds a system property to select the PRNG implementation, and there we can 
> clearly see the benefit with more scalable PRNG sources:
> 
> 
> Benchmark   Mode  Cnt   Score   Error   Units
> 
> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
> 
> # Before, hacked `new SecureRandom()` to 
> `SecureRandom.getInstance("SHA1PRNG")`
> UUIDRandomBench.single  thrpt   15  3.661 ± 0.008  ops/us
> UUIDRandomBench...

src/java.base/share/classes/java/util/UUID.java line 255:

> 253: // initializations, and thus false sharing between 
> reader threads.
> 254: random.nextBytes(buf);
> 255: for (int c = 0; c < BUF_SIZE; c += UUID_CHUNK) {

I think this could be faster by using a ByteBuffer (or VarHandle) to process as 
longs.

https://mail.openjdk.org/pipermail/core-libs-dev/2023-March/101249.html

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206118465
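Reading a 16-byte buffer as two longs with a `byteArrayViewVarHandle`, as suggested above, looks roughly like this (a standalone sketch, not the PR's code; the class name is illustrative):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class LongView {
    private static final VarHandle LONGS =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.BIG_ENDIAN);

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        buf[7] = 1;   // big-endian: least significant byte of the first long
        buf[15] = 2;  // least significant byte of the second long
        long msb = (long) LONGS.get(buf, 0); // index is a byte offset
        long lsb = (long) LONGS.get(buf, 8);
        System.out.println(msb == 1L && lsb == 2L); // true
    }
}
```

For randomUUID, where byte order does not matter, `ByteOrder.nativeOrder()` avoids byte-swapping on little-endian hardware.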


Re: UUID creation performance

2023-03-05 Thread Brett Okken
The new ByteArray class works great for the nameUUIDFromBytes method,
which must be in big endian.
For randomUUID, byte order does not matter, so using native would be
fastest, but there does not appear to be a utility class for that.
Is there a preference of just having a native order VarHandle to use
in UUID vs. having a utility method which chooses which utility class
to call based on the native order vs. some other option?

Thanks,
Brett

On Wed, Mar 1, 2023 at 9:08 AM Roger Riggs  wrote:
>
> Hi,
>
> That's an interesting idea.  Recently VarHandle access methods were
> created by JDK-8300236 [1] [2]
> in the jdk.internal.util package. See the ByteArray and
> ByteArrayLittleEndian classes.
>
> See how that would affect performance and leverage existing VarHandles.
>
> Thanks, Roger
>
> [1] https://bugs.openjdk.org/browse/JDK-8300236
> [2] https://github.com/openjdk/jdk/pull/12076
>
> On 3/1/23 7:50 AM, Brett Okken wrote:
> > Is there any interest in updating the static UUID.randomUUID() and
> > UUID.nameUUIDFromBytes(byte[]) factory methods to use either a
> > ByteBuffer or byteArrayViewVarHandle to convert the byte[] to 2 long
> > values then do the bit twiddling?
> > These methods are really dominated by time to create/populate the
> > byte[], but this does reduce the time to create the 2 long values by
> > at least half.
> > It would also allow the removal of the private UUID(byte[] data).
> >
> >  public static UUID randomUUID() {
> >  SecureRandom ng = Holder.numberGenerator;
> >
> >  byte[] randomBytes = new byte[16];
> >  ng.nextBytes(randomBytes);
> >  final ByteBuffer bb = ByteBuffer.wrap(randomBytes);
> >  bb.order(ByteOrder.nativeOrder());
> >
> >  long msb = bb.getLong();
> >  long lsb = bb.getLong();
> >
> >  msb &= 0xFFFFFFFFFFFF0FFFL;  /* clear version        */
> >  msb |= 0x0000000000004000L;  /* set to version 4     */
> >
> >  lsb &= 0x3FFFFFFFFFFFFFFFL;  /* clear variant        */
> >  lsb |= 0x8000000000000000L;  /* set to IETF variant  */
> >
> >  return new UUID(msb, lsb);
> >  }
> >
> >  public static UUID nameUUIDFromBytes(byte[] name) {
> >  MessageDigest md;
> >  try {
> >  md = MessageDigest.getInstance("MD5");
> >  } catch (NoSuchAlgorithmException nsae) {
> >  throw new InternalError("MD5 not supported", nsae);
> >  }
> >  byte[] md5Bytes = md.digest(name);
> >
> >  // default byte order is BIG_ENDIAN
> >  final ByteBuffer bb = ByteBuffer.wrap(md5Bytes);
> >
> >  long msb = bb.getLong();
> >  long lsb = bb.getLong();
> >
> >  msb &= 0xFFFFFFFFFFFF0FFFL;  /* clear version        */
> >  msb |= 0x0000000000003000L;  /* set to version 3     */
> >
> >  lsb &= 0x3FFFFFFFFFFFFFFFL;  /* clear variant        */
> >  lsb |= 0x8000000000000000L;  /* set to IETF variant  */
> >
> >  return new UUID(msb, lsb);
> >  }
> >
> > Benchmark                    Mode  Cnt   Score   Error  Units
> > UUIDBenchmark.jdk_name       avgt    3  11.885 ± 4.025  ns/op
> > UUIDBenchmark.jdk_random     avgt    3  11.656 ± 0.987  ns/op
> > UUIDBenchmark.longs          avgt    3   7.618 ± 1.047  ns/op
> > UUIDBenchmark.longs_bb       avgt    3   7.755 ± 1.643  ns/op
> > UUIDBenchmark.longs_name     avgt    3   8.467 ± 1.784  ns/op
> > UUIDBenchmark.longs_name_bb  avgt    3   8.455 ± 1.662  ns/op
> > UUIDBenchmark.randomBytes    avgt    3   6.132 ± 0.447  ns/op
> >
> >
> > @BenchmarkMode(Mode.AverageTime)
> > @OutputTimeUnit(TimeUnit.NANOSECONDS)
> > @Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
> > @Measurement(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
> > @Fork(1)
> > @State(Scope.Benchmark)
> > public class UUIDBenchmark {
> >
> >  private static final VarHandle LONGS_ACCESS =
> > MethodHandles.byteArrayViewVarHandle(long[].class,
> > ByteOrder.nativeOrder());
> >
> >  private static final VarHandle BE_LONGS_ACCESS =
> > MethodHandles.byteArrayViewVarHandle(long[].class,
> > ByteOrder.BIG_ENDIAN);
> >
> >  @Benchmark
> >  public byte[] randomBytes() {
> >  final byte[] bytes = new byte[16];
> >  randomBytes(bytes);
> >  return bytes;
> >  }
> >
> >  @Benchmark

UUID creation performance

2023-03-01 Thread Brett Okken
Is there any interest in updating the static UUID.randomUUID() and
UUID.nameUUIDFromBytes(byte[]) factory methods to use either a
ByteBuffer or byteArrayViewVarHandle to convert the byte[] to 2 long
values then do the bit twiddling?
These methods are really dominated by time to create/populate the
byte[], but this does reduce the time to create the 2 long values by
at least half.
It would also allow the removal of the private UUID(byte[] data).

public static UUID randomUUID() {
SecureRandom ng = Holder.numberGenerator;

byte[] randomBytes = new byte[16];
ng.nextBytes(randomBytes);
final ByteBuffer bb = ByteBuffer.wrap(randomBytes);
bb.order(ByteOrder.nativeOrder());

long msb = bb.getLong();
long lsb = bb.getLong();

msb &= 0xFFFFFFFFFFFF0FFFL;  /* clear version        */
msb |= 0x0000000000004000L;  /* set to version 4     */

lsb &= 0x3FFFFFFFFFFFFFFFL;  /* clear variant        */
lsb |= 0x8000000000000000L;  /* set to IETF variant  */

return new UUID(msb, lsb);
}

public static UUID nameUUIDFromBytes(byte[] name) {
MessageDigest md;
try {
md = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException nsae) {
throw new InternalError("MD5 not supported", nsae);
}
byte[] md5Bytes = md.digest(name);

// default byte order is BIG_ENDIAN
final ByteBuffer bb = ByteBuffer.wrap(md5Bytes);

long msb = bb.getLong();
long lsb = bb.getLong();

msb &= 0xFFFFFFFFFFFF0FFFL;  /* clear version        */
msb |= 0x0000000000003000L;  /* set to version 3     */

lsb &= 0x3FFFFFFFFFFFFFFFL;  /* clear variant        */
lsb |= 0x8000000000000000L;  /* set to IETF variant  */

return new UUID(msb, lsb);
}
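The masking above can be sanity-checked against `UUID.version()` and `UUID.variant()`; a small sketch (version 4 shown, the version 3 case differs only in the OR'd constant):

```java
import java.util.UUID;

public class VersionVariantCheck {
    public static void main(String[] args) {
        long msb = -1L, lsb = -1L; // all bits set: worst-case random input
        msb &= 0xFFFFFFFFFFFF0FFFL; // clear the 4 version bits
        msb |= 0x0000000000004000L; // set version 4
        lsb &= 0x3FFFFFFFFFFFFFFFL; // clear the 2 variant bits
        lsb |= 0x8000000000000000L; // set IETF (Leach-Salz) variant
        UUID u = new UUID(msb, lsb);
        System.out.println(u.version() == 4 && u.variant() == 2); // true
    }
}
```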

Benchmark                    Mode  Cnt   Score   Error  Units
UUIDBenchmark.jdk_name       avgt    3  11.885 ± 4.025  ns/op
UUIDBenchmark.jdk_random     avgt    3  11.656 ± 0.987  ns/op
UUIDBenchmark.longs          avgt    3   7.618 ± 1.047  ns/op
UUIDBenchmark.longs_bb       avgt    3   7.755 ± 1.643  ns/op
UUIDBenchmark.longs_name     avgt    3   8.467 ± 1.784  ns/op
UUIDBenchmark.longs_name_bb  avgt    3   8.455 ± 1.662  ns/op
UUIDBenchmark.randomBytes    avgt    3   6.132 ± 0.447  ns/op


@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class UUIDBenchmark {

private static final VarHandle LONGS_ACCESS =
MethodHandles.byteArrayViewVarHandle(long[].class,
ByteOrder.nativeOrder());

private static final VarHandle BE_LONGS_ACCESS =
MethodHandles.byteArrayViewVarHandle(long[].class,
ByteOrder.BIG_ENDIAN);

@Benchmark
public byte[] randomBytes() {
final byte[] bytes = new byte[16];
randomBytes(bytes);
return bytes;
}

@Benchmark
public void jdk_random(Blackhole bh) {
final byte[] data = new byte[16];
randomBytes(data);
data[6]  &= 0x0f;  /* clear version*/
data[6]  |= 0x40;  /* set to version 4 */
data[8]  &= 0x3f;  /* clear variant*/
data[8]  |= 0x80;  /* set to IETF variant  */
long msb = 0;
long lsb = 0;
assert data.length == 16 : "data must be 16 bytes in length";
for (int i=0; i<8; i++)
msb = (msb << 8) | (data[i] & 0xff);
for (int i=8; i<16; i++)
lsb = (lsb << 8) | (data[i] & 0xff);
bh.consume(msb);
bh.consume(lsb);
}

@Benchmark
public void jdk_name(Blackhole bh)
{
final byte[] md5Bytes = new byte[16];
randomBytes(md5Bytes);
md5Bytes[6]  &= 0x0f;  /* clear version*/
md5Bytes[6]  |= 0x30;  /* set to version 3 */
md5Bytes[8]  &= 0x3f;  /* clear variant*/
md5Bytes[8]  |= 0x80;  /* set to IETF variant  */
long msb = 0;
long lsb = 0;
assert md5Bytes.length == 16 : "data must be 16 bytes in length";
for (int i=0; i<8; i++)
msb = (msb << 8) | (md5Bytes[i] & 0xff);
for (int i=8; i<16; i++)
lsb = (lsb << 8) | (md5Bytes[i] & 0xff);
bh.consume(msb);
bh.consume(lsb);
}

@Benchmark
public void longs(Blackhole bh) {
final byte[] data = new byte[16];
randomBytes(data);

long msb = (long) LONGS_ACCESS.get(data, 0);
long lsb = (long) LONGS_ACCESS.get(data, 8);

msb &= 0xFFFFFFFFFFFF0FFFL;
msb |= 0x0000000000004000L;

lsb &= 0x3FFFFFFFFFFFFFFFL;
lsb |= 0x8000000000000000L;

bh.consume(msb);
bh.consume(lsb);
}

@Benchmark
public void longs_name(Blackhole bh) {
final byte[] data = new byte[16];
randomBytes(data);

long msb =

Re: RFR: 8302863: Speed up String::encodeASCII using countPositives

2023-02-20 Thread Brett Okken
On Sat, 18 Feb 2023 23:26:08 GMT, Claes Redestad  wrote:

> When encoding Strings to US-ASCII we can speed up the happy path 
> significantly by using `StringCoding.countPositives` as a speculative check 
> for whether there are any chars that needs to be replaced by `'?'`. Once a 
> non-ASCII char is encountered we fall back to the slow loop and replace as 
> needed.
> 
> An alternative could be unrolling or using a byte array VarHandle, as 
> show-cased by Brett Okken here: 
> https://mail.openjdk.org/pipermail/core-libs-dev/2023-February/100573.html 
> Having to replace chars with `?` is essentially an encoding error so it might 
> be safe to assume this case is exceptional in practice.

For the happy path of no encoding failures, this "trivial" change is a great 
improvement.

-

Marked as reviewed by bok...@github.com (no known OpenJDK username).

PR: https://git.openjdk.org/jdk/pull/12640


Re: String encodeASCII

2023-02-18 Thread Brett Okken
Thanks for taking a look at this.
That is a pretty good improvement with a much smaller change.
I recognize the intricacies of bootstrapping, but have no expertise. Would
using Unsafe directly be easier? Particularly for shorter strings, doing
the copying and checking together rather than as separate loops seems to
speed things up considerably, even for the happy path of no failures.

Brett

On Sat, Feb 18, 2023 at 5:49 PM Claes Redestad 
wrote:

> Hi,
>
> general comment: String might be one of the trickier places to add a
> VarHandle dependency, since String is used very early in the bootstrap and
> depended upon by everything else, so I’d expect such a solution would have
> to navigate potential circularity issues carefully. It’d be good to
> experiment with changes to java.lang.String proper to see that the solution
> that works nice externally is or can be made feasible within String.
>
> Specifically on the performance opportunity then while US-ASCII encoding
> is probably on the way out we shouldn’t neglect it.
>
> One way to go about this without pulling VarHandles into String might be
> to use what other encode methods in String does and leverage
> StringCoding.countPositives:
>
> https://github.com/openjdk/jdk/pull/12640
>
> Testing this on the existing StringEncode microbenchmark, shows a
> promising speed-up when the input is ASCII-encodable:
>
> Baseline
> Benchmark                      (charsetName)  Mode  Cnt      Score     Error  Units
> StringEncode.encodeAsciiLong        US-ASCII  avgt    5  26626,025 ± 448,307  ns/op
> StringEncode.encodeAsciiShort       US-ASCII  avgt    5     33,336 ±   2,032  ns/op
>
> Patch:
> Benchmark                      (charsetName)  Mode  Cnt     Score   Error  Units
> StringEncode.encodeAsciiLong        US-ASCII  avgt    5  5492,985 ± 40,066  ns/op
> StringEncode.encodeAsciiShort       US-ASCII  avgt    5    28,545 ±  4,883  ns/op
>
> (You might see that this will go back into a scalar loop on encoding
> failures. That loop could still benefit from unrolling or
> byteArrayViewVarHandle, but I think you have a bigger problem in an app
> than raw performance if you have a lot of encoding failures...)
>
> WDYT?
>
> /Claes
>
> 18 feb. 2023 kl. 19:36 skrev Brett Okken :
>
>
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L976-L981
>
> For String.encodeASCII, with the LATIN1 coder is there any interest in
> exploring the performance impacts of utilizing a
> byteArrayViewVarHandle to read/write as longs and utilize a bitmask to
> identify if negative values are present?
>
> A simple jmh benchmark covering either 0 or 1 non-ascii (negative)
> values shows times cut in half (or more) for most scenarios with
> strings ranging in length from 3 - ~2000.
> VM version: JDK 17.0.6, OpenJDK 64-Bit Server VM, 17.0.6+10
> Windows 10 Intel(R) Core(TM) i7-9850H
>
> Hand unrolling the loops shows noted improvement, but does make for
> less aesthetically pleasing code.
>
>
> Benchmark                 (nonascii)  (size)  Mode  Cnt    Score    Error  Units
> AsciiEncodeBenchmark.jdk        none       3  avgt    4   15.531 ±  1.122  ns/op
> AsciiEncodeBenchmark.jdk        none      10  avgt    4   17.350 ±  0.473  ns/op
> AsciiEncodeBenchmark.jdk        none      16  avgt    4   18.277 ±  0.421  ns/op
> AsciiEncodeBenchmark.jdk        none      23  avgt    4   20.139 ±  0.350  ns/op
> AsciiEncodeBenchmark.jdk        none      33  avgt    4   22.008 ±  0.656  ns/op
> AsciiEncodeBenchmark.jdk        none      42  avgt    4   24.393 ±  1.155  ns/op
> AsciiEncodeBenchmark.jdk        none     201  avgt    4   55.884 ±  0.645  ns/op
> AsciiEncodeBenchmark.jdk        none     511  avgt    4  120.817 ±  2.917  ns/op
> AsciiEncodeBenchmark.jdk        none    2087  avgt    4  471.039 ± 13.329  ns/op
> AsciiEncodeBenchmark.jdk       first       3  avgt    4   15.794 ±  1.494  ns/op
> AsciiEncodeBenchmark.jdk       first      10  avgt    4   18.446 ±  0.780  ns/op
> AsciiEncodeBenchmark.jdk       first      16  avgt    4   20.458 ±  0.394  ns/op
> AsciiEncodeBenchmark.jdk       first      23  avgt    4   22.934 ±  0.422  ns/op
> AsciiEncodeBenchmark.jdk       first      33  avgt    4   25.367 ±  0.178  ns/op
> AsciiEncodeBenchmark.jdk       first      42  avgt    4   28.620 ±  0.678  ns/op
> AsciiEncodeBenchmark.jdk       first     201  avgt    4   80.250 ±  4.376

String encodeASCII

2023-02-18 Thread Brett Okken
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L976-L981

For String.encodeASCII, with the LATIN1 coder is there any interest in
exploring the performance impacts of utilizing a
byteArrayViewVarHandle to read/write as longs and utilize a bitmask to
identify if negative values are present?
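The bitmask idea described above, reading 8 bytes at a time and testing all the sign bits with a single AND, can be sketched as follows (class and method names are illustrative, not from String):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class AsciiMaskCheck {
    private static final VarHandle LONGS =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());
    // the high (sign) bit of each of the 8 bytes in a long
    private static final long HIGH_BITS = 0x8080808080808080L;

    // true if all 8 bytes starting at offset are ASCII (no negative bytes)
    static boolean allAscii(byte[] bytes, int offset) {
        return ((long) LONGS.get(bytes, offset) & HIGH_BITS) == 0;
    }

    public static void main(String[] args) {
        byte[] ascii = "abcdefgh".getBytes(StandardCharsets.US_ASCII);
        byte[] mixed = ascii.clone();
        mixed[3] = (byte) 0xE9; // a non-ASCII (negative) byte
        System.out.println(allAscii(ascii, 0) && !allAscii(mixed, 0)); // true
    }
}
```

The check is byte-order independent, so native order is fine; a real encode loop would still need a tail loop for the final `length % 8` bytes.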

A simple jmh benchmark covering either 0 or 1 non-ascii (negative)
values shows times cut in half (or more) for most scenarios with
strings ranging in length from 3 - ~2000.
VM version: JDK 17.0.6, OpenJDK 64-Bit Server VM, 17.0.6+10
Windows 10 Intel(R) Core(TM) i7-9850H

Hand unrolling the loops shows noted improvement, but does make for
less aesthetically pleasing code.


Benchmark                           (nonascii)  (size)  Mode  Cnt    Score     Error  Units
AsciiEncodeBenchmark.jdk                  none       3  avgt    4   15.531 ±   1.122  ns/op
AsciiEncodeBenchmark.jdk                  none      10  avgt    4   17.350 ±   0.473  ns/op
AsciiEncodeBenchmark.jdk                  none      16  avgt    4   18.277 ±   0.421  ns/op
AsciiEncodeBenchmark.jdk                  none      23  avgt    4   20.139 ±   0.350  ns/op
AsciiEncodeBenchmark.jdk                  none      33  avgt    4   22.008 ±   0.656  ns/op
AsciiEncodeBenchmark.jdk                  none      42  avgt    4   24.393 ±   1.155  ns/op
AsciiEncodeBenchmark.jdk                  none     201  avgt    4   55.884 ±   0.645  ns/op
AsciiEncodeBenchmark.jdk                  none     511  avgt    4  120.817 ±   2.917  ns/op
AsciiEncodeBenchmark.jdk                  none    2087  avgt    4  471.039 ±  13.329  ns/op
AsciiEncodeBenchmark.jdk                 first       3  avgt    4   15.794 ±   1.494  ns/op
AsciiEncodeBenchmark.jdk                 first      10  avgt    4   18.446 ±   0.780  ns/op
AsciiEncodeBenchmark.jdk                 first      16  avgt    4   20.458 ±   0.394  ns/op
AsciiEncodeBenchmark.jdk                 first      23  avgt    4   22.934 ±   0.422  ns/op
AsciiEncodeBenchmark.jdk                 first      33  avgt    4   25.367 ±   0.178  ns/op
AsciiEncodeBenchmark.jdk                 first      42  avgt    4   28.620 ±   0.678  ns/op
AsciiEncodeBenchmark.jdk                 first     201  avgt    4   80.250 ±   4.376  ns/op
AsciiEncodeBenchmark.jdk                 first     511  avgt    4  185.518 ±   6.370  ns/op
AsciiEncodeBenchmark.jdk                 first    2087  avgt    4  713.213 ±  13.488  ns/op
AsciiEncodeBenchmark.jdk                  last       3  avgt    4   14.991 ±   0.190  ns/op
AsciiEncodeBenchmark.jdk                  last      10  avgt    4   18.284 ±   0.317  ns/op
AsciiEncodeBenchmark.jdk                  last      16  avgt    4   20.591 ±   1.002  ns/op
AsciiEncodeBenchmark.jdk                  last      23  avgt    4   22.560 ±   0.963  ns/op
AsciiEncodeBenchmark.jdk                  last      33  avgt    4   25.521 ±   0.554  ns/op
AsciiEncodeBenchmark.jdk                  last      42  avgt    4   28.484 ±   0.446  ns/op
AsciiEncodeBenchmark.jdk                  last     201  avgt    4   79.434 ±   2.256  ns/op
AsciiEncodeBenchmark.jdk                  last     511  avgt    4  186.639 ±   4.258  ns/op
AsciiEncodeBenchmark.jdk                  last    2087  avgt    4  725.196 ± 149.416  ns/op
AsciiEncodeBenchmark.longCheckCopy        none       3  avgt    4    7.222 ±   0.428  ns/op
AsciiEncodeBenchmark.longCheckCopy        none      10  avgt    4    8.070 ±   0.171  ns/op
AsciiEncodeBenchmark.longCheckCopy        none      16  avgt    4    6.711 ±   0.409  ns/op
AsciiEncodeBenchmark.longCheckCopy        none      23  avgt    4   12.906 ±   3.633  ns/op
AsciiEncodeBenchmark.longCheckCopy        none      33  avgt    4   10.414 ±   0.447  ns/op
AsciiEncodeBenchmark.longCheckCopy        none      42  avgt    4   11.935 ±   1.235  ns/op
AsciiEncodeBenchmark.longCheckCopy        none     201  avgt    4   29.538 ±   3.265  ns/op
AsciiEncodeBenchmark.longCheckCopy        none     511  avgt    4  106.228 ±  68.475  ns/op
AsciiEncodeBenchmark.longCheckCopy        none    2087  avgt    4  494.845 ± 890.985  ns/op
AsciiEncodeBenchmark.longCheckCopy       first       3  avgt    4    7.775 ±   0.278  ns/op
AsciiEncodeBenchmark.longCheckCopy       first      10  avgt    4   13.396 ±   2.072  ns/op
AsciiEncodeBenchmark.longCheckCopy       first      16  avgt    4   13.528 ±   0.702  ns/op
AsciiEncodeBenchmark.longCheckCopy       first      23  avgt    4   17.376 ±   0.360  ns/op
AsciiEncodeBenchmark.longCheckCopy       first      33  avgt    4   16.251 ±   0.203  ns/op
AsciiEncodeBenchmark.longCheckCopy       first      42  avgt    4   17.932 ±   1.773  ns/op
AsciiEncodeBenchmark.longCheckCopy       first     201  avgt    4   39.028 ±   4.699  ns/op
AsciiEncodeBenchmark.longCheckCopy       first 5

Re: RFR: 8298033: Character.codePointAt(char[], int, int) doesn't do JavaDoc-specified check

2022-12-02 Thread Brett Okken
On Fri, 2 Dec 2022 12:44:18 GMT, Sergey Tsypanov  wrote:

> I found out that this code
> 
> public class Main {
> public static void main(String[] args) {
> String s = "Hello world!";
> char[] chars = s.toCharArray();
> int point = Character.codePointAt(chars, -1, 1);
> }
> }
> 
> throws `ArrayIndexOutOfBoundsException` instead of JavaDoc-specified 
> `IndexOutOfBoundsException`: 
> 
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 
> out of bounds for length 12
>   at java.base/java.lang.Character.codePointAtImpl(Character.java:9254)
>   at java.base/java.lang.Character.codePointAt(Character.java:9249)
>   at org.example.Main.main(Main.java:7)
> 
> and the method doesn't check whether `index` parameter is negative:
> 
> public static int codePointAt(char[] a, int index, int limit) {
> if (index >= limit || limit < 0 || limit > a.length) {
> throw new IndexOutOfBoundsException();
> }
> return codePointAtImpl(a, index, limit);
> }
> 
> I suggest to check the `index` parameter explicitly instead of relying on 
> AIOOBE thrown from accessing the array with negative index.

Since `ArrayIndexOutOfBoundsException` is a subclass of 
`IndexOutOfBoundsException`, it is not clear to me how this fails to match 
the javadoc/spec.
https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/ArrayIndexOutOfBoundsException.html
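A small demo of the point above (the class name CodePointDemo is made up for illustration): AIOOBE is a subtype of IOOBE, so a caller catching the documented `IndexOutOfBoundsException` also catches the AIOOBE raised for the negative index. The concrete subclass thrown here is as observed on JDK 17, before the change discussed in this PR.

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // ArrayIndexOutOfBoundsException extends IndexOutOfBoundsException.
        System.out.println(IndexOutOfBoundsException.class
                .isAssignableFrom(ArrayIndexOutOfBoundsException.class)); // prints "true"

        char[] chars = "Hello world!".toCharArray();
        try {
            Character.codePointAt(chars, -1, 1);
        } catch (IndexOutOfBoundsException e) {
            // Caught regardless of which concrete subclass was thrown.
            System.out.println("caught: " + e.getClass().getSimpleName());
        }
    }
}
```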

-

PR: https://git.openjdk.org/jdk/pull/11480