Re: RFR: 8345668: ZoneOffset.ofTotalSeconds performance regression [v3]
On Fri, 27 Dec 2024 23:24:46 GMT, Chen Liang wrote:

>>> For sure we should use result of `putIfAbsent`
>>
>> Drive-by comment...
>>
>> From what I can infer, the performance regression being addressed here is caused in part by the fact that (for example) `ConcurrentHashMap.computeIfAbsent()` provides an atomicity guarantee, which is a stronger requirement than is necessary here, and therefore by splitting that call up into two separate, non-atomic `get()` and `put()` calls we get (counter-intuitively) faster execution time, even though there are more lines of code. Note `putIfAbsent()` also guarantees atomicity, so the same problem of slowness caused by "unnecessary atomicity" might occur with it as well.
>
> Indeed, I just noticed that both `computeIfAbsent` and `putIfAbsent` may acquire the lock even when the key is present, while `get` never acquires a lock for read-only access.
>
> Maybe the implementation was written back when locking was less costly (with biased locking, etc.). Now we might have to reconsider locking until we know for sure a plain get fails.

This scenario is discussed in Effective Java by Joshua Bloch. His observation at the time (Java 5/6 time frame?) was that optimistically calling `get` first, and only calling `putIfAbsent` or `computeIfAbsent` if the `get` returned `null`, was 250% faster, because the `putIfAbsent`/`computeIfAbsent` calls contend on a lock. Changes have been made to those methods since then to try to avoid synchronization when the key is already present, but the observation above seems to confirm that the optimistic `get` call first is still faster (though by a much smaller margin).
My comment was not a suggestion to revert to the prior change of just calling `computeIfAbsent`, but rather to change the (expected rare) case where the first `get` returns `null`: replace the `putIfAbsent` and second `get` call with a single `computeIfAbsent` (or use the result of `putIfAbsent` to avoid the second call to `get`). - PR Review Comment: https://git.openjdk.org/jdk/pull/22854#discussion_r1899699668
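The shape being suggested — an optimistic, lock-free `get` followed by an atomic fallback only on the rare miss — can be sketched outside the JDK like this (the cache class and value factory below are hypothetical, for illustration only):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical cache illustrating the pattern under discussion: a plain get()
// never locks in ConcurrentHashMap, so try it first and only fall back to the
// atomic computeIfAbsent() when the key is genuinely absent.
class ExpensiveValueCache {
    private final ConcurrentMap<Integer, String> cache = new ConcurrentHashMap<>();

    String get(int key) {
        String v = cache.get(key);                     // lock-free fast path
        if (v == null) {
            // Rare slow path: pay for atomicity only when the value is missing.
            v = cache.computeIfAbsent(key, this::createValue);
        }
        return v;
    }

    private String createValue(int key) {              // stand-in for expensive work
        return "value-" + key;
    }
}
```

Compared with a `putIfAbsent` followed by a second `get`, the `computeIfAbsent` miss path also avoids constructing the value twice under contention.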
Re: RFR: 8345668: ZoneOffset.ofTotalSeconds performance regression [v3]
On Fri, 20 Dec 2024 21:06:16 GMT, Naoto Sato wrote:

>> The change made in [JDK-8288723](https://bugs.openjdk.org/browse/JDK-8288723) seems innocuous, but it caused this performance regression. Partially reverting the change (the parts that involve `computeIfAbsent()`) to the original. Provided a benchmark that iterates the call to `ZoneOffset.ofTotalSeconds(0)` 1,000 times, which improves the operation time from 3,946ns to 2,241ns.
>
> Naoto Sato has updated the pull request incrementally with one additional commit since the last revision:
>
> Fixed compile error

src/java.base/share/classes/java/time/format/DateTimeTextProvider.java line 316:

> 314: store = createStore(field, locale);
> 315: CACHE.putIfAbsent(key, store);
> 316: store = CACHE.get(key);

Should this be `store = CACHE.computeIfAbsent(key, e -> createStore(e.getKey(), e.getValue()));`? That still allows the optimistic/concurrent `get` call to succeed most of the time (when already cached) but reduces the interactions with the map when a value is created/set/accessed the first time. Alternatively, the result of `putIfAbsent` could be checked/used to avoid the second call to `get`. - PR Review Comment: https://git.openjdk.org/jdk/pull/22854#discussion_r1898012555
Re: Request for Comments: Adding bulk-read method "CharSequence.getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin)"
+1 to option A. Option C is too limiting on uses for the target. It could be added as a default which calls through to option A, if there is meaningful demand. Is a follow-up change/PR planned for the various places which would benefit from using the new method?

On Fri, Oct 25, 2024 at 1:34 PM Markus Karg wrote:

> I hereby request comments on the proposal to generalize the existing method "String.getChars()"'s signature to become a new default interface method "CharSequence.getChars()".
>
> Problem
>
> For performance reasons, many CharSequence implementations, in particular String, StringBuilder, StringBuffer and CharBuffer, provide a way to bulk-read a complete region of their character content into a provided char array.
> Unfortunately, there is no _uniform_ way to perform this, and it is not guaranteed that bulk-reading is implemented by _any_ CharSequence, in particular custom ones.
> While String, StringBuilder and StringBuffer all share the same getChars() method signature for this purpose, CharBuffer's way to perform the very same is the get() method.
> Other implementations have other method signatures, or do not have any solution to this problem at all.
> In particular, there is no method in their common interface, CharSequence, to perform such a bulk-optimized read, as CharSequence only allows reading one character after the next in a sequential way, either by iterating over charAt() from 0 to length(), or by consuming the chars() Stream.
>
> As a result, code that wants to read from a CharSequence in an implementation-agnostic, but still bulk-optimized way, needs to know _each_ possible implementation's specific method!
> Effectively this results in code like this (real-world example taken from the implementation of Reader.of(CharSequence) in JDK 24):
>
> switch (cs) {
>     case String s -> s.getChars(next, next + n, cbuf, off);
>     case StringBuilder sb -> sb.getChars(next, next + n, cbuf, off);
>     case StringBuffer sb -> sb.getChars(next, next + n, cbuf, off);
>     case CharBuffer cb -> cb.get(next, cbuf, off, n);
>     default -> {
>         for (int i = 0; i < n; i++)
>             cbuf[off + i] = cs.charAt(next + i);
>     }
> }
>
> The problem with this code is that it is bound and limited to exactly that given set of CharSequence implementations.
> If a future CharSequence implementation shall get accessed in a bulk-optimized way, the switch expression has to get extended and recompiled _every time_.
> If some custom CharSequence implementation is used that this code is not aware of, sequential read is applied, even if that implementation _does_ provide some bulk-read method!
>
> Solution
>
> There are several possible alternative solutions:
>
> * (A) CharSequence.getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) - As this signature is already supported by String, StringBuffer and StringBuilder, I hereby propose to add this signature to CharSequence and provide a default implementation that iterates over charAt(int) from 0 to length().
> * (B) Alternatively the same default method could be implemented using the chars() Stream - I assume that might run slower, but correct me if I am wrong.
> * (C) Alternatively we could go with the signature get(char[] dst, int offset, int length) - Only CharBuffer implements that already, so more changes are needed and more duplicate methods will exist in the end.
> * (D) Alternatively we could come up with a totally different signature - That would be most fair to all existing implementations, but in the end it will imply the most changes and the most duplicate methods.
> * (E) We could give up the idea and live with the situation as-is. - I assume only few people really prefer that outcome.
>
> Please tell me if I missed a viable option!
>
> As a side benefit of CharSequence.getChars(), its existence might trigger implementors to provide bulk-reading if not done yet, at least for those cases where it is actually feasible.
> In the same way it might trigger callers of Reader to start making use of bulk reading, at least in those cases where it does make sense but application authors were reluctant to implement the switch-case shown above.
>
> Hence, if nobody vetoes, I will file a Jira issue, PR and CSR for "CharSequence.getChars()" (alternative A) in the next days.
>
> -Markus
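A minimal sketch of what option (A)'s default implementation could look like — written here as a static helper so it compiles outside the JDK. The bounds-check style and the sequential `charAt` fallback follow the proposal's description; the class name is illustrative:

```java
// Sketch of option (A): the fallback a default CharSequence.getChars could use.
// Implementations like String or StringBuilder would override it with their
// existing bulk copies; this is only the generic sequential path.
final class CharSequences {
    private CharSequences() {}

    static void getChars(CharSequence cs, int srcBegin, int srcEnd, char[] dst, int dstBegin) {
        if (srcBegin < 0 || srcEnd > cs.length() || srcBegin > srcEnd) {
            throw new IndexOutOfBoundsException();
        }
        // Sequential fallback: copy one char at a time via charAt.
        for (int i = srcBegin; i < srcEnd; i++) {
            dst[dstBegin + i - srcBegin] = cs.charAt(i);
        }
    }
}
```

With such a default, the switch over concrete implementations shown above would collapse to a single interface call.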
Re: RFR: 8341566: Adding factory for non-synchronized CharSequence Reader [v3]
On Sun, 6 Oct 2024 17:44:53 GMT, Markus KARG wrote:

>> This Pull Request proposes an implementation for [JDK-8341566](https://bugs.openjdk.org/browse/JDK-8341566): Adding the new method `public static Reader Reader.of(CharSequence)` will return an anonymous, non-synchronized implementation of a `Reader` for each kind of `CharSequence` implementation. It is optimized for `String`, `StringBuilder`, `StringBuffer` and `CharBuffer`.
>>
>> In addition, this Pull Request proposes to replace the implementation of `StringReader` to become a simple synchronized wrapper around `Reader.of(CharSequence)` for the case of `String` sources. To ensure correctness, this PR...
>> * ...simply moved the **original code** of `StringReader` to become the de-facto implementation of `Reader.of()`, then stripped synchronized from it on the one hand, but kept just a synchronized wrapper on the other. Then added a `switch` for optimizations within the original code, at the exact location where previously just an optimization for `String` lived.
>> * ...added tests for all methods (`Of.java`), and applied that test upon the modified `StringReader`.
>>
>> Wherever new JavaDocs were added, existing phrases from other code locations have been copied and adapted, to best match the same wording.
>
> Markus KARG has updated the pull request incrementally with one additional commit since the last revision:
>
> fixup! Reader.of(String)
>
> Dropping non-public JavaDocs

src/java.base/share/classes/java/io/Reader.java line 203:

> 201: int n = Math.min(length - next, len);
> 202: switch (cs) {
> 203: case String s -> s.getChars(next, next + n, cbuf, off);

There was some discussion on the mailing list of introducing a method to CharSequence for bulk getChars. Doing that would help both here and in Appendable/Writer implementations like StringWriter, PrintWriter, and OutputStreamWriter which currently convert to a String in order to write.
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/Writer.java#L367 https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/OutputStreamWriter.java#L253 - PR Review Comment: https://git.openjdk.org/jdk/pull/21371#discussion_r1789217494
Re: RFR: 8339531: Improve performance of MemorySegment::mismatch [v9]
On Wed, 11 Sep 2024 13:15:57 GMT, Per Minborg wrote:

>> src/java.base/share/classes/jdk/internal/foreign/SegmentBulkOperations.java line 244:
>>
>>> 242: return (Architecture.isLittleEndian()
>>> 243: ? Long.numberOfTrailingZeros(x)
>>> 244: : Long.numberOfLeadingZeros(x)) / 8;
>>
>> Would there be value in having a constant LongToIntFunction lambda to capture this logic?
>>
>> private static final LongToIntFunction LONG_LEADING_ZEROS = Architecture.isLittleEndian() ? Long::numberOfTrailingZeros : Long::numberOfLeadingZeros;
>
> Performance-wise, I suspect there would not be that much of a difference. The compiler will see that Architecture.isLittleEndian provides a stable value and can optimize and eliminate code (I think). I am unsure which is easier to read.

Part of the value would be if it allowed removing this method entirely and moving the logic up to the caller:

if (s != d) {
    return start + offset + mismatch(s, d);
}

would become:

if (s != d) {
    return start + offset + (LONG_LEADING_ZEROS.applyAsInt(s ^ d) / 8);
}

- PR Review Comment: https://git.openjdk.org/jdk/pull/20848#discussion_r1758828018
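Outside the JDK the same idea can be sketched with `ByteOrder.nativeOrder()` standing in for the internal `Architecture.isLittleEndian()` check (class and method names here are illustrative, not the JDK's). Note that `LongToIntFunction` exposes `applyAsInt`, not `apply`:

```java
import java.nio.ByteOrder;
import java.util.function.LongToIntFunction;

class MismatchOffset {
    // Sketch of the suggested constant: pick the zero-count direction once,
    // based on native byte order, instead of branching on every call.
    private static final LongToIntFunction LONG_LEADING_ZEROS =
            ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN
                    ? Long::numberOfTrailingZeros
                    : Long::numberOfLeadingZeros;

    // Byte index of the first differing byte between two longs read from the
    // segments, assuming the caller has already established s != d.
    static int firstDifferingByte(long s, long d) {
        return LONG_LEADING_ZEROS.applyAsInt(s ^ d) / 8;
    }
}
```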
Re: RFR: 8339531: Improve performance of MemorySegment::mismatch [v9]
On Thu, 5 Sep 2024 17:47:16 GMT, Per Minborg wrote: >> This PR proposes to improve the performance of `MemorySegment::mismatch` by >> using Java code rather than transitioning to native code. > > Per Minborg has updated the pull request incrementally with one additional > commit since the last revision: > > Fix errors in a benchmark src/java.base/share/classes/jdk/internal/foreign/SegmentBulkOperations.java line 244: > 242: return (Architecture.isLittleEndian() > 243: ? Long.numberOfTrailingZeros(x) > 244: : Long.numberOfLeadingZeros(x)) / 8; Would there be value in having a constant LongToIntFunction lambda to capture this logic? private static final LongToIntFunction LEADING_ZEROS = Architecture.isLittleEndian() ? Long::numberOfTrailingZeros : Long::numberOfLeadingZeros; - PR Review Comment: https://git.openjdk.org/jdk/pull/20848#discussion_r1747281751
Re: RFR: 8336831: Optimize StringConcatHelper.simpleConcat [v5]
On Sat, 20 Jul 2024 13:11:33 GMT, Brett Okken wrote: >> Shaojin Wen has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Update src/java.base/share/classes/java/lang/String.java >> >>Co-authored-by: Chen Liang >> - add comments > > src/java.base/share/classes/java/lang/StringConcatHelper.java line 387: > >> 385: byte[] buf = newArray(newLength); >> 386: s1.getBytes(buf, 0, coder); >> 387: s2.getBytes(buf, s1.length(), coder); > > Does s1 length need to be shifted by coder for dstBegin? Nope - I see the index is char based https://github.com/openjdk/jdk/blob/69901157e4dae9018abd727956f60fd11b8fa252/src/java.base/share/classes/java/lang/String.java#L4821C15-L4821C23 - PR Review Comment: https://git.openjdk.org/jdk/pull/20253#discussion_r1685432480
Re: RFR: 8336831: Optimize StringConcatHelper.simpleConcat [v5]
On Fri, 19 Jul 2024 21:42:09 GMT, Shaojin Wen wrote: >> Currently simpleConcat is implemented using mix and prepend, but in this >> simple scenario, it can be implemented in a simpler way and can improve >> performance. > > Shaojin Wen has updated the pull request incrementally with two additional > commits since the last revision: > > - Update src/java.base/share/classes/java/lang/String.java > >Co-authored-by: Chen Liang > - add comments src/java.base/share/classes/java/lang/StringConcatHelper.java line 387: > 385: byte[] buf = newArray(newLength); > 386: s1.getBytes(buf, 0, coder); > 387: s2.getBytes(buf, s1.length(), coder); Does s1 length need to be shifted by coder for dstBegin? - PR Review Comment: https://git.openjdk.org/jdk/pull/20253#discussion_r1685431381
Re: RFR: 8336741: Optimize LocalTime.toString with StringBuilder.repeat
On Thu, 18 Jul 2024 11:32:47 GMT, Shaojin Wen wrote: > class LocalTime { > public String toString() { > // ... > if (nanoValue % 1000_000 == 0) { > buf.append(Integer.toString((nanoValue / 1000_000) + > 1000).substring(1)); > } else if (nanoValue % 1000 == 0) { > buf.append(Integer.toString((nanoValue / 1000) + > 1000_000).substring(1)); > } else { > buf.append(Integer.toString((nanoValue) + > 1000_000_000).substring(1)); > } >// ... > } > } > > Currently, LocalTime.toString handles nanos by adding a value and then > subString(1) to fill it with zeros. Using StringBuilder.repeat is more > concise and has better performance. src/java.base/share/classes/java/time/LocalTime.java line 1653: > 1651: int digits; > 1652: if (nanoValue % 1000_000 == 0) { > 1653: digits = nanoValue / 1000_000; should it be `1_000_000` - breaking on every 3 digits? - PR Review Comment: https://git.openjdk.org/jdk/pull/20232#discussion_r1683016529
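To make the comparison concrete, here is a hedged, self-contained sketch of both padding styles for the fractional seconds: the add-then-`substring(1)` shape the current code uses, and a padding-based shape (using `String.repeat` here for brevity; the PR itself targets `StringBuilder.repeat`). Class and method names are illustrative:

```java
// Two ways to render a nano-of-second fraction with leading zeros.
class NanoFormat {
    // Current shape: add a power of ten so the int has a fixed digit count,
    // then drop the leading "1" with substring(1).
    static String viaSubstring(int nanoValue) {
        StringBuilder buf = new StringBuilder(".");
        if (nanoValue % 1_000_000 == 0) {
            buf.append(Integer.toString((nanoValue / 1_000_000) + 1000).substring(1));
        } else if (nanoValue % 1000 == 0) {
            buf.append(Integer.toString((nanoValue / 1000) + 1_000_000).substring(1));
        } else {
            buf.append(Integer.toString(nanoValue + 1_000_000_000).substring(1));
        }
        return buf.toString();
    }

    // Proposed shape, simplified: compute the digit count, then pad directly.
    static String viaRepeat(int nanoValue) {
        int digits, value;
        if (nanoValue % 1_000_000 == 0) {
            digits = 3; value = nanoValue / 1_000_000;
        } else if (nanoValue % 1000 == 0) {
            digits = 6; value = nanoValue / 1000;
        } else {
            digits = 9; value = nanoValue;
        }
        String s = Integer.toString(value);
        return "." + "0".repeat(digits - s.length()) + s;
    }
}
```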
Re: RFR: 8333833: UUID toString removes the use of ByteArrayLittleEndian [v2]
On Sat, 8 Jun 2024 06:40:39 GMT, Shaojin Wen wrote:

>> After PR #16245, C2 optimizes stores into primitive arrays by combining values into larger stores. In the UUID.toString method, ByteArrayLittleEndian can be removed, making the code more elegant and faster.
>
> Shaojin Wen has updated the pull request incrementally with one additional commit since the last revision:
>
> add comments

src/java.base/share/classes/jdk/internal/util/HexDigits.java line 122:

> 120: * @param value to convert
> 121: */
> 122: public static void putHex(byte[] buffer, int off, int i) {

Should there be 2 methods - for 2 and 4 bytes respectively? Does C2 optimize 8-byte writes as well? - PR Review Comment: https://git.openjdk.org/jdk/pull/19610#discussion_r1632068305
Re: RFR: 8333833: UUID toString removes the use of ByteArrayLittleEndian [v2]
On Sat, 8 Jun 2024 06:40:39 GMT, Shaojin Wen wrote: >> After PR #16245, C2 optimizes stores into primitive arrays by combining >> values into larger stores. In the UUID.toString method, >> ByteArrayLittleEndian can be removed, making the code more elegant and >> faster. > > Shaojin Wen has updated the pull request incrementally with one additional > commit since the last revision: > > add comments src/java.base/share/classes/jdk/internal/util/HexDigits.java line 117: > 115: > 116: /** > 117: * Insert the int into the buffer as 4 hexadecimal digits 4 hex digits is 2 bytes, right? should this take a short instead? At a minimum, seems documentation should be clear as to which 2 bytes are used. - PR Review Comment: https://git.openjdk.org/jdk/pull/19610#discussion_r1632064197
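For reference, writing exactly four hex digits — i.e. the low 16 bits (2 bytes) of the argument, with higher bits ignored — can be sketched like this (a hypothetical helper, not the JDK's table-driven `HexDigits` implementation):

```java
class HexSketch {
    // Emits the low 16 bits of value as 4 lowercase hex digits, most
    // significant nibble first. Bits above 16 are simply never consulted.
    static void putHex4(byte[] buffer, int off, int value) {
        for (int i = 3; i >= 0; i--) {
            buffer[off + i] = (byte) Character.forDigit(value & 0xF, 16);
            value >>>= 4;
        }
    }
}
```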
Re: large longs to string
Claes, > IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 > int divs Exactly. I think there is also some benefit from unrolling the 4 int digit pair operations. > which might not be true on all platforms, nor stay true in the future Agreed, but I am not sure how to reason about that. > Long values will in practice likely be biased towards lower values, so it’s > important that any optimization to .. longer values doesn’t regress inputs in > the int range. Since it’s all guarded by a test that is already there there > shouldn’t be much room for a difference, but adding code can cause > interesting issues so it’s always worth measuring to make sure. Have you run > any benchmark for inputs smaller than the threshold? And for a healthy mix of > values? I had been focused on "longer" values, as I have uses which are effectively guaranteed to have some bits from 53-63 set. I added some tests for "int" values as well as a mix with 2 "longs" and 3 "ints". This showed a pretty small regression for the int and mixed case. That regression went away by changing back to a while loop comparing to Integer.MIN_VALUE, and that did not give up much of the gain. 
private static final long[] ints = new long[] {Integer.MAX_VALUE, 12345, -5432, 654321, 5};
private static final long[] longs = new long[] {235789987235L, 235789987235326L, -98975433216803632L, Long.MAX_VALUE};
private static final long[] mixed = new long[] {5, Long.MIN_VALUE, 654321, 9876543210L, 543};

Benchmark                         (type)  Mode  Cnt   Score     Error  Units
LongToStringBenchmark.baseline       int  avgt    3  24.105 ± 11.751  ns/op
LongToStringBenchmark.baseline      long  avgt    3  51.223 ± 21.069  ns/op
LongToStringBenchmark.baseline       mix  avgt    3  34.849 ±  7.270  ns/op
LongToStringBenchmark.change         int  avgt    3  25.846 ±  2.012  ns/op
LongToStringBenchmark.change        long  avgt    3  43.053 ± 13.886  ns/op
LongToStringBenchmark.change         mix  avgt    3  35.826 ±  2.901  ns/op
LongToStringBenchmark.changeLoop     int  avgt    3  24.261 ±  7.325  ns/op
LongToStringBenchmark.changeLoop    long  avgt    3  44.254 ± 22.921  ns/op
LongToStringBenchmark.changeLoop     mix  avgt    3  29.501 ±  8.693  ns/op

"change" is what I had proposed originally and "changeLoop" is:

while (i <= Integer.MIN_VALUE) {
    long q = i / 100_000_000;
    charPos -= 8;
    write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
    i = q;
}

Brett

On Wed, Apr 24, 2024 at 2:39 PM Claes Redestad wrote:
>
> Hi,
>
> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 int divs, which might not be true on all platforms, nor stay true in the future. Long values will in practice likely be biased towards lower values, so it’s important that any optimization to .. longer values doesn’t regress inputs in the int range. Since it’s all guarded by a test that is already there there shouldn’t be much room for a difference, but adding code can cause interesting issues so it’s always worth measuring to make sure. Have you run any benchmark for inputs smaller than the threshold? And for a healthy mix of values?
>
> Thanks!
> /Claes
>
> 24 apr. 2024 kl.
21:08 skrev Brett Okken : > > Is there interest in optimizing StringLatin1.getChars(long, int, byte[]) for > large (larger than int) long values[1]? > We can change this to work with 8 digits at a time, which reduces the amount > of 64 bit arithmetic required. > > if (i <= -1_000_000_000) { > long q = i / 100_000_000; > charPos -= 8; > write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i)); > i = q; > if (i <= -1_000_000_000) { > q = i / 100_000_000; > charPos -= 8; > write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i)); > i = q; > } > } > > > A simple implementation of write4DigitPairs would just call the existing > writeDigitPair method 4 times: > > > private static void write4DigitPairs(byte[] buf, int idx, int value) { > int v = value; > int v2 = v / 100; > writeDigitPair(buf, idx + 6, v - (v2 * 100)); > v = v2; > > v2 = v / 100; > writeDigitPair(buf, idx + 4, v - (v2 * 100)); > v = v2; > > v2 = v / 100; > writeDigitPair(buf, idx + 2, v - (v2 * 100)); > v = v2; > > v2 = v / 100; > writeDigitPair(buf, idx, v - (v2 * 100)); > } > > There is the option to OR the 4 short values together into a long and > leverage a ByteArrayLittleEndian.setLong call, but I see that the previous > usage of ByteArrayLittleEndian.setShort was removed[2]. > > A small benchmark of longs which would qualify shows up to 20% improvement. > > Presumably a similar change could make sense for StringUTF16, but I have not > spent any time benchmarking it. > > Brett > > [1] - > https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168 > [2] - > https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9 > >
large longs to string
Is there interest in optimizing StringLatin1.getChars(long, int, byte[]) for large (larger than int) long values[1]? We can change this to work with 8 digits at a time, which reduces the amount of 64-bit arithmetic required.

if (i <= -1_000_000_000) {
    long q = i / 100_000_000;
    charPos -= 8;
    write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
    i = q;
    if (i <= -1_000_000_000) {
        q = i / 100_000_000;
        charPos -= 8;
        write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
        i = q;
    }
}

A simple implementation of write4DigitPairs would just call the existing writeDigitPair method 4 times:

private static void write4DigitPairs(byte[] buf, int idx, int value) {
    int v = value;
    int v2 = v / 100;
    writeDigitPair(buf, idx + 6, v - (v2 * 100));
    v = v2;

    v2 = v / 100;
    writeDigitPair(buf, idx + 4, v - (v2 * 100));
    v = v2;

    v2 = v / 100;
    writeDigitPair(buf, idx + 2, v - (v2 * 100));
    v = v2;

    v2 = v / 100;
    writeDigitPair(buf, idx, v - (v2 * 100));
}

There is the option to OR the 4 short values together into a long and leverage a ByteArrayLittleEndian.setLong call, but I see that the previous usage of ByteArrayLittleEndian.setShort was removed[2].

A small benchmark of longs which would qualify shows up to 20% improvement.

Presumably a similar change could make sense for StringUTF16, but I have not spent any time benchmarking it.

Brett

[1] - https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168
[2] - https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9
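The helper pair discussed in this thread can be made self-contained for experimentation. `writeDigitPair` below is a simple stand-in for the JDK's table-based two-digit store; `write4DigitPairs` is the proposed method, which emits 8 digits using only int arithmetic:

```java
class DigitPairs {
    // Stand-in for the JDK's writeDigitPair: stores two ASCII digits (00-99).
    static void writeDigitPair(byte[] buf, int idx, int value) {
        buf[idx]     = (byte) ('0' + value / 10);
        buf[idx + 1] = (byte) ('0' + value % 10);
    }

    // Proposed helper: writes the 8 decimal digits of value (0..99_999_999)
    // as four digit pairs, least significant pair at idx + 6.
    static void write4DigitPairs(byte[] buf, int idx, int value) {
        int v = value;
        int v2 = v / 100;
        writeDigitPair(buf, idx + 6, v - (v2 * 100));
        v = v2;
        v2 = v / 100;
        writeDigitPair(buf, idx + 4, v - (v2 * 100));
        v = v2;
        v2 = v / 100;
        writeDigitPair(buf, idx + 2, v - (v2 * 100));
        v = v2;
        v2 = v / 100;
        writeDigitPair(buf, idx, v - (v2 * 100));
    }
}
```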
Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v2]
On Fri, 12 Apr 2024 10:19:27 GMT, Claes Redestad wrote:

>> src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1430:
>>
>>> 1428: cb.new_(STRING_BUILDER);
>>> 1429: cb.dup();
>>> 1430: cb.invokespecial(STRING_BUILDER, "<init>", MethodTypeDesc.ofDescriptor("()V"));
>>
>> Would there be value in initializing to a larger capacity? Given the number of append calls, seems the default cap of 16 is unlikely to be sufficient.
>
> Possibly. I tried a few simple variants that initialized the `StringBuilder` capacity at various guesses, such as sum of constant sizes + some factor based on args, but across a diverse set of micros that gives both some wins and some regressions. Getting the estimation just right is pretty tricky, especially when size of the arguments are arbitrary (like for any String/Object arg).

What are the scenarios which had regressions? Given the conservative growth for StringBuilder, it surprises me a bit that any scenario would regress. - PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1564165946
Re: RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases
On Tue, 9 Apr 2024 12:01:49 GMT, Claes Redestad wrote: > This patch suggests a workaround to an issue with huge SCF MH expression > trees taking excessive JIT compilation resources by reviving (part of) the > simple bytecode-generating strategy that was originally available as an > all-or-nothing strategy choice. > > Instead of reintroducing a binary strategy choice I propose a threshold > parameter, controlled by > `-Djava.lang.invoke.StringConcat.highArityThreshold=`: For expressions > below or at this threshold there's no change, for expressions with an arity > above it we use the `StringBuilder`-chain bytecode generator. > > There are a few trade-offs at play here which influence the choice of > threshold. The simple high arity strategy will for example not see any reuse > of LambdaForms but strictly always generate a class per indy callsite, which > means we might end up with a higher total number of classes generated and > loaded in applications if we set this value too low. It may also produce > worse performance on average. On the other hand there is the observed > increase in C2 resource usage as expressions grow unwieldy. On the other > other hand high arity expressions are likely rare to begin with, with less > opportunities for sharing than the more common low-arity expressions. > > I turned the submitted test case into a few JMH benchmarks and did some > experiments with `-XX:CompileCommand=MemStat,StringConcat::concat,print`: > > Baseline strategy: > 13 args: 6.3M > 23 args: 18M > 123 args: 868M > > `-Djava.lang.invoke.StringConcat.highArityThreshold=0`: > 13 args: 2.11M > 23 args: 3.67M > 123 args: 4.75M > > For 123 args the memory overhead of the baseline strategy is 180x, but for 23 > args we're down to a 5x memory overhead, and down to a 3x overhead for 13 > args. Since the absolute overhead for 23 is borderline acceptable (+15Mb) > I've conservatively chosen a threshold at arity 20. 
> This keeps C2 resource pressure at a reasonable level (< 18M) while avoiding perturbing performance at the vast majority of call sites.
>
> I was asked to use the new class file API for mainline. There's a version of this patch implemented using ASM in 7c52a9f which might be a reasonable basis for a backport.

src/java.base/share/classes/java/lang/invoke/StringConcatFactory.java line 1430:

> 1428: cb.new_(STRING_BUILDER);
> 1429: cb.dup();
> 1430: cb.invokespecial(STRING_BUILDER, "<init>", MethodTypeDesc.ofDescriptor("()V"));

Would there be value in initializing to a larger capacity? Given the number of append calls, seems the default cap of 16 is unlikely to be sufficient. - PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1561966067
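For intuition, presizing would look roughly like the hand-written analogue below — not the generated bytecode, and the 16-bytes-per-argument guess is exactly the kind of estimate the thread says is hard to get right (names are illustrative):

```java
// Hand-written analogue of a presized StringBuilder chain for a 4-part concat.
class ConcatSketch {
    static String concat(String prefix, Object a, Object b, String suffix) {
        // Known constant lengths plus a pure guess of 16 chars per dynamic arg.
        int estimated = prefix.length() + suffix.length() + 2 * 16;
        return new StringBuilder(estimated)
                .append(prefix).append(a).append(b).append(suffix)
                .toString();
    }
}
```

When the guess is too small the builder grows (a copy); when it is too large the final `toString` copies a partially empty buffer — which is why a bad estimate can regress either way.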
Integrated: 8321470: ThreadLocal.nextHashCode can be static final
On Wed, 6 Dec 2023 00:52:48 GMT, Brett Okken wrote: > The static AtomicInteger used for the nextHashCode should be final. This pull request has now been integrated. Changeset: c42535f1 Author: Brett Okken Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/c42535f1110d60d1472080ad4fcadb8acbeb4c4b Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8321470: ThreadLocal.nextHashCode can be static final Reviewed-by: shade, jpai - PR: https://git.openjdk.org/jdk/pull/16987
Re: RFR: 8321470: ThreadLocal.nextHashCode can be static final [v2]
> The static AtomicInteger used for the nextHashCode should be final. Brett Okken has updated the pull request incrementally with one additional commit since the last revision: Update full name - Changes: - all: https://git.openjdk.org/jdk/pull/16987/files - new: https://git.openjdk.org/jdk/pull/16987/files/9e276138..d05272a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16987&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16987&range=00-01 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16987.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16987/head:pull/16987 PR: https://git.openjdk.org/jdk/pull/16987
RFR: 8321470: ThreadLocal.nextHashCode can be static final
The static AtomicInteger used for the nextHashCode should be final. - Commit messages: - Merge remote-tracking branch 'upstream/master' into threadlocal_final - make ThreadLocal.nextHashCode final Changes: https://git.openjdk.org/jdk/pull/16987/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16987&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321470 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/16987.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16987/head:pull/16987 PR: https://git.openjdk.org/jdk/pull/16987
ThreadLocal nextHashCode
Should the private static AtomicInteger nextHashCode also be final? https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ThreadLocal.java#L100
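The field in question is shared, mutated only through `getAndAdd`, and never reassigned, so `final` fits. A self-contained sketch of the same shape (the hash increment constant is the one `ThreadLocal` itself uses; the class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

class HashCodeSource {
    // Never reassigned, so it can (and should) be static final.
    private static final AtomicInteger nextHashCode = new AtomicInteger();

    // Golden-ratio-derived increment, as used by java.lang.ThreadLocal.
    private static final int HASH_INCREMENT = 0x61c88647;

    static int nextHashCode() {
        return nextHashCode.getAndAdd(HASH_INCREMENT);
    }
}
```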
Re: RFR: 8321059: Unneeded array assignments in MergeCollation and CompactByteArray
On Thu, 30 Nov 2023 20:46:12 GMT, Naoto Sato wrote: > Removing the unnecessary array assignments. src/java.base/share/classes/sun/text/CompactByteArray.java line 83: > 81: for (i = 0; i < UNICODECOUNT; ++i) { > 82: values[i] = defaultValue; > 83: } should this be `Arrays.fill(values, defaultValue);` ? https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/Arrays.html#fill(byte%5B%5D,byte) - PR Review Comment: https://git.openjdk.org/jdk/pull/16912#discussion_r1411279673
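The suggested replacement shape, shown here as a hypothetical helper (the real code fills a pre-existing `values` array in CompactByteArray's constructor):

```java
import java.util.Arrays;

class FillDemo {
    // Arrays.fill expresses the explicit for-loop's intent in one call.
    static byte[] defaults(int count, byte defaultValue) {
        byte[] values = new byte[count];
        Arrays.fill(values, defaultValue);
        return values;
    }
}
```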
Re: RFR: 8309622: Re-examine the cache mechanism in BaseLocale [v3]
On Fri, 16 Jun 2023 17:18:38 GMT, Naoto Sato wrote: >> src/java.base/share/classes/sun/util/locale/BaseLocale.java line 175: >> >>> 173: >>> LocaleUtils.toLowerString(b.getLanguage()).intern(), >>> 174: >>> LocaleUtils.toTitleString(b.getScript()).intern(), >>> 175: >>> LocaleUtils.toUpperString(b.getRegion()).intern(), >> >> don't lines 147 and 148 above already guarantee the language and region have >> correct case? > > 147,148 are in `getInstance()` and the BaseLocale key here is created with a > constructor, which simply copies the passed arguments. So normalization is > needed here perhaps I am missing something, but aren't `language` and `region` being reassigned to the case adjusted values? So the values used to create the "key" BaseLocale instance reflect that case adjustment? (language is again potentially adjusted at line 162 to account for old iso codes, but that never results in the wrong case.) language = LocaleUtils.toLowerString(language); region = LocaleUtils.toUpperString(region); - PR Review Comment: https://git.openjdk.org/jdk/pull/14404#discussion_r1232880865
Re: RFR: 8309622: Re-examine the cache mechanism in BaseLocale [v3]
On Fri, 16 Jun 2023 17:18:33 GMT, Naoto Sato wrote: >> src/java.base/share/classes/sun/util/locale/BaseLocale.java line 168: >> >>> 166: // BaseLocale as the key. The returned "normalized" instance >>> 167: // can subsequently be used by the Locale instance which >>> 168: // guarantees the locale components are properly >>> cased/interned. >> >> Is it truly important that the String values are interned, or is the desire >> simply that all refer to the same "canonical" String instance? >> my thought is that managing the canonical String instances could avoid >> WeakReferences and synchronized map access and just return a new BaseLocale >> each time. >> >> >> >> >> public static BaseLocale getInstance(String language, String script, >> String region, String variant) { >> if (script == null || script.isEmpty()) { >> script = ""; >> } else { >> script = getCanonicalString(script, LocaleUtils::toTitleString); >> } >> >> if (region == null || region.isEmpty()) { >> region = ""; >> } else { >> region = getCanonicalString(region, LocaleUtils::toUpperString); >> } >> >> if (language == null || language.isEmpty()) { >> language = ""; >> } else { >> language = getCanonicalString(language, >> LocaleUtils::toLowerString); >> } >> >> if (variant == null || variant.isEmpty()) { >> variant = ""; >> } else { >> variant = getCanonicalString(variant, Function.identity()); >> } >> ... 
>> return new BaseLocale(language, script, region, variant);
>> }
>>
>> private static final ConcurrentMap<String, String> CANONICALIZED_STRINGS = new ConcurrentHashMap<>();
>>
>> static {
>>     // prime the old iso codes
>>     CANONICALIZED_STRINGS.put("he", "he");
>>     CANONICALIZED_STRINGS.put("iw", "iw");
>>     CANONICALIZED_STRINGS.put("id", "id");
>>     CANONICALIZED_STRINGS.put("in", "in");
>>     CANONICALIZED_STRINGS.put("ji", "ji");
>>     CANONICALIZED_STRINGS.put("yi", "yi");
>> }
>>
>> private static String getCanonicalString(String value, Function<String, String> conversion) {
>>     String existing = CANONICALIZED_STRINGS.get(value);
>>     if (existing != null) {
>>         return existing;
>>     }
>>
>>     String converted = conversion.apply(value);
>>     return CANONICALIZED_STRINGS.merge(converted, converted, (k, v) -> v);...
>
> Interning components in a Locale instance (through BaseLocale) is a long standing behavior. Changing it could break apps that might have relied on it (using `==` for comparison) as in doing `"en" == locale.getLanguage()`?

In that case, the interned values could still be used as the canonical values, and the CHM can account both for case adjustments and interning (and likely faster than actually calling intern).
I am curious how this does with your jmh test:

private static final ConcurrentMap<String, String> CANONICALIZED_UPPER_STRINGS = new ConcurrentHashMap<>();
private static final ConcurrentMap<String, String> CANONICALIZED_LOWER_STRINGS = new ConcurrentHashMap<>();
private static final ConcurrentMap<String, String> CANONICALIZED_TITLE_STRINGS = new ConcurrentHashMap<>();
private static final ConcurrentMap<String, String> CANONICALIZED_STRINGS = new ConcurrentHashMap<>();

public static BaseLocale getInstance(String language, String script, String region, String variant) {
    if (script == null || script.isEmpty()) {
        script = "";
    } else {
        script = getCanonicalString(script, LocaleUtils::toTitleString, CANONICALIZED_TITLE_STRINGS);
    }

    if (region == null || region.isEmpty()) {
        region = "";
    } else {
        region = getCanonicalString(region, LocaleUtils::toUpperString, CANONICALIZED_UPPER_STRINGS);
    }

    if (language == null || language.isEmpty()) {
        language = "";
    } else {
        language = getCanonicalString(language, LocaleUtils::toLowerString, CANONICALIZED_LOWER_STRINGS);
    }

    if (variant == null || variant.isEmpty()) {
        variant = "";
    } else {
        variant = getCanonicalString(variant, Function.identity(), CANONICALIZED_STRINGS);
    }
    ...
    return new BaseLocale(language, script, region, variant);
}

private static String getCanonicalString(String value, Function<String, String> conversion, ConcurrentMap<String, String> map) {
    String existing = map.get(value);
    if (existing != null) {
        return existing;
    }

    String converted = conversion.apply(value).intern();
    // put the interned value in, keyed by the interned value
    map.putIfAbsent(converted, converted);
    // also put the interned value in for
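The pattern under discussion (an optimistic `get` first, falling back to `putIfAbsent` only on a miss, with interned values serving as the canonical instances) can be sketched as a small self-contained class. The names `CanonicalStrings`/`canonicalLower` are hypothetical, and `toLowerCase(Locale.ROOT)` stands in for the JDK-internal `LocaleUtils.toLowerString`:

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch of the canonical-string cache idea; not BaseLocale itself.
final class CanonicalStrings {
    private static final ConcurrentMap<String, String> LOWER = new ConcurrentHashMap<>();

    static String canonicalLower(String value) {
        // Optimistic read: the common (already-cached) case never takes a write path.
        String existing = LOWER.get(value);
        if (existing != null) {
            return existing;
        }
        String converted = value.toLowerCase(Locale.ROOT).intern();
        // Key the interned value by itself...
        LOWER.putIfAbsent(converted, converted);
        // ...and by the original spelling, so the next lookup of "EN" also hits.
        String prev = LOWER.putIfAbsent(value, converted);
        return prev != null ? prev : converted;
    }
}
```

Because the stored values are interned, callers that rely on `==` comparisons (as mentioned below for `locale.getLanguage()`) still see a single canonical instance per spelling.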
Re: RFR: 8309622: Re-examine the cache mechanism in BaseLocale [v3]
On Tue, 13 Jun 2023 17:56:57 GMT, Naoto Sato wrote: >> This is stemming from the PR: https://github.com/openjdk/jdk/pull/14211 >> where aggressive GC can cause NPE in `BaseLocale$Key` class. I refactored >> the in-house cache with WeakHashMap, and removed the Key class as it is no >> longer needed (thus the original NPE will no longer be possible). Also with >> the new JMH test case, it gains some performance improvement: >> >> (w/o fix) >> >> Benchmark Mode Cnt Score Error Units >> LocaleCache.testForLanguageTag avgt 20 5781.275 ± 569.580 ns/op >> LocaleCache.testLocaleOf avgt 20 62564.079 ± 406.697 ns/op >> >> (w/ fix) >> Benchmark Mode Cnt Score Error Units >> LocaleCache.testForLanguageTag avgt 20 4801.175 ± 371.830 ns/op >> LocaleCache.testLocaleOf avgt 20 60394.652 ± 352.471 ns/op > > Naoto Sato has updated the pull request incrementally with one additional > commit since the last revision: > > Addressing review comments src/java.base/share/classes/sun/util/locale/BaseLocale.java line 168: > 166: // BaseLocale as the key. The returned "normalized" instance > 167: // can subsequently be used by the Locale instance which > 168: // guarantees the locale components are properly cased/interned. Is it truly important that the String values are interned, or is the desire simply that all refer to the same "canonical" String instance? My thought is that managing the canonical String instances could avoid WeakReferences and synchronized map access and just return a new BaseLocale each time. 
public static BaseLocale getInstance(String language, String script, String region, String variant) {
    if (script == null) {
        script = "";
    } else {
        script = getCanonicalString(script, LocaleUtils::toTitleString);
    }

    if (region == null) {
        region = "";
    } else {
        region = getCanonicalString(region, LocaleUtils::toUpperString);
    }

    if (language == null) {
        language = "";
    } else {
        language = getCanonicalString(language, LocaleUtils::toLowerString);
    }

    if (variant == null) {
        variant = "";
    } else {
        variant = getCanonicalString(variant, Function.identity());
    }
    ...
    return new BaseLocale(language, script, region, variant);
}

private static final ConcurrentMap<String, String> CANONICALIZED_STRINGS = new ConcurrentHashMap<>();

static {
    // prime the old iso codes
    CANONICALIZED_STRINGS.put("he", "he");
    CANONICALIZED_STRINGS.put("iw", "iw");
    CANONICALIZED_STRINGS.put("id", "id");
    CANONICALIZED_STRINGS.put("in", "in");
    CANONICALIZED_STRINGS.put("ji", "ji");
    CANONICALIZED_STRINGS.put("yi", "yi");
}

private static String getCanonicalString(String value, Function<String, String> conversion) {
    String existing = CANONICALIZED_STRINGS.get(value);
    if (existing != null) {
        return existing;
    }

    String converted = conversion.apply(value);
    return CANONICALIZED_STRINGS.merge(converted, converted, (k, v) -> v);
}

- PR Review Comment: https://git.openjdk.org/jdk/pull/14404#discussion_r1232205527
Re: RFR: 8309622: Re-examine the cache mechanism in BaseLocale [v3]
On Tue, 13 Jun 2023 17:56:57 GMT, Naoto Sato wrote: >> This is stemming from the PR: https://github.com/openjdk/jdk/pull/14211 >> where aggressive GC can cause NPE in `BaseLocale$Key` class. I refactored >> the in-house cache with WeakHashMap, and removed the Key class as it is no >> longer needed (thus the original NPE will no longer be possible). Also with >> the new JMH test case, it gains some performance improvement: >> >> (w/o fix) >> >> Benchmark Mode Cnt Score Error Units >> LocaleCache.testForLanguageTag avgt 20 5781.275 ± 569.580 ns/op >> LocaleCache.testLocaleOf avgt 20 62564.079 ± 406.697 ns/op >> >> (w/ fix) >> Benchmark Mode Cnt Score Error Units >> LocaleCache.testForLanguageTag avgt 20 4801.175 ± 371.830 ns/op >> LocaleCache.testLocaleOf avgt 20 60394.652 ± 352.471 ns/op > > Naoto Sato has updated the pull request incrementally with one additional > commit since the last revision: > > Addressing review comments src/java.base/share/classes/sun/util/locale/BaseLocale.java line 175: > 173: LocaleUtils.toLowerString(b.getLanguage()).intern(), > 174: LocaleUtils.toTitleString(b.getScript()).intern(), > 175: LocaleUtils.toUpperString(b.getRegion()).intern(), Don't lines 147 and 148 above already guarantee the language and region have correct case? - PR Review Comment: https://git.openjdk.org/jdk/pull/14404#discussion_r1232167067
Re: RFR: 8309665: Simplify Arrays.copyOf/-Range methods
On Thu, 8 Jun 2023 15:08:28 GMT, Claes Redestad wrote: > https://bugs.openjdk.org/browse/JDK-8301958 and later changes improved > Arrays.copyOf/-Range methods to improve peak performance in microbenchmarks > when copying the entire array, but it's resulted in a few lurking footprint > benchmark issues that come down to incurring slightly more JIT activity. As a > partial remedy I propose simplifying the implementation so that no new > methods are added compared to the JDK 20 baseline. This patch maintains the > microbenchmark speed-up from JDK-8301958 while reducing JIT compilations in > startup/footprint benchmarks. src/java.base/share/classes/java/util/Arrays.java line 3845: > 3843: */ > 3844: public static byte[] copyOfRange(byte[] original, int from, int to) > { > 3845: // Tickle the JIT to fold special cases optimally This comment has been inconsistently left in some methods and removed in others (even though the other changes appear the same). Was this intentional? - PR Review Comment: https://git.openjdk.org/jdk/pull/14380#discussion_r1223276847
Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v7]
On Thu, 1 Jun 2023 09:37:29 GMT, Aleksey Shipilev wrote: >> UUID is the very important class that is used to track identities of objects >> in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% >> of total CPU time, and is frequently a scalability bottleneck due to >> `SecureRandom` synchronization. >> >> The major issue with UUID code itself is that it reads from the single >> `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` >> is bashed with very small requests. This also has a chilling effect on other >> users of `SecureRandom`, when there is a heavy UUID generation traffic. >> >> We can improve this by doing the bulk reads from the backing SecureRandom >> and possibly striping the reads across many instances of it. >> >> >> Benchmark Mode Cnt Score Error Units >> >> ### AArch64 (m6g.4xlarge, Graviton, 16 cores) >> >> # Before >> UUIDRandomBench.single thrpt 15 3.545 ± 0.058 ops/us >> UUIDRandomBench.max thrpt 15 1.832 ± 0.059 ops/us ; negative scaling >> >> # After >> UUIDRandomBench.single thrpt 15 4.823 ± 0.023 ops/us >> UUIDRandomBench.max thrpt 15 6.561 ± 0.054 ops/us ; positive >> scaling, ~1.5x >> >> ### x86_64 (c6.8xlarge, Xeon, 18 cores) >> >> # Before >> UUIDRandomBench.single thrpt 15 2.710 ± 0.038 ops/us >> UUIDRandomBench.max thrpt 15 1.880 ± 0.029 ops/us ; negative >> scaling >> >> # After >> BenchmarkMode Cnt Score Error Units >> UUIDRandomBench.single thrpt 15 3.109 ± 0.026 ops/us >> UUIDRandomBench.max thrpt 15 3.561 ± 0.071 ops/us ; positive >> scaling, ~1.2x >> >> >> Note that there is still a scalability bottleneck in current default random >> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 >> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure >> out the synchronization story there. 
The scalability fix in current default >> `SecureRandom` would be much more intrusive and risky, since it would change >> a core crypto class with unknown bug fanout. >> >> Using the bulk reads even when the underlying PRNG is heavily synchronized >> is still a win. A more scalable PRNG would benefit from this as well. This >> PR adds a system property to select the PRNG implementation, and there we >> can clearly see the benefit with more scalable PRNG sources: >> >> >> Benchmark Mode Cnt Score Error Units >> >> ### x86_64 (c6.8xlarge, Xeon, 18 cores) >> >> # Before, hacked `new SecureRandom()` to >> `SecureRandom.getInstance("SHA1PRNG")` >> UUIDRandomBench.single thrpt ... > > Aleksey Shipilev has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 13 commits: > > - Revert test changes > - Merge branch 'master' into JDK-8308804-uuid-buffers > - Simplify clinit > - Remove the properties > - Swap lsb/msb > - Fine-tune exceptions > - Handle privileged properties > - Use ByteArray to convert. Do version/variant preparations straight on > locals. Move init out of optimistic lock section. > - More touchups > - Comment updates > - ... and 3 more: https://git.openjdk.org/jdk/compare/4460429d...fd7eaa1a src/java.base/share/classes/java/util/UUID.java line 234: > 232: long msb = ByteArray.getLong(buf, 0); > 233: long lsb = ByteArray.getLong(buf, 8); > 234: return fromRandom(msb, lsb); is there any value in moving this outside of the critical section? - PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1213117598
Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access
On Wed, 24 May 2023 19:36:44 GMT, Aleksey Shipilev wrote: > UUID is the very important class that is used to track identities of objects > in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% > of total CPU time, and is frequently a scalability bottleneck due to > `SecureRandom` synchronization. > > The major issue with UUID code itself is that it reads from the single > `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` > is bashed with very small requests. This also has a chilling effect on other > users of `SecureRandom`, when there is a heavy UUID generation traffic. > > We can improve this by doing the bulk reads from the backing SecureRandom and > possibly striping the reads across many instances of it. > > > Benchmark Mode Cnt Score Error Units > > ### AArch64 (m6g.4xlarge, Graviton, 16 cores) > > # Before > UUIDRandomBench.single thrpt 15 3.545 ± 0.058 ops/us > UUIDRandomBench.max thrpt 15 1.832 ± 0.059 ops/us ; negative scaling > > # After > UUIDRandomBench.single thrpt 15 4.421 ± 0.047 ops/us > UUIDRandomBench.max thrpt 15 6.658 ± 0.092 ops/us ; positive scaling, > ~1.5x > > ### x86_64 (c6.8xlarge, Xeon, 18 cores) > > # Before > UUIDRandomBench.single thrpt 15 2.710 ± 0.038 ops/us > UUIDRandomBench.max thrpt 15 1.880 ± 0.029 ops/us ; negative scaling > > # After > BenchmarkMode Cnt Score Error Units > UUIDRandomBench.single thrpt 15 3.099 ± 0.022 ops/us > UUIDRandomBench.max thrpt 15 3.555 ± 0.062 ops/us ; positive > scaling, ~1.2x > > > Note that there is still a scalability bottleneck in current default random > (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 > mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure > out the synchronization story there. The scalability fix in current default > `SecureRandom` would be much more intrusive and risky, since it would change > a core crypto class with unknown bug fanout. 
> > Using the bulk reads even when the underlying PRNG is heavily synchronized is > still a win. A more scalable PRNG would benefit from this as well. This PR > adds a system property to select the PRNG implementation, and there we can > clearly see the benefit with more scalable PRNG sources: > > > Benchmark Mode Cnt Score Error Units > > ### x86_64 (c6.8xlarge, Xeon, 18 cores) > > # Before, hacked `new SecureRandom()` to > `SecureRandom.getInstance("SHA1PRNG")` > UUIDRandomBench.single thrpt 15 3.661 ± 0.008 ops/us > UUIDRandomBench... src/java.base/share/classes/java/util/UUID.java line 255: > 253: // initializations, and thus false sharing between > reader threads. > 254: random.nextBytes(buf); > 255: for (int c = 0; c < BUF_SIZE; c += UUID_CHUNK) { I think this could be faster by using a ByteBuffer (or VarHandle) to process as longs. https://mail.openjdk.org/pipermail/core-libs-dev/2023-March/101249.html - PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206118465
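The suggestion above (processing the random buffer as longs via a `byteArrayViewVarHandle` rather than byte by byte) can be sketched as follows. The class and method names are illustrative, not the patch itself:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Illustrative sketch: read a buffer 8 bytes at a time as longs via a
// byteArrayViewVarHandle instead of assembling each long byte by byte.
// Native order is fine here because the bytes are random anyway.
final class LongChunks {
    private static final VarHandle LONGS =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // Split a buffer (length assumed to be a multiple of 8) into its long values.
    static long[] toLongs(byte[] buf) {
        long[] out = new long[buf.length / Long.BYTES];
        for (int c = 0; c < buf.length; c += Long.BYTES) {
            out[c / Long.BYTES] = (long) LONGS.get(buf, c);
        }
        return out;
    }
}
```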
Re: UUID creation performance
The new ByteArray class works great for the nameUUIDFromBytes method, which must be in big endian. For randomUUID, byte order does not matter, so using native would be fastest, but there does not appear to be a utility class for that. Is there a preference of just having a native order VarHandle to use in UUID vs. having a utility method which chooses which utility class to call based on the native order vs. some other option? Thanks, Brett On Wed, Mar 1, 2023 at 9:08 AM Roger Riggs wrote: > > Hi, > > That's an interesting idea. Recently VarHandle access methods were > created by JDK-8300236 [1] [2] > in the jdk.internal.util package. See the ByteArray and > ByteArrayLittleEndian classes. > > See how that would affect performance and leverage existing VarHandles. > > Thanks, Roger > > [1] https://bugs.openjdk.org/browse/JDK-8300236 > [2] https://github.com/openjdk/jdk/pull/12076 > > On 3/1/23 7:50 AM, Brett Okken wrote: > > Is there any interest in updating the static UUID.randomUUID() and > > UUID.nameUUIDFromBytes(byte[]) factory methods to use either a > > ByteBuffer or byteArrayViewVarHandle to convert the byte[] to 2 long > > values then do the bit twiddling? > > These methods are really dominated by time to create/populate the > > byte[], but this does reduce the time to create the 2 long values by > > at least half. > > It would also allow the removal of the private UUID(byte[] data). 
> >
> > public static UUID randomUUID() {
> >     SecureRandom ng = Holder.numberGenerator;
> >
> >     byte[] randomBytes = new byte[16];
> >     ng.nextBytes(randomBytes);
> >     final ByteBuffer bb = ByteBuffer.wrap(randomBytes);
> >     bb.order(ByteOrder.nativeOrder());
> >
> >     long msb = bb.getLong();
> >     long lsb = bb.getLong();
> >
> >     msb &= 0xFFFFFFFFFFFF0FFFL; /* clear version */
> >     msb |= 0x0000000000004000L; /* set to version 4 */
> >
> >     lsb &= 0x3FFFFFFFFFFFFFFFL; /* clear variant */
> >     lsb |= 0x8000000000000000L; /* set to IETF variant */
> >
> >     return new UUID(msb, lsb);
> > }
> >
> > public static UUID nameUUIDFromBytes(byte[] name) {
> >     MessageDigest md;
> >     try {
> >         md = MessageDigest.getInstance("MD5");
> >     } catch (NoSuchAlgorithmException nsae) {
> >         throw new InternalError("MD5 not supported", nsae);
> >     }
> >     byte[] md5Bytes = md.digest(name);
> >
> >     // default byte order is BIG_ENDIAN
> >     final ByteBuffer bb = ByteBuffer.wrap(md5Bytes);
> >
> >     long msb = bb.getLong();
> >     long lsb = bb.getLong();
> >
> >     msb &= 0xFFFFFFFFFFFF0FFFL; /* clear version */
> >     msb |= 0x0000000000003000L; /* set to version 3 */
> >
> >     lsb &= 0x3FFFFFFFFFFFFFFFL; /* clear variant */
> >     lsb |= 0x8000000000000000L; /* set to IETF variant */
> >
> >     return new UUID(msb, lsb);
> > }
> >
> > Benchmark                    Mode  Cnt   Score   Error  Units
> > UUIDBenchmark.jdk_name       avgt    3  11.885 ± 4.025  ns/op
> > UUIDBenchmark.jdk_random     avgt    3  11.656 ± 0.987  ns/op
> > UUIDBenchmark.longs          avgt    3   7.618 ± 1.047  ns/op
> > UUIDBenchmark.longs_bb       avgt    3   7.755 ± 1.643  ns/op
> > UUIDBenchmark.longs_name     avgt    3   8.467 ± 1.784  ns/op
> > UUIDBenchmark.longs_name_bb  avgt    3   8.455 ± 1.662  ns/op
> > UUIDBenchmark.randomBytes    avgt    3   6.132 ± 0.447  ns/op
> >
> > @BenchmarkMode(Mode.AverageTime)
> > @OutputTimeUnit(TimeUnit.NANOSECONDS)
> > @Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
> > @Measurement(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
> > @Fork(1)
> > @State(Scope.Benchmark)
> > public class UUIDBenchmark {
> >
> >     private static final VarHandle LONGS_ACCESS =
> >         MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());
> >
> >     private static final VarHandle BE_LONGS_ACCESS =
> >         MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.BIG_ENDIAN);
> >
> >     @Benchmark
> >     public byte[] randomBytes() {
> >         final byte[] bytes = new byte[16];
> >         randomBytes(bytes);
> >         return bytes;
> >     }
> >
> >     @Benchmark
UUID creation performance
Is there any interest in updating the static UUID.randomUUID() and UUID.nameUUIDFromBytes(byte[]) factory methods to use either a ByteBuffer or byteArrayViewVarHandle to convert the byte[] to 2 long values and then do the bit twiddling? These methods are really dominated by the time to create/populate the byte[], but this does reduce the time to create the 2 long values by at least half. It would also allow the removal of the private UUID(byte[] data) constructor.

public static UUID randomUUID() {
    SecureRandom ng = Holder.numberGenerator;

    byte[] randomBytes = new byte[16];
    ng.nextBytes(randomBytes);
    final ByteBuffer bb = ByteBuffer.wrap(randomBytes);
    bb.order(ByteOrder.nativeOrder());

    long msb = bb.getLong();
    long lsb = bb.getLong();

    msb &= 0xFFFFFFFFFFFF0FFFL; /* clear version */
    msb |= 0x0000000000004000L; /* set to version 4 */

    lsb &= 0x3FFFFFFFFFFFFFFFL; /* clear variant */
    lsb |= 0x8000000000000000L; /* set to IETF variant */

    return new UUID(msb, lsb);
}

public static UUID nameUUIDFromBytes(byte[] name) {
    MessageDigest md;
    try {
        md = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException nsae) {
        throw new InternalError("MD5 not supported", nsae);
    }
    byte[] md5Bytes = md.digest(name);

    // default byte order is BIG_ENDIAN
    final ByteBuffer bb = ByteBuffer.wrap(md5Bytes);

    long msb = bb.getLong();
    long lsb = bb.getLong();

    msb &= 0xFFFFFFFFFFFF0FFFL; /* clear version */
    msb |= 0x0000000000003000L; /* set to version 3 */

    lsb &= 0x3FFFFFFFFFFFFFFFL; /* clear variant */
    lsb |= 0x8000000000000000L; /* set to IETF variant */

    return new UUID(msb, lsb);
}

Benchmark                    Mode  Cnt   Score   Error  Units
UUIDBenchmark.jdk_name       avgt    3  11.885 ± 4.025  ns/op
UUIDBenchmark.jdk_random     avgt    3  11.656 ± 0.987  ns/op
UUIDBenchmark.longs          avgt    3   7.618 ± 1.047  ns/op
UUIDBenchmark.longs_bb       avgt    3   7.755 ± 1.643  ns/op
UUIDBenchmark.longs_name     avgt    3   8.467 ± 1.784  ns/op
UUIDBenchmark.longs_name_bb  avgt    3   8.455 ± 1.662  ns/op
UUIDBenchmark.randomBytes    avgt    3   6.132 ± 0.447  ns/op

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class UUIDBenchmark {

    private static final VarHandle LONGS_ACCESS =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    private static final VarHandle BE_LONGS_ACCESS =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.BIG_ENDIAN);

    @Benchmark
    public byte[] randomBytes() {
        final byte[] bytes = new byte[16];
        randomBytes(bytes);
        return bytes;
    }

    @Benchmark
    public void jdk_random(Blackhole bh) {
        final byte[] data = new byte[16];
        randomBytes(data);

        data[6] &= 0x0f; /* clear version */
        data[6] |= 0x40; /* set to version 4 */
        data[8] &= 0x3f; /* clear variant */
        data[8] |= 0x80; /* set to IETF variant */

        long msb = 0;
        long lsb = 0;
        assert data.length == 16 : "data must be 16 bytes in length";
        for (int i = 0; i < 8; i++)
            msb = (msb << 8) | (data[i] & 0xff);
        for (int i = 8; i < 16; i++)
            lsb = (lsb << 8) | (data[i] & 0xff);
        bh.consume(msb);
        bh.consume(lsb);
    }

    @Benchmark
    public void jdk_name(Blackhole bh) {
        final byte[] md5Bytes = new byte[16];
        randomBytes(md5Bytes);

        md5Bytes[6] &= 0x0f; /* clear version */
        md5Bytes[6] |= 0x30; /* set to version 3 */
        md5Bytes[8] &= 0x3f; /* clear variant */
        md5Bytes[8] |= 0x80; /* set to IETF variant */

        long msb = 0;
        long lsb = 0;
        assert md5Bytes.length == 16 : "data must be 16 bytes in length";
        for (int i = 0; i < 8; i++)
            msb = (msb << 8) | (md5Bytes[i] & 0xff);
        for (int i = 8; i < 16; i++)
            lsb = (lsb << 8) | (md5Bytes[i] & 0xff);
        bh.consume(msb);
        bh.consume(lsb);
    }

    @Benchmark
    public void longs(Blackhole bh) {
        final byte[] data = new byte[16];
        randomBytes(data);

        long msb = (long) LONGS_ACCESS.get(data, 0);
        long lsb = (long) LONGS_ACCESS.get(data, 8);

        msb &= 0xFFFFFFFFFFFF0FFFL;
        msb |= 0x0000000000004000L;
        lsb &= 0x3FFFFFFFFFFFFFFFL;
        lsb |= 0x8000000000000000L;
        bh.consume(msb);
        bh.consume(lsb);
    }

    @Benchmark
    public void longs_name(Blackhole bh) {
        final byte[] data = new byte[16];
        randomBytes(data);
        long msb =
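A self-contained version of the long-assembly approach, with the version/variant mask constants written out at full 64-bit width, can be checked directly against `UUID.version()`/`UUID.variant()`. This is a sketch of the idea from the mail, not the JDK implementation:

```java
import java.security.SecureRandom;
import java.util.UUID;

// Sketch: build a version-4 UUID from two longs assembled out of 16 random
// bytes. Mask constants are spelled out at full 64-bit width so only the
// version nibble and variant bits are touched.
final class UuidFromLongs {
    private static final SecureRandom RND = new SecureRandom();

    static UUID randomUUID() {
        byte[] b = new byte[16];
        RND.nextBytes(b);
        long msb = 0, lsb = 0;
        for (int i = 0; i < 8; i++)  msb = (msb << 8) | (b[i] & 0xff);
        for (int i = 8; i < 16; i++) lsb = (lsb << 8) | (b[i] & 0xff);
        msb &= 0xFFFFFFFFFFFF0FFFL; // clear the version nibble
        msb |= 0x0000000000004000L; // set version 4 (random)
        lsb &= 0x3FFFFFFFFFFFFFFFL; // clear the variant bits
        lsb |= 0x8000000000000000L; // set IETF variant
        return new UUID(msb, lsb);
    }
}
```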
Re: RFR: 8302863: Speed up String::encodeASCII using countPositives
On Sat, 18 Feb 2023 23:26:08 GMT, Claes Redestad wrote: > When encoding Strings to US-ASCII we can speed up the happy path > significantly by using `StringCoding.countPositives` as a speculative check > for whether there are any chars that needs to be replaced by `'?'`. Once a > non-ASCII char is encountered we fall back to the slow loop and replace as > needed. > > An alternative could be unrolling or using a byte array VarHandle, as > show-cased by Brett Okken here: > https://mail.openjdk.org/pipermail/core-libs-dev/2023-February/100573.html > Having to replace chars with `?` is essentially an encoding error so it might > be safe to assume this case is exceptional in practice. For the happy path of no encoding failures, this "trivial" change is a great improvement. - Marked as reviewed by bok...@github.com (no known OpenJDK username). PR: https://git.openjdk.org/jdk/pull/12640
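The speculative-check shape described in the PR can be sketched in plain Java: assume the whole string is ASCII, and only drop into the per-char replacement loop once a non-ASCII char is seen. The JDK's actual fast path uses the intrinsified `StringCoding.countPositives` on the underlying byte[]; this stand-in loop just illustrates the two-phase structure:

```java
// Minimal sketch of the speculative fast path / slow fallback pattern.
// Illustrative only -- not the JDK's String.encodeASCII.
final class AsciiEncode {
    static byte[] encodeAscii(String s) {
        byte[] dst = new byte[s.length()];
        int i = 0;
        // Happy path: straight copy while every char is ASCII.
        for (; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x80) break;
            dst[i] = (byte) c;
        }
        // Slow path, entered only on the first non-ASCII char:
        // replace unmappable chars with '?'.
        for (; i < s.length(); i++) {
            char c = s.charAt(i);
            dst[i] = c < 0x80 ? (byte) c : (byte) '?';
        }
        return dst;
    }
}
```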
Re: String encodeASCII
Thanks for taking a look at this. That is a pretty good improvement with a much smaller change. I recognize the intricacies of bootstrapping, but have no expertise. Would using Unsafe directly be easier? Particularly for shorter strings, doing the copying and checking together rather than as separate loops seems to speed things up considerably, even for the happy path of no failures.

Brett

On Sat, Feb 18, 2023 at 5:49 PM Claes Redestad wrote:
> Hi,
>
> general comment: String might be one of the trickier places to add a VarHandle dependency, since String is used very early in the bootstrap and depended upon by everything else, so I'd expect such a solution would have to navigate potential circularity issues carefully. It'd be good to experiment with changes to java.lang.String proper to see that the solution that works nice externally is or can be made feasible within String.
>
> Specifically on the performance opportunity then while US-ASCII encoding is probably on the way out we shouldn't neglect it.
>
> One way to go about this without pulling VarHandles into String might be to use what other encode methods in String does and leverage StringCoding.countPositives:
>
> https://github.com/openjdk/jdk/pull/12640
>
> Testing this on the existing StringEncode microbenchmark, shows a promising speed-up when the input is ASCII-encodable:
>
> Baseline
> Benchmark                      (charsetName)  Mode  Cnt      Score     Error  Units
> StringEncode.encodeAsciiLong   US-ASCII       avgt    5  26626,025 ± 448,307  ns/op
> StringEncode.encodeAsciiShort  US-ASCII       avgt    5     33,336 ±   2,032  ns/op
>
> Patch:
> Benchmark                      (charsetName)  Mode  Cnt      Score     Error  Units
> StringEncode.encodeAsciiLong   US-ASCII       avgt    5   5492,985 ±  40,066  ns/op
> StringEncode.encodeAsciiShort  US-ASCII       avgt    5     28,545 ±   4,883  ns/op
>
> (You might see that this will go back into a scalar loop on encoding failures.
> That loop could still benefit from unrolling or byteArrayViewVarHandle, but I think you have a bigger problem in an app than raw performance if you have a lot of encoding failures...)
>
> WDYT?
>
> /Claes
>
> 18 feb. 2023 kl. 19:36 skrev Brett Okken :
>
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L976-L981
>
> For String.encodeASCII, with the LATIN1 coder is there any interest in exploring the performance impacts of utilizing a byteArrayViewVarHandle to read/write as longs and utilize a bitmask to identify if negative values are present?
>
> A simple jmh benchmark covering either 0 or 1 non-ascii (negative) values shows times cut in half (or more) for most scenarios with strings ranging in length from 3 - ~2000.
> VM version: JDK 17.0.6, OpenJDK 64-Bit Server VM, 17.0.6+10
> Windows 10 Intel(R) Core(TM) i7-9850H
>
> Hand unrolling the loops shows noted improvement, but does make for less aesthetically pleasing code.
>
> Benchmark                 (nonascii)  (size)  Mode  Cnt    Score    Error  Units
> AsciiEncodeBenchmark.jdk  none             3  avgt    4   15.531 ±  1.122  ns/op
> AsciiEncodeBenchmark.jdk  none            10  avgt    4   17.350 ±  0.473  ns/op
> AsciiEncodeBenchmark.jdk  none            16  avgt    4   18.277 ±  0.421  ns/op
> AsciiEncodeBenchmark.jdk  none            23  avgt    4   20.139 ±  0.350  ns/op
> AsciiEncodeBenchmark.jdk  none            33  avgt    4   22.008 ±  0.656  ns/op
> AsciiEncodeBenchmark.jdk  none            42  avgt    4   24.393 ±  1.155  ns/op
> AsciiEncodeBenchmark.jdk  none           201  avgt    4   55.884 ±  0.645  ns/op
> AsciiEncodeBenchmark.jdk  none           511  avgt    4  120.817 ±  2.917  ns/op
> AsciiEncodeBenchmark.jdk  none          2087  avgt    4  471.039 ± 13.329  ns/op
> AsciiEncodeBenchmark.jdk  first            3  avgt    4   15.794 ±  1.494  ns/op
> AsciiEncodeBenchmark.jdk  first           10  avgt    4   18.446 ±  0.780  ns/op
> AsciiEncodeBenchmark.jdk  first           16  avgt    4   20.458 ±  0.394  ns/op
> AsciiEncodeBenchmark.jdk  first           23  avgt    4   22.934 ±  0.422  ns/op
> AsciiEncodeBenchmark.jdk  first           33  avgt    4   25.367 ±  0.178  ns/op
> AsciiEncodeBenchmark.jdk  first           42  avgt    4   28.620 ±  0.678  ns/op
> AsciiEncodeBenchmark.jdk  first          201  avgt    4   80.250 ±  4.376
String encodeASCII
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L976-L981

For String.encodeASCII, with the LATIN1 coder is there any interest in exploring the performance impacts of utilizing a byteArrayViewVarHandle to read/write as longs and utilize a bitmask to identify if negative values are present?

A simple jmh benchmark covering either 0 or 1 non-ascii (negative) values shows times cut in half (or more) for most scenarios with strings ranging in length from 3 - ~2000.
VM version: JDK 17.0.6, OpenJDK 64-Bit Server VM, 17.0.6+10
Windows 10 Intel(R) Core(TM) i7-9850H

Hand unrolling the loops shows noted improvement, but does make for less aesthetically pleasing code.

Benchmark                           (nonascii)  (size)  Mode  Cnt    Score     Error  Units
AsciiEncodeBenchmark.jdk            none             3  avgt    4   15.531 ±   1.122  ns/op
AsciiEncodeBenchmark.jdk            none            10  avgt    4   17.350 ±   0.473  ns/op
AsciiEncodeBenchmark.jdk            none            16  avgt    4   18.277 ±   0.421  ns/op
AsciiEncodeBenchmark.jdk            none            23  avgt    4   20.139 ±   0.350  ns/op
AsciiEncodeBenchmark.jdk            none            33  avgt    4   22.008 ±   0.656  ns/op
AsciiEncodeBenchmark.jdk            none            42  avgt    4   24.393 ±   1.155  ns/op
AsciiEncodeBenchmark.jdk            none           201  avgt    4   55.884 ±   0.645  ns/op
AsciiEncodeBenchmark.jdk            none           511  avgt    4  120.817 ±   2.917  ns/op
AsciiEncodeBenchmark.jdk            none          2087  avgt    4  471.039 ±  13.329  ns/op
AsciiEncodeBenchmark.jdk            first            3  avgt    4   15.794 ±   1.494  ns/op
AsciiEncodeBenchmark.jdk            first           10  avgt    4   18.446 ±   0.780  ns/op
AsciiEncodeBenchmark.jdk            first           16  avgt    4   20.458 ±   0.394  ns/op
AsciiEncodeBenchmark.jdk            first           23  avgt    4   22.934 ±   0.422  ns/op
AsciiEncodeBenchmark.jdk            first           33  avgt    4   25.367 ±   0.178  ns/op
AsciiEncodeBenchmark.jdk            first           42  avgt    4   28.620 ±   0.678  ns/op
AsciiEncodeBenchmark.jdk            first          201  avgt    4   80.250 ±   4.376  ns/op
AsciiEncodeBenchmark.jdk            first          511  avgt    4  185.518 ±   6.370  ns/op
AsciiEncodeBenchmark.jdk            first         2087  avgt    4  713.213 ±  13.488  ns/op
AsciiEncodeBenchmark.jdk            last             3  avgt    4   14.991 ±   0.190  ns/op
AsciiEncodeBenchmark.jdk            last            10  avgt    4   18.284 ±   0.317  ns/op
AsciiEncodeBenchmark.jdk            last            16  avgt    4   20.591 ±   1.002  ns/op
AsciiEncodeBenchmark.jdk            last            23  avgt    4   22.560 ±   0.963  ns/op
AsciiEncodeBenchmark.jdk            last            33  avgt    4   25.521 ±   0.554  ns/op
AsciiEncodeBenchmark.jdk            last            42  avgt    4   28.484 ±   0.446  ns/op
AsciiEncodeBenchmark.jdk            last           201  avgt    4   79.434 ±   2.256  ns/op
AsciiEncodeBenchmark.jdk            last           511  avgt    4  186.639 ±   4.258  ns/op
AsciiEncodeBenchmark.jdk            last          2087  avgt    4  725.196 ± 149.416  ns/op
AsciiEncodeBenchmark.longCheckCopy  none             3  avgt    4    7.222 ±   0.428  ns/op
AsciiEncodeBenchmark.longCheckCopy  none            10  avgt    4    8.070 ±   0.171  ns/op
AsciiEncodeBenchmark.longCheckCopy  none            16  avgt    4    6.711 ±   0.409  ns/op
AsciiEncodeBenchmark.longCheckCopy  none            23  avgt    4   12.906 ±   3.633  ns/op
AsciiEncodeBenchmark.longCheckCopy  none            33  avgt    4   10.414 ±   0.447  ns/op
AsciiEncodeBenchmark.longCheckCopy  none            42  avgt    4   11.935 ±   1.235  ns/op
AsciiEncodeBenchmark.longCheckCopy  none           201  avgt    4   29.538 ±   3.265  ns/op
AsciiEncodeBenchmark.longCheckCopy  none           511  avgt    4  106.228 ±  68.475  ns/op
AsciiEncodeBenchmark.longCheckCopy  none          2087  avgt    4  494.845 ± 890.985  ns/op
AsciiEncodeBenchmark.longCheckCopy  first            3  avgt    4    7.775 ±   0.278  ns/op
AsciiEncodeBenchmark.longCheckCopy  first           10  avgt    4   13.396 ±   2.072  ns/op
AsciiEncodeBenchmark.longCheckCopy  first           16  avgt    4   13.528 ±   0.702  ns/op
AsciiEncodeBenchmark.longCheckCopy  first           23  avgt    4   17.376 ±   0.360  ns/op
AsciiEncodeBenchmark.longCheckCopy  first           33  avgt    4   16.251 ±   0.203  ns/op
AsciiEncodeBenchmark.longCheckCopy  first           42  avgt    4   17.932 ±   1.773  ns/op
AsciiEncodeBenchmark.longCheckCopy  first          201  avgt    4   39.028 ±   4.699  ns/op
AsciiEncodeBenchmark.longCheckCopy  first          5
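The bitmask check the post describes (read 8 bytes at a time as a long and test the sign bit of every byte at once) can be sketched as follows. The class and method names are illustrative:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Sketch: detect a negative (non-ASCII) byte 8 bytes at a time.
// 0x80 is the sign bit of each byte, so OR-ing all eight sign bits into one
// test replaces eight per-byte comparisons.
final class NegativeScan {
    private static final VarHandle LONGS =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());
    private static final long SIGN_BITS = 0x8080808080808080L;

    static boolean hasNegative(byte[] a) {
        int i = 0;
        // Whole 8-byte chunks.
        for (; i + Long.BYTES <= a.length; i += Long.BYTES) {
            if (((long) LONGS.get(a, i) & SIGN_BITS) != 0) {
                return true;
            }
        }
        // Tail bytes, if the length is not a multiple of 8.
        for (; i < a.length; i++) {
            if (a[i] < 0) return true;
        }
        return false;
    }
}
```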
Re: RFR: 8298033: Character.codePointAt(char[], int, int) doesn't do JavaDoc-specified check
On Fri, 2 Dec 2022 12:44:18 GMT, Sergey Tsypanov wrote: > I found out that this code > > public class Main { > public static void main(String[] args) { > String s = "Hello world!"; > char[] chars = s.toCharArray(); > int point = Character.codePointAt(chars, -1, 1); > } > } > > throws `ArrayIndexOutOfBoundsException` instead of JavaDoc-specified > `IndexOutOfBoundsException`: > > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 > out of bounds for length 12 > at java.base/java.lang.Character.codePointAtImpl(Character.java:9254) > at java.base/java.lang.Character.codePointAt(Character.java:9249) > at org.example.Main.main(Main.java:7) > > and the method doesn't check whether `index` parameter is negative: > > public static int codePointAt(char[] a, int index, int limit) { > if (index >= limit || limit < 0 || limit > a.length) { > throw new IndexOutOfBoundsException(); > } > return codePointAtImpl(a, index, limit); > } > > I suggest to check the `index` parameter explicitly instead of relying on > AIOOBE thrown from accessing the array with negative index. As `ArrayIndexOutOfBoundsException` is an `IndexOutOfBoundsException`, it is not clear to me how this is not matching the javadoc/spec. https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/ArrayIndexOutOfBoundsException.html - PR: https://git.openjdk.org/jdk/pull/11480
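The point can be demonstrated directly: `ArrayIndexOutOfBoundsException` extends `IndexOutOfBoundsException`, so a caller catching the documented `IndexOutOfBoundsException` also catches whichever of the two subtypes the JDK in use actually throws here:

```java
// Minimal illustration: a handler for the documented IndexOutOfBoundsException
// catches the exception thrown for a negative index, whether the JDK raises
// AIOOBE (from the array access) or a plain IOOBE (from an explicit check).
final class CatchDemo {
    static boolean caughtAsIoobe() {
        try {
            Character.codePointAt(new char[] {'a'}, -1, 1);
            return false; // not reached: a negative index always throws
        } catch (IndexOutOfBoundsException e) {
            return true;  // matches ArrayIndexOutOfBoundsException too
        }
    }
}
```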