Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v7]
On Thu, 1 Jun 2023 09:37:29 GMT, Aleksey Shipilev wrote: >> UUID is the very important class that is used to track identities of objects >> in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% >> of total CPU time, and is frequently a scalability bottleneck due to >> `SecureRandom` synchronization. >> >> The major issue with UUID code itself is that it reads from the single >> `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` >> is bashed with very small requests. This also has a chilling effect on other >> users of `SecureRandom`, when there is a heavy UUID generation traffic. >> >> We can improve this by doing the bulk reads from the backing SecureRandom >> and possibly striping the reads across many instances of it. >> >> >> Benchmark Mode Cnt Score Error Units >> >> ### AArch64 (m6g.4xlarge, Graviton, 16 cores) >> >> # Before >> UUIDRandomBench.single thrpt 15 3.545 ± 0.058 ops/us >> UUIDRandomBench.max thrpt 15 1.832 ± 0.059 ops/us ; negative scaling >> >> # After >> UUIDRandomBench.single thrpt 15 4.823 ± 0.023 ops/us >> UUIDRandomBench.max thrpt 15 6.561 ± 0.054 ops/us ; positive >> scaling, ~1.5x >> >> ### x86_64 (c6.8xlarge, Xeon, 18 cores) >> >> # Before >> UUIDRandomBench.single thrpt 15 2.710 ± 0.038 ops/us >> UUIDRandomBench.max thrpt 15 1.880 ± 0.029 ops/us ; negative >> scaling >> >> # After >> BenchmarkMode Cnt Score Error Units >> UUIDRandomBench.single thrpt 15 3.109 ± 0.026 ops/us >> UUIDRandomBench.max thrpt 15 3.561 ± 0.071 ops/us ; positive >> scaling, ~1.2x >> >> >> Note that there is still a scalability bottleneck in current default random >> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 >> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure >> out the synchronization story there. 
The scalability fix in current default >> `SecureRandom` would be much more intrusive and risky, since it would change >> a core crypto class with unknown bug fanout. >> >> Using the bulk reads even when the underlying PRNG is heavily synchronized >> is still a win. A more scalable PRNG would benefit from this as well. This >> PR adds a system property to select the PRNG implementation, and there we >> can clearly see the benefit with more scalable PRNG sources: >> >> >> Benchmark Mode Cnt Score Error Units >> >> ### x86_64 (c6.8xlarge, Xeon, 18 cores) >> >> # Before, hacked `new SecureRandom()` to >> `SecureRandom.getInstance("SHA1PRNG")` >> UUIDRandomBench.single thrpt ... > > Aleksey Shipilev has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 13 commits: > > - Revert test changes > - Merge branch 'master' into JDK-8308804-uuid-buffers > - Simplify clinit > - Remove the properties > - Swap lsb/msb > - Fine-tune exceptions > - Handle privileged properties > - Use ByteArray to convert. Do version/variant preparations straight on > locals. Move init out of optimistic lock section. > - More touchups > - Comment updates > - ... and 3 more: https://git.openjdk.org/jdk/compare/4460429d...fd7eaa1a src/java.base/share/classes/java/util/UUID.java line 181: > 179: > 180: private static UUID fromRandom(long msb, long lsb) { > 181: // set version to 3 The comment doesn't match the code. Is it version 4? - PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1213472422
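For readers checking the version comment against the code: RFC 4122 puts the 4-bit version field in bits 12..15 of the most significant long, and the 2-bit IETF variant in the top two bits of the least significant long. A minimal, hypothetical sketch of that bit twiddling (class and method names are mine, not from the patch):

```java
// Hypothetical helper showing RFC 4122 version-4 / IETF-variant bit
// twiddling on the two 64-bit halves; not the patch's actual code.
public class UuidBits {
    static long setVersion4(long msb) {
        // Clear the four version bits (12..15), then set them to 0b0100.
        return (msb & ~0xF000L) | 0x4000L;
    }

    static long setIetfVariant(long lsb) {
        // Clear the top two bits, then set them to 0b10.
        return (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
    }

    public static void main(String[] args) {
        long msb = setVersion4(0xDEADBEEFDEADBEEFL);
        long lsb = setIetfVariant(0xCAFEBABECAFEBABEL);
        System.out.println((msb >> 12) & 0xF); // version field: 4
        System.out.println(lsb >>> 62);        // variant bits: 2 (0b10)
    }
}
```

If the patch indeed writes `0x4000` into the version nibble, it produces version 4, and the `// set version to 3` comment is the part that needs updating.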
Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v3]
On Fri, 26 May 2023 11:43:11 GMT, Aleksey Shipilev wrote: >> UUID is the very important class that is used to track identities of objects >> in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% >> of total CPU time, and is frequently a scalability bottleneck due to >> `SecureRandom` synchronization. >> >> The major issue with UUID code itself is that it reads from the single >> `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` >> is bashed with very small requests. This also has a chilling effect on other >> users of `SecureRandom`, when there is a heavy UUID generation traffic. >> >> We can improve this by doing the bulk reads from the backing SecureRandom >> and possibly striping the reads across many instances of it. >> >> >> Benchmark Mode Cnt Score Error Units >> >> ### AArch64 (m6g.4xlarge, Graviton, 16 cores) >> >> # Before >> UUIDRandomBench.single thrpt 15 3.545 ± 0.058 ops/us >> UUIDRandomBench.max thrpt 15 1.832 ± 0.059 ops/us ; negative scaling >> >> # After >> UUIDRandomBench.single thrpt 15 4.823 ± 0.023 ops/us >> UUIDRandomBench.max thrpt 15 6.561 ± 0.054 ops/us ; positive >> scaling, ~1.5x >> >> ### x86_64 (c6.8xlarge, Xeon, 18 cores) >> >> # Before >> UUIDRandomBench.single thrpt 15 2.710 ± 0.038 ops/us >> UUIDRandomBench.max thrpt 15 1.880 ± 0.029 ops/us ; negative >> scaling >> >> # After >> BenchmarkMode Cnt Score Error Units >> UUIDRandomBench.single thrpt 15 3.109 ± 0.026 ops/us >> UUIDRandomBench.max thrpt 15 3.561 ± 0.071 ops/us ; positive >> scaling, ~1.2x >> >> >> Note that there is still a scalability bottleneck in current default random >> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 >> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure >> out the synchronization story there. 
The scalability fix in current default >> `SecureRandom` would be much more intrusive and risky, since it would change >> a core crypto class with unknown bug fanout. >> >> Using the bulk reads even when the underlying PRNG is heavily synchronized >> is still a win. A more scalable PRNG would benefit from this as well. This >> PR adds a system property to select the PRNG implementation, and there we >> can clearly see the benefit with more scalable PRNG sources: >> >> >> Benchmark Mode Cnt Score Error Units >> >> ### x86_64 (c6.8xlarge, Xeon, 18 cores) >> >> # Before, hacked `new SecureRandom()` to >> `SecureRandom.getInstance("SHA1PRNG")` >> UUIDRandomBench.single thrpt ... > > Aleksey Shipilev has updated the pull request incrementally with one > additional commit since the last revision: > > Fine-tune exceptions src/java.base/share/classes/java/util/UUID.java line 226: > 224: // set variant to IETF > 225: msb = (msb & (0x3FFF___L)) | > 0x8000___L; > 226: return new UUID(lsb, msb); Doesn't `msb` [come first](https://github.com/openjdk/jdk/blob/deddaa6ea61f85e9822982a37dad51ff4310457b/src/java.base/share/classes/java/util/UUID.java#L314)? - PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206680628
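As a quick sanity check of the argument-order question above, `java.util.UUID`'s two-argument constructor takes the most significant half first:

```java
import java.util.UUID;

// The constructor signature is UUID(long mostSigBits, long leastSigBits),
// so passing (lsb, msb) would silently swap the halves.
public class UuidCtorOrder {
    public static void main(String[] args) {
        UUID u = new UUID(0x1111111111111111L, 0x2222222222222222L);
        System.out.println(u.getMostSignificantBits() == 0x1111111111111111L);
        System.out.println(u.getLeastSignificantBits() == 0x2222222222222222L);
    }
}
```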
Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v3]
On Fri, 26 May 2023 08:36:46 GMT, Aleksey Shipilev wrote: >> src/java.base/share/classes/java/util/UUID.java line 224: >> >>> 222: // Try to pull the UUID from the current buffer at >>> current position. >>> 223: if (stamp != 0) { >>> 224: int p = (int)VH_POS.getAndAdd(this, >>> UUID_CHUNK); >> >> An atomic update together with an optimistic lock looks like non-idiomatic use of >> StampedLock to me. Won't a simple CAS loop be simpler? E.g. in pseudocode: >> >> while (((p = this.pos) + 16) < buf.length) { >> long msb = getLong(buf, p); >> long lsb = getLong(buf, p + 8); >> if (cas(this.pos, p, p + 16)) { >> return new UUID(msb, lsb); >> } >> } >> >> // refill buffer under lock > > Nope, I don't think you can do this, because there is a race on the > concurrently-updated `buf`. The StampedLock (RW lock) protects the > buffer-reading threads from seeing `buf` undergoing initialization by > buffer-writing threads. The atomic `pos` only arbitrates between the concurrent > buffer-reading threads; it does not help to resolve the `buf` races. Right, my example suffers from the ABA problem. It could likely be solved by adding "generation" bits to the pos, but this would be a bit ugly. OK, let's stick with the StampedLock. - PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206686499
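For readers following the thread, the resolved design can be sketched as a standalone class. This is a simplified, hypothetical rendering (buffer size, `java.util.Random` as the refill source, and all names are my assumptions, not the patch): the optimistic stamp guards visibility of `buf` across refills, while the atomic `getAndAdd` on `pos` arbitrates claims between concurrent readers.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.Random;
import java.util.concurrent.locks.StampedLock;

public class BufferedIds {
    private static final int CHUNK = 16;
    private static final int BUF_SIZE = 16 * CHUNK;
    private static final VarHandle VH_POS;
    static {
        try {
            VH_POS = MethodHandles.lookup()
                    .findVarHandle(BufferedIds.class, "pos", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private final StampedLock lock = new StampedLock();
    private final byte[] buf = new byte[BUF_SIZE];
    private volatile int pos = BUF_SIZE; // "empty", forces refill on first use

    byte[] next() {
        long stamp = lock.tryOptimisticRead();
        if (stamp != 0) {
            // Claim a chunk; the atomic only arbitrates between readers.
            int p = (int) VH_POS.getAndAdd(this, CHUNK);
            if (p + CHUNK <= BUF_SIZE) {
                byte[] chunk = new byte[CHUNK];
                System.arraycopy(buf, p, chunk, 0, CHUNK);
                // validate() catches a refill racing with our reads of buf.
                if (lock.validate(stamp)) {
                    return chunk;
                }
            }
        }
        // Slow path: refill under the write lock. A lost optimistic race may
        // trigger a redundant refill here; harmless for this sketch.
        long ws = lock.writeLock();
        try {
            new Random().nextBytes(buf); // stand-in for a bulk SecureRandom read
            byte[] chunk = new byte[CHUNK];
            System.arraycopy(buf, 0, chunk, 0, CHUNK);
            pos = CHUNK; // first chunk is already consumed
            return chunk;
        } finally {
            lock.unlockWrite(ws);
        }
    }

    public static void main(String[] args) {
        System.out.println(new BufferedIds().next().length);
    }
}
```

The CAS-loop alternative from the quoted pseudocode fails precisely because nothing in it detects a refill happening between reading `buf` and the CAS on `pos`, which is the ABA hazard the thread converges on.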
Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access
On Wed, 24 May 2023 19:36:44 GMT, Aleksey Shipilev wrote: > UUID is the very important class that is used to track identities of objects > in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% > of total CPU time, and is frequently a scalability bottleneck due to > `SecureRandom` synchronization. > > The major issue with UUID code itself is that it reads from the single > `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` > is bashed with very small requests. This also has a chilling effect on other > users of `SecureRandom`, when there is a heavy UUID generation traffic. > > We can improve this by doing the bulk reads from the backing SecureRandom and > possibly striping the reads across many instances of it. > > > Benchmark Mode Cnt Score Error Units > > ### AArch64 (m6g.4xlarge, Graviton, 16 cores) > > # Before > UUIDRandomBench.single thrpt 15 3.545 ± 0.058 ops/us > UUIDRandomBench.max thrpt 15 1.832 ± 0.059 ops/us ; negative scaling > > # After > UUIDRandomBench.single thrpt 15 4.421 ± 0.047 ops/us > UUIDRandomBench.max thrpt 15 6.658 ± 0.092 ops/us ; positive scaling, > ~1.5x > > ### x86_64 (c6.8xlarge, Xeon, 18 cores) > > # Before > UUIDRandomBench.single thrpt 15 2.710 ± 0.038 ops/us > UUIDRandomBench.max thrpt 15 1.880 ± 0.029 ops/us ; negative scaling > > # After > BenchmarkMode Cnt Score Error Units > UUIDRandomBench.single thrpt 15 3.099 ± 0.022 ops/us > UUIDRandomBench.max thrpt 15 3.555 ± 0.062 ops/us ; positive > scaling, ~1.2x > > > Note that there is still a scalability bottleneck in current default random > (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 > mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure > out the synchronization story there. The scalability fix in current default > `SecureRandom` would be much more intrusive and risky, since it would change > a core crypto class with unknown bug fanout. 
> > Using the bulk reads even when the underlying PRNG is heavily synchronized is > still a win. A more scalable PRNG would benefit from this as well. This PR > adds a system property to select the PRNG implementation, and there we can > clearly see the benefit with more scalable PRNG sources: > > > Benchmark Mode Cnt Score Error Units > > ### x86_64 (c6.8xlarge, Xeon, 18 cores) > > # Before, hacked `new SecureRandom()` to > `SecureRandom.getInstance("SHA1PRNG")` > UUIDRandomBench.single thrpt 15 3.661 ± 0.008 ops/us > UUIDRandomBench... src/java.base/share/classes/java/util/UUID.java line 224: > 222: // Try to pull the UUID from the current buffer at > current position. > 223: if (stamp != 0) { > 224: int p = (int)VH_POS.getAndAdd(this, UUID_CHUNK); An atomic update together with an optimistic lock looks like non-idiomatic use of StampedLock to me. Won't a simple CAS loop be simpler? E.g. in pseudocode: while (((p = this.pos) + 16) < buf.length) { long msb = getLong(buf, p); long lsb = getLong(buf, p + 8); if (cas(this.pos, p, p + 16)) { return new UUID(msb, lsb); } } // refill buffer under lock src/java.base/share/classes/java/util/UUID.java line 226: > 224: int p = (int)VH_POS.getAndAdd(this, UUID_CHUNK); > 225: if (p < BUF_SIZE) { > 226: uuid = new UUID(buf, p); We can read msb/lsb from the buffer here and move object allocation outside the lock to reduce the length of the critical section. src/java.base/share/classes/java/util/UUID.java line 260: > 258: buf[c + 8] &= 0x3f; /* clear variant*/ > 259: buf[c + 8] |= (byte) 0x80; /* set to IETF > variant */ > 260: } I'm not sure I understand the point about initialization. Why not just set the required version bits right in the UUID constructor without updating the array? This will be faster and simpler IMO.
src/java.base/share/classes/java/util/UUID.java line 286: > 284: long lsb = 0; > 285: for (int i = start; i < start + 8; i++) { > 286: msb = (msb << 8) | (data[i] & 0xff); Can we use VarHandle/ByteBuffer to read 64 bits at a time? - PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206104940 PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206105877 PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206096052 PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206097261
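The 64-bit-at-a-time suggestion corresponds, in public-API terms, to a byte-array view VarHandle (the commit list above mentions moving to the internal `ByteArray` utility, which wraps the same mechanism). A small sketch, with big-endian order chosen to match the per-byte loop being reviewed:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class LongReads {
    // Plain get/set through a byte-array view VarHandle permits unaligned
    // offsets; only the atomic access modes require alignment.
    private static final VarHandle LONGS =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.BIG_ENDIAN);

    static long readLong(byte[] data, int offset) {
        return (long) LONGS.get(data, offset);
    }

    public static void main(String[] args) {
        byte[] data = new byte[16];
        data[7] = 0x2A; // big-endian: lowest byte of the first long
        System.out.println(readLong(data, 0)); // 42
    }
}
```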
Re: RFR: 8306075: Micro-optimize Enum.hashCode [v6]
On Mon, 17 Apr 2023 16:42:38 GMT, olivergillespie wrote: >> Improve the speed of Enum.hashCode by caching the identity hashcode on first >> use. I've seen an application where Enum.hashCode is a hot path, and this is >> fairly simple speedup. The memory overhead is low; in enums with no extra >> fields there is already a 4-byte space due to alignment so this new field >> can slot in 'for free'. In other cases, the singleton nature of enum values >> means that the number of total instances is typically very low, so a small >> per-instance overhead is not a concern. >> >> Please see more discussion/explanation in the [original enhancement >> request](https://bugs.openjdk.org/browse/JDK-8306075). [Benchmark results >> and generated code analysis moved to comment] >> >> Thanks @shipilev for help with the implementation and interpreting the >> generated code. > > olivergillespie has updated the pull request incrementally with one > additional commit since the last revision: > > Shuffle docs Marked as reviewed by apangin (no project role). - PR Review: https://git.openjdk.org/jdk/pull/13491#pullrequestreview-1392235226
Re: RFR: 8306075: Micro-optimize Enum.hashCode [v3]
On Mon, 17 Apr 2023 13:27:43 GMT, olivergillespie wrote: >> Improve the speed of Enum.hashCode by caching the identity hashcode on first >> use. I've seen an application where Enum.hashCode is a hot path, and this is >> fairly simple speedup. The memory overhead is low; in enums with no extra >> fields there is already a 4-byte space due to alignment so this new field >> can slot in 'for free'. In other cases, the singleton nature of enum values >> means that the number of total instances is typically very low, so a small >> per-instance overhead is not a concern. >> >> Please see more discussion/explanation in the [original enhancement >> request](https://bugs.openjdk.org/browse/JDK-8306075). >> >> ### Benchmark >> >> >> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> # Intel Cascade lake >> EnumHashCode.constant avgt 15 1.602 ± 0.011 ns/op >> EnumHashCode.field avgt 15 1.681 ± 0.014 ns/op >> # Arm Neoverse N1 >> EnumHashCode.constant avgt 15 1.642 ± 0.033 ns/op >> EnumHashCode.field avgt 15 1.717 ± 0.059 ns/op >> >> >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> # Intel Cascade lake >> EnumHashCode.constant avgt 15 0.479 ± 0.001 ns/op >> EnumHashCode.field avgt 15 0.799 ± 0.002 ns/op >> # Arm Neoverse N1 >> EnumHashCode.constant avgt 15 0.802 ± 0.002 ns/op >> EnumHashCode.field avgt 15 1.059 ± 0.056 ns/op >> >> >> Using `-prof perfasm` on the benchmark, we can compare the generated code >> for x86_64: >> >> Before: >> >> │ 0x7fae4868dd17: lea(%r12,%r10,8),%rsi ;*getfield e >> {reexecute=0 rethrow=0 return_oop=0} >> │ ; - >> org.sample.EnumHashCode::field@1 (line 24) >> │ ; - >> org.sample.jmh_generated.EnumHashCode_field_jmhTest::field_avgt_jmhStub@17 >> (line 186) >> │ 0x7fae4868dd1b: mov(%rsi),%r10 >> │ 0x7fae4868dd1e: mov%r10,%r11 >> │ 0x7fae4868dd21: and$0x3,%r11 >> │ 0x7fae4868dd25: cmp$0x1,%r11 >> │ 0x7fae4868dd29: jne0x7fae4868dcc6 >> │ 0x7fae4868dd2b: shr$0x8,%r10 >> │ 0x7fae4868dd2f: mov%r10d,%eax >> │ 0x7fae4868dd32: and$0x7fff,%eax 
>> │ 0x7fae4868dd37: test %eax,%eax >> │ 0x7fae4868dd39: je 0x7fae4868dcc6 ;*invokespecial >> hashCode {reexecute=0 rethrow=0 return_oop=0} >> │ ; - >> java.lang.Enum::hashCode@1 (line 175) >> >> >> This is the normal Object.hashCode intrinsic, which involves reading the >> object header, extracting the hash code and handling two slow-path cases >> (displaced object header, hash not initialized). >> >> After: >> >> >> │ 0x7f550068e3b4: mov0x10(%r12,%r10,8),%r8d <-- read the hash >> field >> │ 0x7f550068e3b9: test %r8d,%r8d <-- if (hash == 0) >> │ 0x7f550068e3bc: je 0x7f550068e413 <-- slow init >> path, only taken on first use >> >> >> Thanks @shipilev for help with the implementation and interpreting the >> generated code. > > olivergillespie has refreshed the contents of this pull request, and previous > commits have been removed. Incremental views are not available. The pull > request now contains one commit: > > 8306075: Micro-optimize Enum.hashCode src/java.base/share/classes/java/lang/Enum.java line 181: > 179: */ > 180: @Stable > 181: private int hash; Should not we hide the field from reflection? I understand this is an implementation detail, however, it makes a big difference that you add a private field to the *public* class that is *meant* to be inherited by user classes. Although I'm not aware of a particular example, this can be a breaking change for user code that does introspection on enums. - PR Review Comment: https://git.openjdk.org/jdk/pull/13491#discussion_r1168724994
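The caching idiom under review can be imitated outside the JDK. This hypothetical stand-in omits the JDK-internal `@Stable` annotation (unavailable to application code) and leans on 0 as the "not yet computed" sentinel:

```java
public class CachedHash {
    private int hash; // 0 means "not computed yet"

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            // Object.hashCode() (the identity hash) is cached on first use.
            // If it ever returned 0, we would merely recompute on each call,
            // which is benign.
            h = super.hashCode();
            hash = h;
        }
        return h;
    }

    public static void main(String[] args) {
        CachedHash c = new CachedHash();
        System.out.println(c.hashCode() == c.hashCode()); // stable across calls
    }
}
```

Inside the JDK, `@Stable` additionally lets the JIT constant-fold the field once it is non-zero, which is part of where the benchmarked speedup comes from.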