Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v7]

2023-06-01 Thread Andrei Pangin
On Thu, 1 Jun 2023 09:37:29 GMT, Aleksey Shipilev  wrote:

>> UUID is a very important class, used to track identities of objects 
>> in large-scale systems. On some of our systems, `UUID.randomUUID` takes >1% 
>> of total CPU time, and is frequently a scalability bottleneck due to 
>> `SecureRandom` synchronization.
>> 
>> The major issue in the UUID code itself is that it reads from the single 
>> `SecureRandom` instance in 16-byte chunks, so the heavily contended 
>> `SecureRandom` is bashed with very small requests. This also has a chilling 
>> effect on other users of `SecureRandom` under heavy UUID generation traffic.
>> 
>> We can improve this by doing bulk reads from the backing `SecureRandom` 
>> and possibly striping the reads across many instances of it. 
>> 
>> 
>> Benchmark   Mode  Cnt  Score   Error   Units
>> 
>> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
>> UUIDRandomBench.max thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
>> 
>> # After
>> UUIDRandomBench.single  thrpt   15  4.823 ± 0.023  ops/us 
>> UUIDRandomBench.max thrpt   15  6.561 ± 0.054  ops/us ; positive 
>> scaling, ~1.5x
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
>> UUIDRandomBench.max thrpt   15  1.880 ± 0.029  ops/us  ; negative 
>> scaling 
>> 
>> # After
>> Benchmark   Mode  Cnt  Score   Error   Units
>> UUIDRandomBench.single  thrpt   15  3.109 ± 0.026  ops/us
>> UUIDRandomBench.max thrpt   15  3.561 ± 0.071  ops/us  ; positive 
>> scaling, ~1.2x
>> 
>> 
>> Note that there is still a scalability bottleneck in current default random 
>> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 
>> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure 
>> out the synchronization story there. The scalability fix in current default 
>> `SecureRandom` would be much more intrusive and risky, since it would change 
>> a core crypto class with unknown bug fanout.
>> 
>> Using the bulk reads even when the underlying PRNG is heavily synchronized 
>> is still a win. A more scalable PRNG would benefit from this as well. This 
>> PR adds a system property to select the PRNG implementation, and there we 
>> can clearly see the benefit with more scalable PRNG sources:
>> 
>> 
>> Benchmark   Mode  Cnt   Score   Error   Units
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before, hacked `new SecureRandom()` to 
>> `SecureRandom.getInstance("SHA1PRNG")`
>> UUIDRandomBench.single  thrpt  ...
>
> Aleksey Shipilev has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 13 commits:
> 
>  - Revert test changes
>  - Merge branch 'master' into JDK-8308804-uuid-buffers
>  - Simplify clinit
>  - Remove the properties
>  - Swap lsb/msb
>  - Fine-tune exceptions
>  - Handle privileged properties
>  - Use ByteArray to convert. Do version/variant preparations straight on 
> locals. Move init out of optimistic lock section.
>  - More touchups
>  - Comment updates
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/4460429d...fd7eaa1a

src/java.base/share/classes/java/util/UUID.java line 181:

> 179: 
> 180: private static UUID fromRandom(long msb, long lsb) {
> 181: // set version to 3

The comment doesn't match the code. Is it version 4?
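A quick standalone check (my own sketch, not the PR's code): random UUIDs carry version 4, decoded from bits 12..15 of the msb, so the comment should indeed say 4.

```java
import java.util.UUID;

public class VersionCheck {
    public static void main(String[] args) {
        UUID u = UUID.randomUUID();
        System.out.println(u.version()); // prints 4
        // version() decodes ((msb >> 12) & 0xF):
        System.out.println((u.getMostSignificantBits() >> 12) & 0xF); // prints 4
    }
}
```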

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1213472422


Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v3]

2023-05-26 Thread Andrei Pangin
On Fri, 26 May 2023 11:43:11 GMT, Aleksey Shipilev  wrote:

>> UUID is a very important class, used to track identities of objects 
>> in large-scale systems. On some of our systems, `UUID.randomUUID` takes >1% 
>> of total CPU time, and is frequently a scalability bottleneck due to 
>> `SecureRandom` synchronization.
>> 
>> The major issue in the UUID code itself is that it reads from the single 
>> `SecureRandom` instance in 16-byte chunks, so the heavily contended 
>> `SecureRandom` is bashed with very small requests. This also has a chilling 
>> effect on other users of `SecureRandom` under heavy UUID generation traffic.
>> 
>> We can improve this by doing bulk reads from the backing `SecureRandom` 
>> and possibly striping the reads across many instances of it. 
>> 
>> 
>> Benchmark   Mode  Cnt  Score   Error   Units
>> 
>> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
>> UUIDRandomBench.max thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
>> 
>> # After
>> UUIDRandomBench.single  thrpt   15  4.823 ± 0.023  ops/us 
>> UUIDRandomBench.max thrpt   15  6.561 ± 0.054  ops/us ; positive 
>> scaling, ~1.5x
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
>> UUIDRandomBench.max thrpt   15  1.880 ± 0.029  ops/us  ; negative 
>> scaling 
>> 
>> # After
>> Benchmark   Mode  Cnt  Score   Error   Units
>> UUIDRandomBench.single  thrpt   15  3.109 ± 0.026  ops/us
>> UUIDRandomBench.max thrpt   15  3.561 ± 0.071  ops/us  ; positive 
>> scaling, ~1.2x
>> 
>> 
>> Note that there is still a scalability bottleneck in current default random 
>> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 
>> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure 
>> out the synchronization story there. The scalability fix in current default 
>> `SecureRandom` would be much more intrusive and risky, since it would change 
>> a core crypto class with unknown bug fanout.
>> 
>> Using the bulk reads even when the underlying PRNG is heavily synchronized 
>> is still a win. A more scalable PRNG would benefit from this as well. This 
>> PR adds a system property to select the PRNG implementation, and there we 
>> can clearly see the benefit with more scalable PRNG sources:
>> 
>> 
>> Benchmark   Mode  Cnt   Score   Error   Units
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before, hacked `new SecureRandom()` to 
>> `SecureRandom.getInstance("SHA1PRNG")`
>> UUIDRandomBench.single  thrpt  ...
>
> Aleksey Shipilev has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Fine-tune exceptions

src/java.base/share/classes/java/util/UUID.java line 226:

> 224: // set variant to IETF
> 225: msb = (msb & 0x3FFF_FFFF_FFFF_FFFFL) | 0x8000_0000_0000_0000L;
> 226: return new UUID(lsb, msb);

Doesn't `msb` [come 
first](https://github.com/openjdk/jdk/blob/deddaa6ea61f85e9822982a37dad51ff4310457b/src/java.base/share/classes/java/util/UUID.java#L314)?
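For illustration, a tiny sketch (not the PR's code) of why the argument order matters: the constructor takes the most-significant bits first, so swapping the arguments silently produces a different UUID.

```java
import java.util.UUID;

public class MsbLsbOrder {
    public static void main(String[] args) {
        long msb = 0x1122334455667788L;
        long lsb = 0x99AABBCCDDEEFF00L;
        // UUID(long mostSigBits, long leastSigBits): msb comes first.
        UUID u = new UUID(msb, lsb);
        System.out.println(u.getMostSignificantBits() == msb); // prints true
        // Swapped arguments compile fine but yield a different value:
        UUID swapped = new UUID(lsb, msb);
        System.out.println(u.equals(swapped));                 // prints false
    }
}
```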

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206680628


Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v3]

2023-05-26 Thread Andrei Pangin
On Fri, 26 May 2023 08:36:46 GMT, Aleksey Shipilev  wrote:

>> src/java.base/share/classes/java/util/UUID.java line 224:
>> 
>>> 222: // Try to pull the UUID from the current buffer at 
>>> current position.
>>> 223: if (stamp != 0) {
>>> 224: int p = (int)VH_POS.getAndAdd(this, 
>>> UUID_CHUNK);
>> 
>> An atomic update together with an optimistic lock looks non-idiomatic use of 
>> StampedLock to me. Won't a simple CAS loop be simpler? E.g. in pseudocode:
>> 
>> while (((p = this.pos) + 16) < buf.length) {
>> long msb = getLong(buf, p);
>> long lsb = getLong(buf, p + 8);
>> if (cas(this.pos, p, p + 16)) {
>> return new UUID(msb, lsb);
>> }
>> }
>> 
>> // refill buffer under lock
>
> Nope, I don't think you can do this, because there is a race with the 
> concurrently-updated `buf`. The StampedLock (an RW lock) protects the 
> buffer-reading threads from seeing `buf` while it undergoes initialization by 
> buffer-writing threads. The atomic pos only arbitrates among the concurrent 
> buffer-reading threads; it does not help to resolve the `buf` races.

Right, my example suffers from the ABA problem. It could likely be solved by 
adding "generation" bits to the pos, but that would be a bit ugly. OK, let's 
stick with the StampedLock.
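For the record, here is roughly what I meant by generation bits (a hypothetical sketch, names are mine): pack a buffer generation into the upper half of the position word, so a CAS taken against a stale generation fails after a refill.

```java
import java.util.concurrent.atomic.AtomicLong;

public class GenPos {
    // (generation << 32) | position, updated atomically as one word.
    private final AtomicLong genPos = new AtomicLong();

    // Try to claim `chunk` bytes from the current generation; returns
    // the claimed offset, or -1 if the buffer is exhausted or was
    // refilled concurrently (CAS fails against a stale generation).
    int tryClaim(int limit, int chunk) {
        long cur = genPos.get();
        int pos = (int) cur;
        if (pos + chunk > limit) return -1;
        long next = (cur & 0xFFFF_FFFF_0000_0000L) | (pos + chunk);
        return genPos.compareAndSet(cur, next) ? pos : -1;
    }

    // Refill bumps the generation and resets the position to zero.
    void refill() {
        long gen = (genPos.get() >>> 32) + 1;
        genPos.set(gen << 32);
    }
}
```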

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206686499


Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access

2023-05-25 Thread Andrei Pangin
On Wed, 24 May 2023 19:36:44 GMT, Aleksey Shipilev  wrote:

> UUID is a very important class, used to track identities of objects 
> in large-scale systems. On some of our systems, `UUID.randomUUID` takes >1% 
> of total CPU time, and is frequently a scalability bottleneck due to 
> `SecureRandom` synchronization.
> 
> The major issue in the UUID code itself is that it reads from the single 
> `SecureRandom` instance in 16-byte chunks, so the heavily contended 
> `SecureRandom` is bashed with very small requests. This also has a chilling 
> effect on other users of `SecureRandom` under heavy UUID generation traffic.
> 
> We can improve this by doing bulk reads from the backing `SecureRandom` and 
> possibly striping the reads across many instances of it. 
> 
> 
> Benchmark   Mode  Cnt  Score   Error   Units
> 
> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
> 
> # Before
> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
> UUIDRandomBench.max thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
> 
> # After
> UUIDRandomBench.single  thrpt   15  4.421 ± 0.047  ops/us 
> UUIDRandomBench.max thrpt   15  6.658 ± 0.092  ops/us ; positive scaling, 
> ~1.5x
> 
> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
> 
> # Before
> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
> UUIDRandomBench.max thrpt   15  1.880 ± 0.029  ops/us  ; negative scaling 
> 
> # After
> Benchmark   Mode  Cnt  Score   Error   Units
> UUIDRandomBench.single  thrpt   15  3.099 ± 0.022  ops/us
> UUIDRandomBench.max thrpt   15  3.555 ± 0.062  ops/us  ; positive 
> scaling, ~1.2x
> 
> 
> Note that there is still a scalability bottleneck in current default random 
> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 
> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure 
> out the synchronization story there. The scalability fix in current default 
> `SecureRandom` would be much more intrusive and risky, since it would change 
> a core crypto class with unknown bug fanout.
> 
> Using the bulk reads even when the underlying PRNG is heavily synchronized is 
> still a win. A more scalable PRNG would benefit from this as well. This PR 
> adds a system property to select the PRNG implementation, and there we can 
> clearly see the benefit with more scalable PRNG sources:
> 
> 
> Benchmark   Mode  Cnt   Score   Error   Units
> 
> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
> 
> # Before, hacked `new SecureRandom()` to 
> `SecureRandom.getInstance("SHA1PRNG")`
> UUIDRandomBench.single  thrpt   15  3.661 ± 0.008  ops/us
> UUIDRandomBench...

src/java.base/share/classes/java/util/UUID.java line 224:

> 222: // Try to pull the UUID from the current buffer at 
> current position.
> 223: if (stamp != 0) {
> 224: int p = (int)VH_POS.getAndAdd(this, UUID_CHUNK);

An atomic update together with an optimistic lock looks non-idiomatic use of 
StampedLock to me. Won't a simple CAS loop be simpler? E.g. in pseudocode:

while (((p = this.pos) + 16) < buf.length) {
long msb = getLong(buf, p);
long lsb = getLong(buf, p + 8);
if (cas(this.pos, p, p + 16)) {
return new UUID(msb, lsb);
}
}

// refill buffer under lock

src/java.base/share/classes/java/util/UUID.java line 226:

> 224: int p = (int)VH_POS.getAndAdd(this, UUID_CHUNK);
> 225: if (p < BUF_SIZE) {
> 226: uuid = new UUID(buf, p);

We can read msb/lsb from the buffer here and move object allocation outside the 
lock to reduce the length of the critical section.
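Roughly what I have in mind, as a self-contained sketch (hypothetical names, version/variant bit setting omitted, not the PR's code): read the two longs while the optimistic stamp is live, validate, and only then allocate the UUID.

```java
import java.security.SecureRandom;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.StampedLock;

public class BufferedUuids {
    private static final int BUF_SIZE = 16 * 64;
    private final StampedLock lock = new StampedLock();
    private final AtomicInteger pos = new AtomicInteger(BUF_SIZE); // start exhausted
    private final byte[] buf = new byte[BUF_SIZE];
    private final SecureRandom random = new SecureRandom();

    public UUID next() {
        long stamp = lock.tryOptimisticRead();
        if (stamp != 0) {
            int p = pos.getAndAdd(16);
            if (p + 16 <= BUF_SIZE) {
                long msb = readLong(buf, p);
                long lsb = readLong(buf, p + 8);
                if (lock.validate(stamp)) {
                    return new UUID(msb, lsb); // allocated outside the lock
                }
            }
        }
        long ws = lock.writeLock();
        try {
            random.nextBytes(buf); // refill in bulk
            pos.set(16);           // this thread takes the first chunk
            return new UUID(readLong(buf, 0), readLong(buf, 8));
        } finally {
            lock.unlockWrite(ws);
        }
    }

    private static long readLong(byte[] b, int off) {
        long v = 0;
        for (int i = 0; i < 8; i++) v = (v << 8) | (b[off + i] & 0xFF);
        return v;
    }
}
```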

src/java.base/share/classes/java/util/UUID.java line 260:

> 258: buf[c + 8] &= 0x3f;  /* clear variant*/
> 259: buf[c + 8] |= (byte) 0x80;  /* set to IETF 
> variant  */
> 260: }

I'm not sure I understand the point about initialization. Why not just set the 
required version bits right in the UUID constructor, without updating the 
array? That would be faster and simpler IMO.
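Something like this sketch (not the PR's code; masks per RFC 4122): apply the version/variant masks directly to the two longs pulled from the buffer, rather than patching bytes in the shared array.

```java
import java.util.UUID;

public class FromLocals {
    static UUID fromRandom(long msb, long lsb) {
        msb = (msb & ~0xF000L) | 0x4000L;                              // version 4
        lsb = (lsb & 0x3FFF_FFFF_FFFF_FFFFL) | 0x8000_0000_0000_0000L; // IETF variant
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID u = fromRandom(-1L, -1L);
        System.out.println(u.version() + " " + u.variant()); // prints "4 2"
    }
}
```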

src/java.base/share/classes/java/util/UUID.java line 286:

> 284: long lsb = 0;
> 285: for (int i = start; i < start + 8; i++) {
> 286: msb = (msb << 8) | (data[i] & 0xff);

Can we use VarHandle/ByteBuffer to read 64 bits at a time?
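For example, a sketch with a byte-array view VarHandle (my names, not the PR's): plain `get` on a byte-array view is allowed at any offset, so the byte-by-byte shift loop becomes a single 64-bit read.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class LongReader {
    // Big-endian long view over a byte[], created once.
    private static final VarHandle LONG_VIEW =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.BIG_ENDIAN);

    static long readLong(byte[] data, int start) {
        return (long) LONG_VIEW.get(data, start); // one 64-bit read
    }

    public static void main(String[] args) {
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7};
        System.out.printf("%016x%n", readLong(data, 0)); // prints 0001020304050607
    }
}
```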

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206104940
PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206105877
PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206096052
PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206097261


Re: RFR: 8306075: Micro-optimize Enum.hashCode [v6]

2023-04-19 Thread Andrei Pangin
On Mon, 17 Apr 2023 16:42:38 GMT, olivergillespie  wrote:

>> Improve the speed of Enum.hashCode by caching the identity hashcode on first 
>> use. I've seen an application where Enum.hashCode is a hot path, and this is 
>> fairly simple speedup. The memory overhead is low; in enums with no extra 
>> fields there is already a 4-byte space due to alignment so this new field 
>> can slot in 'for free'. In other cases, the singleton nature of enum values 
>> means that the number of total instances is typically very low, so a small 
>> per-instance overhead is not a concern.
>> 
>> Please see more discussion/explanation in the [original enhancement 
>> request](https://bugs.openjdk.org/browse/JDK-8306075). [Benchmark results 
>> and generated code analysis moved to comment]
>> 
>> Thanks @shipilev for help with the implementation and interpreting the 
>> generated code.
>
> olivergillespie has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Shuffle docs

Marked as reviewed by apangin (no project role).

-

PR Review: https://git.openjdk.org/jdk/pull/13491#pullrequestreview-1392235226


Re: RFR: 8306075: Micro-optimize Enum.hashCode [v3]

2023-04-17 Thread Andrei Pangin
On Mon, 17 Apr 2023 13:27:43 GMT, olivergillespie  wrote:

>> Improve the speed of Enum.hashCode by caching the identity hashcode on first 
>> use. I've seen an application where Enum.hashCode is a hot path, and this is 
>> fairly simple speedup. The memory overhead is low; in enums with no extra 
>> fields there is already a 4-byte space due to alignment so this new field 
>> can slot in 'for free'. In other cases, the singleton nature of enum values 
>> means that the number of total instances is typically very low, so a small 
>> per-instance overhead is not a concern.
>> 
>> Please see more discussion/explanation in the [original enhancement 
>> request](https://bugs.openjdk.org/browse/JDK-8306075).
>> 
>> ### Benchmark
>> 
>> 
>> 
>> Before:
>> 
>> Benchmark  Mode  Cnt  Score   Error  Units
>> # Intel Cascade lake
>> EnumHashCode.constant  avgt   15  1.602 ± 0.011  ns/op
>> EnumHashCode.field avgt   15  1.681 ± 0.014  ns/op
>> # Arm Neoverse N1
>> EnumHashCode.constant  avgt   15  1.642 ± 0.033  ns/op
>> EnumHashCode.field avgt   15  1.717 ± 0.059  ns/op
>> 
>> 
>> 
>> After:
>> 
>> Benchmark  Mode  Cnt  Score   Error  Units
>> # Intel Cascade lake
>> EnumHashCode.constant  avgt   15  0.479 ± 0.001  ns/op
>> EnumHashCode.field avgt   15  0.799 ± 0.002  ns/op
>> # Arm Neoverse N1
>> EnumHashCode.constant  avgt   15  0.802 ± 0.002  ns/op
>> EnumHashCode.field avgt   15  1.059 ± 0.056  ns/op
>> 
>> 
>> Using `-prof perfasm` on the benchmark, we can compare the generated code 
>> for x86_64:
>> 
>> Before:
>> 
>> │ 0x7fae4868dd17:   lea(%r12,%r10,8),%rsi   ;*getfield e 
>> {reexecute=0 rethrow=0 return_oop=0}
>> │   ; - 
>> org.sample.EnumHashCode::field@1 (line 24)
>> │   ; - 
>> org.sample.jmh_generated.EnumHashCode_field_jmhTest::field_avgt_jmhStub@17 
>> (line 186)
>> │ 0x7fae4868dd1b:   mov(%rsi),%r10
>> │ 0x7fae4868dd1e:   mov%r10,%r11
>> │ 0x7fae4868dd21:   and$0x3,%r11
>> │ 0x7fae4868dd25:   cmp$0x1,%r11
>> │ 0x7fae4868dd29:   jne0x7fae4868dcc6
>> │ 0x7fae4868dd2b:   shr$0x8,%r10
>> │ 0x7fae4868dd2f:   mov%r10d,%eax
>> │ 0x7fae4868dd32:   and$0x7fff,%eax
>> │ 0x7fae4868dd37:   test   %eax,%eax
>> │ 0x7fae4868dd39:   je 0x7fae4868dcc6   ;*invokespecial 
>> hashCode {reexecute=0 rethrow=0 return_oop=0}
>> │   ; - 
>> java.lang.Enum::hashCode@1 (line 175)
>> 
>> 
>> This is the normal Object.hashCode intrinsic, which involves reading the 
>> object header, extracting the hash code and handling two slow-path cases 
>> (displaced object header, hash not initialized).
>> 
>> After:
>> 
>> 
>>   │  0x7f550068e3b4:   mov0x10(%r12,%r10,8),%r8d  <-- read the hash 
>> field
>>   │  0x7f550068e3b9:   test   %r8d,%r8d   <-- if (hash == 0)
>>   │  0x7f550068e3bc:   je 0x7f550068e413  <-- slow init 
>> path, only taken on first use
>> 
>> 
>> Thanks @shipilev for help with the implementation and interpreting the 
>> generated code.
>
> olivergillespie has refreshed the contents of this pull request, and previous 
> commits have been removed. Incremental views are not available. The pull 
> request now contains one commit:
> 
>   8306075: Micro-optimize Enum.hashCode

src/java.base/share/classes/java/lang/Enum.java line 181:

> 179:  */
> 180: @Stable
> 181: private int hash;

Shouldn't we hide the field from reflection?

I understand this is an implementation detail; however, it makes a big 
difference that you are adding a private field to a *public* class that is 
*meant* to be inherited by user classes. Although I'm not aware of a particular 
example, this can be a breaking change for user code that does introspection on 
enums.
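For example (a hypothetical illustration; `Color` is my own enum): generic code that walks the class hierarchy with `getDeclaredFields()` could start seeing the extra field alongside `name` and `ordinal`.

```java
import java.lang.reflect.Field;

enum Color { RED, GREEN }

public class EnumIntrospection {
    public static void main(String[] args) {
        // User code walking the hierarchy also sees java.lang.Enum's
        // declared fields; a new private field would appear in this list.
        for (Field f : Color.class.getSuperclass().getDeclaredFields()) {
            System.out.println(f.getName());
        }
    }
}
```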

-

PR Review Comment: https://git.openjdk.org/jdk/pull/13491#discussion_r1168724994