Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v7]

2023-06-01 Thread Aleksey Shipilev
> UUID is the very important class that is used to track identities of objects 
> in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% 
> of total CPU time, and is frequently a scalability bottleneck due to 
> `SecureRandom` synchronization.
> 
> The major issue with UUID code itself is that it reads from the single 
> `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` 
> is bashed with very small requests. This also has a chilling effect on other 
> users of `SecureRandom`, when there is a heavy UUID generation traffic.
> 
> We can improve this by doing the bulk reads from the backing SecureRandom and 
> possibly striping the reads across many instances of it. 
> 
> 
> Benchmark   Mode  Cnt  Score   Error   Units
> 
> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
> 
> # Before
> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
> UUIDRandomBench.max thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
> 
> # After
> UUIDRandomBench.single  thrpt   15  4.823 ± 0.023  ops/us 
> UUIDRandomBench.max thrpt   15  6.561 ± 0.054  ops/us ; positive scaling, 
> ~1.5x
> 
> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
> 
> # Before
> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
> UUIDRandomBench.max thrpt   15  1.880 ± 0.029  ops/us  ; negative scaling 
> 
> # After
> BenchmarkMode  Cnt  Score   Error   Units
> UUIDRandomBench.single  thrpt   15  3.109 ± 0.026  ops/us
> UUIDRandomBench.max thrpt   15  3.561 ± 0.071  ops/us  ; positive 
> scaling, ~1.2x
> 
> 
> Note that there is still a scalability bottleneck in current default random 
> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 
> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure 
> out the synchronization story there. The scalability fix in current default 
> `SecureRandom` would be much more intrusive and risky, since it would change 
> a core crypto class with unknown bug fanout.
> 
> Using the bulk reads even when the underlying PRNG is heavily synchronized is 
> still a win. A more scalable PRNG would benefit from this as well. This PR 
> adds a system property to select the PRNG implementation, and there we can 
> clearly see the benefit with more scalable PRNG sources:
> 
> 
> Benchmark   Mode  Cnt   Score   Error   Units
> 
> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
> 
> # Before, hacked `new SecureRandom()` to 
> `SecureRandom.getInstance("SHA1PRNG")`
> UUIDRandomBench.single  thrpt   15  3.661 ± 0.008  ops/us
> UUIDRandomBench...

Aleksey Shipilev has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 13 commits:

 - Revert test changes
 - Merge branch 'master' into JDK-8308804-uuid-buffers
 - Simplify clinit
 - Remove the properties
 - Swap lsb/msb
 - Fine-tune exceptions
 - Handle privileged properties
 - Use ByteArray to convert. Do version/variant preparations straight on 
locals. Move init out of optimistic lock section.
 - More touchups
 - Comment updates
 - ... and 3 more: https://git.openjdk.org/jdk/compare/4460429d...fd7eaa1a

-

Changes: https://git.openjdk.org/jdk/pull/14135/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14135&range=06
  Stats: 197 lines in 2 files changed: 186 ins; 8 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/14135.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/14135/head:pull/14135

PR: https://git.openjdk.org/jdk/pull/14135


Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v7]

2023-06-01 Thread Brett Okken
On Thu, 1 Jun 2023 09:37:29 GMT, Aleksey Shipilev  wrote:

>> UUID is the very important class that is used to track identities of objects 
>> in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% 
>> of total CPU time, and is frequently a scalability bottleneck due to 
>> `SecureRandom` synchronization.
>> 
>> The major issue with UUID code itself is that it reads from the single 
>> `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` 
>> is bashed with very small requests. This also has a chilling effect on other 
>> users of `SecureRandom`, when there is a heavy UUID generation traffic.
>> 
>> We can improve this by doing the bulk reads from the backing SecureRandom 
>> and possibly striping the reads across many instances of it. 
>> 
>> 
>> Benchmark   Mode  Cnt  Score   Error   Units
>> 
>> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
>> UUIDRandomBench.max thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
>> 
>> # After
>> UUIDRandomBench.single  thrpt   15  4.823 ± 0.023  ops/us 
>> UUIDRandomBench.max thrpt   15  6.561 ± 0.054  ops/us ; positive 
>> scaling, ~1.5x
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
>> UUIDRandomBench.max thrpt   15  1.880 ± 0.029  ops/us  ; negative 
>> scaling 
>> 
>> # After
>> BenchmarkMode  Cnt  Score   Error   Units
>> UUIDRandomBench.single  thrpt   15  3.109 ± 0.026  ops/us
>> UUIDRandomBench.max thrpt   15  3.561 ± 0.071  ops/us  ; positive 
>> scaling, ~1.2x
>> 
>> 
>> Note that there is still a scalability bottleneck in current default random 
>> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 
>> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure 
>> out the synchronization story there. The scalability fix in current default 
>> `SecureRandom` would be much more intrusive and risky, since it would change 
>> a core crypto class with unknown bug fanout.
>> 
>> Using the bulk reads even when the underlying PRNG is heavily synchronized 
>> is still a win. A more scalable PRNG would benefit from this as well. This 
>> PR adds a system property to select the PRNG implementation, and there we 
>> can clearly see the benefit with more scalable PRNG sources:
>> 
>> 
>> Benchmark   Mode  Cnt   Score   Error   Units
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before, hacked `new SecureRandom()` to 
>> `SecureRandom.getInstance("SHA1PRNG")`
>> UUIDRandomBench.single  thrpt  ...
>
> Aleksey Shipilev has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 13 commits:
> 
>  - Revert test changes
>  - Merge branch 'master' into JDK-8308804-uuid-buffers
>  - Simplify clinit
>  - Remove the properties
>  - Swap lsb/msb
>  - Fine-tune exceptions
>  - Handle privileged properties
>  - Use ByteArray to convert. Do version/variant preparations straight on 
> locals. Move init out of optimistic lock section.
>  - More touchups
>  - Comment updates
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/4460429d...fd7eaa1a

src/java.base/share/classes/java/util/UUID.java line 234:

> 232: long msb = ByteArray.getLong(buf, 0);
> 233: long lsb = ByteArray.getLong(buf, 8);
> 234: return fromRandom(msb, lsb);

is there any value in moving this outside of the critical section?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1213117598


Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v7]

2023-06-01 Thread Aleksey Shipilev
On Thu, 1 Jun 2023 12:59:14 GMT, Brett Okken  wrote:

>> Aleksey Shipilev has updated the pull request with a new target base due to 
>> a merge or a rebase. The pull request now contains 13 commits:
>> 
>>  - Revert test changes
>>  - Merge branch 'master' into JDK-8308804-uuid-buffers
>>  - Simplify clinit
>>  - Remove the properties
>>  - Swap lsb/msb
>>  - Fine-tune exceptions
>>  - Handle privileged properties
>>  - Use ByteArray to convert. Do version/variant preparations straight on 
>> locals. Move init out of optimistic lock section.
>>  - More touchups
>>  - Comment updates
>>  - ... and 3 more: https://git.openjdk.org/jdk/compare/4460429d...fd7eaa1a
>
> src/java.base/share/classes/java/util/UUID.java line 234:
> 
>> 232: long msb = ByteArray.getLong(buf, 0);
>> 233: long lsb = ByteArray.getLong(buf, 8);
>> 234: return fromRandom(msb, lsb);
> 
> is there any value in moving this outside of the critical section?

I have tried it before, and it gives no benefit, but significantly complicates 
the code. This is the part under write lock, where we have spent a significant 
time refilling the buffer. The little improvement we might get here is drowning 
in those costs.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1213121408


Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v7]

2023-06-01 Thread Andrei Pangin
On Thu, 1 Jun 2023 09:37:29 GMT, Aleksey Shipilev  wrote:

>> UUID is the very important class that is used to track identities of objects 
>> in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% 
>> of total CPU time, and is frequently a scalability bottleneck due to 
>> `SecureRandom` synchronization.
>> 
>> The major issue with UUID code itself is that it reads from the single 
>> `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` 
>> is bashed with very small requests. This also has a chilling effect on other 
>> users of `SecureRandom`, when there is a heavy UUID generation traffic.
>> 
>> We can improve this by doing the bulk reads from the backing SecureRandom 
>> and possibly striping the reads across many instances of it. 
>> 
>> 
>> Benchmark   Mode  Cnt  Score   Error   Units
>> 
>> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
>> UUIDRandomBench.max thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
>> 
>> # After
>> UUIDRandomBench.single  thrpt   15  4.823 ± 0.023  ops/us 
>> UUIDRandomBench.max thrpt   15  6.561 ± 0.054  ops/us ; positive 
>> scaling, ~1.5x
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
>> UUIDRandomBench.max thrpt   15  1.880 ± 0.029  ops/us  ; negative 
>> scaling 
>> 
>> # After
>> BenchmarkMode  Cnt  Score   Error   Units
>> UUIDRandomBench.single  thrpt   15  3.109 ± 0.026  ops/us
>> UUIDRandomBench.max thrpt   15  3.561 ± 0.071  ops/us  ; positive 
>> scaling, ~1.2x
>> 
>> 
>> Note that there is still a scalability bottleneck in current default random 
>> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 
>> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure 
>> out the synchronization story there. The scalability fix in current default 
>> `SecureRandom` would be much more intrusive and risky, since it would change 
>> a core crypto class with unknown bug fanout.
>> 
>> Using the bulk reads even when the underlying PRNG is heavily synchronized 
>> is still a win. A more scalable PRNG would benefit from this as well. This 
>> PR adds a system property to select the PRNG implementation, and there we 
>> can clearly see the benefit with more scalable PRNG sources:
>> 
>> 
>> Benchmark   Mode  Cnt   Score   Error   Units
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before, hacked `new SecureRandom()` to 
>> `SecureRandom.getInstance("SHA1PRNG")`
>> UUIDRandomBench.single  thrpt  ...
>
> Aleksey Shipilev has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 13 commits:
> 
>  - Revert test changes
>  - Merge branch 'master' into JDK-8308804-uuid-buffers
>  - Simplify clinit
>  - Remove the properties
>  - Swap lsb/msb
>  - Fine-tune exceptions
>  - Handle privileged properties
>  - Use ByteArray to convert. Do version/variant preparations straight on 
> locals. Move init out of optimistic lock section.
>  - More touchups
>  - Comment updates
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/4460429d...fd7eaa1a

src/java.base/share/classes/java/util/UUID.java line 181:

> 179: 
> 180: private static UUID fromRandom(long msb, long lsb) {
> 181: // set version to 3

The comment doesn't match the code. Is it version 4?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1213472422


Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access [v7]

2023-09-04 Thread David Schlosnagle
On Thu, 1 Jun 2023 09:37:29 GMT, Aleksey Shipilev  wrote:

>> UUID is the very important class that is used to track identities of objects 
>> in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% 
>> of total CPU time, and is frequently a scalability bottleneck due to 
>> `SecureRandom` synchronization.
>> 
>> The major issue with UUID code itself is that it reads from the single 
>> `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` 
>> is bashed with very small requests. This also has a chilling effect on other 
>> users of `SecureRandom`, when there is a heavy UUID generation traffic.
>> 
>> We can improve this by doing the bulk reads from the backing SecureRandom 
>> and possibly striping the reads across many instances of it. 
>> 
>> 
>> Benchmark   Mode  Cnt  Score   Error   Units
>> 
>> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
>> UUIDRandomBench.max thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
>> 
>> # After
>> UUIDRandomBench.single  thrpt   15  4.823 ± 0.023  ops/us 
>> UUIDRandomBench.max thrpt   15  6.561 ± 0.054  ops/us ; positive 
>> scaling, ~1.5x
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
>> UUIDRandomBench.max thrpt   15  1.880 ± 0.029  ops/us  ; negative 
>> scaling 
>> 
>> # After
>> BenchmarkMode  Cnt  Score   Error   Units
>> UUIDRandomBench.single  thrpt   15  3.109 ± 0.026  ops/us
>> UUIDRandomBench.max thrpt   15  3.561 ± 0.071  ops/us  ; positive 
>> scaling, ~1.2x
>> 
>> 
>> Note that there is still a scalability bottleneck in current default random 
>> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 
>> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure 
>> out the synchronization story there. The scalability fix in current default 
>> `SecureRandom` would be much more intrusive and risky, since it would change 
>> a core crypto class with unknown bug fanout.
>> 
>> Using the bulk reads even when the underlying PRNG is heavily synchronized 
>> is still a win. A more scalable PRNG would benefit from this as well. This 
>> PR adds a system property to select the PRNG implementation, and there we 
>> can clearly see the benefit with more scalable PRNG sources:
>> 
>> 
>> Benchmark   Mode  Cnt   Score   Error   Units
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before, hacked `new SecureRandom()` to 
>> `SecureRandom.getInstance("SHA1PRNG")`
>> UUIDRandomBench.single  thrpt  ...
>
> Aleksey Shipilev has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 13 commits:
> 
>  - Revert test changes
>  - Merge branch 'master' into JDK-8308804-uuid-buffers
>  - Simplify clinit
>  - Remove the properties
>  - Swap lsb/msb
>  - Fine-tune exceptions
>  - Handle privileged properties
>  - Use ByteArray to convert. Do version/variant preparations straight on 
> locals. Move init out of optimistic lock section.
>  - More touchups
>  - Comment updates
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/4460429d...fd7eaa1a

I see this PR was closed out as stale and not integrated. I am curious if there 
is anything I could do to help push this over the line. As highlighted by 
@shipilev in the PR description, UUID generation is often a critical component 
of some systems and contention on SecureRandom can become a scalability 
bottleneck that I would love to reduce.

Thanks!

-

PR Comment: https://git.openjdk.org/jdk/pull/14135#issuecomment-1705192263