Integrated: 8333867: SHA3 performance can be improved

2024-06-21 Thread Ferenc Rakoczi
On Mon, 10 Jun 2024 15:01:55 GMT, Ferenc Rakoczi  wrote:

> This PR removes some unnecessary conversions between byte arrays and long 
> arrays during SHA3 digest computations.

This pull request has now been integrated.

Changeset: 75bea280
Author:Ferenc Rakoczi 
Committer: Weijun Wang 
URL:   
https://git.openjdk.org/jdk/commit/75bea280b9adb6dac9fefafbb3f4b212f100fbb5
Stats: 86 lines in 3 files changed: 24 ins; 35 del; 27 mod

8333867: SHA3 performance can be improved

Reviewed-by: kvn, valeriep

-

PR: https://git.openjdk.org/jdk/pull/19632


Re: RFR: 8333867: SHA3 performance can be improved [v4]

2024-06-18 Thread Ferenc Rakoczi
> This PR removes some unnecessary conversions between byte arrays and long 
> arrays during SHA3 digest computations.

Ferenc Rakoczi has updated the pull request incrementally with one additional 
commit since the last revision:

  Copyright year update.

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/19632/files
  - new: https://git.openjdk.org/jdk/pull/19632/files/4fcdd09f..2bcd3000

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=19632&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19632&range=02-03

  Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/19632.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/19632/head:pull/19632

PR: https://git.openjdk.org/jdk/pull/19632


Re: RFR: 8333867: SHA3 performance can be improved [v3]

2024-06-14 Thread Ferenc Rakoczi
> This PR removes some unnecessary conversions between byte arrays and long 
> arrays during SHA3 digest computations.

Ferenc Rakoczi has updated the pull request incrementally with one additional 
commit since the last revision:

  Accept more review suggestions

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/19632/files
  - new: https://git.openjdk.org/jdk/pull/19632/files/0a9f9624..4fcdd09f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=19632&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19632&range=01-02

  Stats: 21 lines in 1 file changed: 2 ins; 8 del; 11 mod
  Patch: https://git.openjdk.org/jdk/pull/19632.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/19632/head:pull/19632

PR: https://git.openjdk.org/jdk/pull/19632


Re: RFR: 8333867: SHA3 performance can be improved [v2]

2024-06-14 Thread Ferenc Rakoczi
On Thu, 13 Jun 2024 20:25:22 GMT, Valerie Peng  wrote:

>> Ferenc Rakoczi has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Fix clone(), accept review suggestions.
>
> src/java.base/share/classes/sun/security/provider/SHA3.java line 73:
> 
>> 71: // The following array is allocated to size WIDTH bytes, but we only
>> 72: // ever use the first blockSize bytes it (for bytes <-> long 
>> conversions)
>> 73: private byte[] byteState = new byte[WIDTH];
> 
> Since we are storing the state in longs now, this "byte <-> long" conversion 
> can be made through a local variable, right? Is there a reason for having 
> this `byteState` field with size WIDTH bytes?

This is interesting: if I use WIDTH (or in blockSize) long arrays in the local 
level, the performance drops a few per cents. Even more when I only declare the 
local array in the block in which it is used. However, since we really only 
need 8 bytes, if I allocate that at the beginning of the function, I don't see 
that performance drop. So I rewrote the output loop in the function and got rid 
of the class level declaration.

> src/java.base/share/classes/sun/security/provider/SHA3.java line 121:
> 
>> 119: }
>> 120: implCompress(buffer, 0);
>> 121: int availableBytes = buffer.length;
> 
> change to `blockSize` as in `implCompress0(...)`?

Yes, thanks, I wanted to, just forgot.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/19632#discussion_r1639572500
PR Review Comment: https://git.openjdk.org/jdk/pull/19632#discussion_r1639572413


Re: RFR: 8333867: SHA3 performance can be improved [v2]

2024-06-14 Thread Ferenc Rakoczi
On Fri, 14 Jun 2024 05:56:05 GMT, Andrey Turbanov  wrote:

>> Ferenc Rakoczi has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Fix clone(), accept review suggestions.
>
> src/java.base/share/classes/sun/security/provider/SHA3.java line 152:
> 
>> 150:  */
>> 151: void implReset() {
>> 152: Arrays.fill(state,0L);
> 
> Suggestion:
> 
> Arrays.fill(state, 0L);

Thanks! Fixed.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/19632#discussion_r1639572679


Re: RFR: 8333867: SHA3 performance can be improved [v2]

2024-06-12 Thread Ferenc Rakoczi
On Wed, 12 Jun 2024 14:08:43 GMT, Ferenc Rakoczi  wrote:

>> This PR removes some unnecessary conversions between byte arrays and long 
>> arrays during SHA3 digest computations.
>
> Ferenc Rakoczi has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fix clone(), accept review suggestions.

tier1,2,3 mach5 tests all passed.

-

PR Comment: https://git.openjdk.org/jdk/pull/19632#issuecomment-2163807516


Re: RFR: 8333867: SHA3 performance can be improved [v2]

2024-06-12 Thread Ferenc Rakoczi
On Tue, 11 Jun 2024 18:21:49 GMT, Daniel Jeliński  wrote:

>> okay
>
> Could you try using MethodHandles instead of b2lLittle? Similar to what's 
> used in ChaCha20:
> https://github.com/openjdk/jdk/blob/b77bd5fd6a6f7ddbed90300fba790da4fb683275/src/java.base/share/classes/com/sun/crypto/provider/ChaCha20Cipher.java#L132-L134
> 
> with MethodHandles you could drop longBuf, removing unnecessary allocation on 
> the only platform where the intrinsic is implemented (MacOSX-aarch64)

b2lLittle() and l2bLittle() are using those MethodHandle stuff, but yes, it 
adds some more performance if I use them directly in this class, so I changed 
it.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/19632#discussion_r1637007344


Re: RFR: 8333867: SHA3 performance can be improved [v2]

2024-06-12 Thread Ferenc Rakoczi
> This PR removes some unnecessary conversions between byte arrays and long 
> arrays during SHA3 digest computations.

Ferenc Rakoczi has updated the pull request incrementally with one additional 
commit since the last revision:

  Fix clone(), accept review suggestions.

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/19632/files
  - new: https://git.openjdk.org/jdk/pull/19632/files/873164ad..0a9f9624

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=19632&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19632&range=00-01

  Stats: 34 lines in 1 file changed: 23 ins; 4 del; 7 mod
  Patch: https://git.openjdk.org/jdk/pull/19632.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/19632/head:pull/19632

PR: https://git.openjdk.org/jdk/pull/19632


Re: RFR: 8333867: SHA3 performance can be improved

2024-06-11 Thread Ferenc Rakoczi
On Tue, 11 Jun 2024 18:18:41 GMT, Daniel Jeliński  wrote:

>> This PR removes some unnecessary conversions between byte arrays and long 
>> arrays during SHA3 digest computations.
>
> src/java.base/share/classes/sun/security/provider/SHA3.java line 70:
> 
>> 68: private long[] state = new long[DM*DM];
>> 69: 
>> 70: // The following two arrays are allocated to size WIDTH bytes, but 
>> we only
> 
> Only one of the following arrays is of WIDTH length... outdated comment?

Well, DM*DM*8 is WIDTH (DM=5 and WIDTH=200), so both arrays have 200 bytes, But 
I will update the comment to make it clearer.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/19632#discussion_r1635419497


Re: RFR: 8252204: AArch64: Implement SHA3 accelerator/intrinsic [v11]

2023-08-02 Thread Ferenc Rakoczi
On Wed, 2 Aug 2023 11:02:07 GMT, Yudi Zheng  wrote:

>> Fei Yang has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Add if (isJDK16OrHigher()) check for SHA3 in CheckGraalIntrinsics.java
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 3473:
> 
>> 3471: __ bcax(v24, __ T16B, v24, v8,  v31);
>> 3472: 
>> 3473: __ ld1r(v31, __ T2D, __ post(rscratch1, 8));
> 
> is it intentional to load 16 bytes and post-increment by 8?

Actually, with the ld1r instruction the post increment should be the same as 
the size of the memory accessed. So T2D requires 8 as it reads 8 bytes(and 
duplicates it into both halves of the SIMD register).

-

PR Review Comment: https://git.openjdk.org/jdk/pull/207#discussion_r1281840355