Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code [v3]

2022-02-01 Thread Paul Hohensee
On Tue, 1 Feb 2022 09:52:46 GMT, Boris Ulasevich  wrote:

>> Background
>> 
>> The goal is to improve the performance of the Java SHA3 implementation, which
>> runs up to two times slower than the native implementation (OpenSSL, measured
>> on AMD64 and AArch64). Some hardware provides SHA3 accelerators, but most
>> (AMD64 and most AArch64) do not.
>> 
>> For AArch64 hardware that does support SHA3 accelerators, an intrinsic was 
>> implemented for ARMv8.2 with SHA3 instructions support:
>> - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: 
>> Implement SHA3 accelerator/intrinsic
>> 
>> The SHA3 Java implementation has already been reworked to reduce memory
>> consumption and make it run faster:
>> - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3
>> Hash algorithm performance improvements (~12x speedup)
>> That issue already improved performance, but there is still some room to
>> improve SHA3 performance with the current C2 implementation, which is what
>> this PR proposes.
>> 
>> Option 1 (this PR)
>> 
>> With this change I unroll the internal loops manually, inline them manually,
>> and use local variables (not an array) for data processing. This approach
>> gives the best performance (see benchmark results). Without this change
>> (with the current array-based data processing) we observe many more
>> load/store operations than with processing in local variables, both on
>> AMD64 and on AArch64.
>> Native implementations show that on AArch64 (32 general-purpose registers)
>> the SHA-3 algorithm can hold all 25 data words and all temporary variables
>> in registers. C2 can't optimize it as well because many registers are
>> reserved for internal use: rscratch1, rscratch2, rmethod, rthread, etc.
>> With data in local variables the number of excess loads/stores is much
>> smaller and performance is much better.
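
An illustrative sketch, not taken from the PR: the contrast between array-based and local-variable processing described above can be shown on a single five-lane row of the Keccak chi step. The class and method names are hypothetical, and the actual change applies the idea to the full 25-lane state rather than to one row.

```java
import java.util.Arrays;

final class ChiRowSketch {
    // Array-based form: every lane is read from and written back to an array,
    // which tends to show up as memory loads/stores in the compiled code.
    static void chiRowArray(long[] a) {
        long[] t = new long[5];
        for (int i = 0; i < 5; i++) {
            t[i] = a[i] ^ (~a[(i + 1) % 5] & a[(i + 2) % 5]);
        }
        System.arraycopy(t, 0, a, 0, 5);
    }

    // Local-variable form: lanes are loaded once, combined in locals that the
    // JIT can try to keep in registers, and stored once at the end.
    static void chiRowLocals(long[] a) {
        long a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3], a4 = a[4];
        long b0 = a0 ^ (~a1 & a2);
        long b1 = a1 ^ (~a2 & a3);
        long b2 = a2 ^ (~a3 & a4);
        long b3 = a3 ^ (~a4 & a0);
        long b4 = a4 ^ (~a0 & a1);
        a[0] = b0; a[1] = b1; a[2] = b2; a[3] = b3; a[4] = b4;
    }

    public static void main(String[] args) {
        long[] x = {0x0123456789ABCDEFL, -1L, 42L, 0L, 0x5555555555555555L};
        long[] y = x.clone();
        chiRowArray(x);
        chiRowLocals(y);
        System.out.println("same result: " + Arrays.equals(x, y));
    }
}
```

Both methods compute the same values; only the shape of the code handed to the JIT differs.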
>> 
>> Option 2 (alternative)
>> 
>> LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db)
>> 
>> This is a more conservative change which minimizes code changes, but it
>> gives a smaller performance improvement. Please let me know if you think
>> this change is better than the main one: in that case I will use it in this
>> Pull Request instead.
>>
>> With this change I unroll the internal loops manually and use the
>> @ForceInline annotation. Manual unrolling is necessary because C2 does not
>> recognize that there is a counted loop that can be completely unrolled.
>> Instead of replacing the loop with five loop bodies, C2 splits the loop
>> into pre-, main- and post-loops, which is not good for this case. C2 works
>> better when the array is created locally, but in our case the array is
>> created on object instantiation, so C2 can't prove the array length is
>> constant. The second issue affecting performance is that methods with
>> unrolled loops get larger and C2 does not inline them. This is addressed
>> here by using the @ForceInline annotation.
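
A minimal sketch of the manual unrolling described above, under stated assumptions: the class, field and method names are hypothetical, and the column-parity computation is only an example of a five-iteration loop over an array held in an instance field. `jdk.internal.vm.annotation.ForceInline` is a JDK-internal annotation that ordinary application code cannot normally use, so it appears only in a comment here.

```java
import java.util.Arrays;

final class ThetaParitySketch {
    // The array lives in an instance field and is created at object instantiation.
    private final long[] state = new long[25];

    ThetaParitySketch(long[] init) {
        System.arraycopy(init, 0, state, 0, 25);
    }

    // Loop form: a five-iteration counted loop over the field array.
    long[] parityLoop() {
        long[] c = new long[5];
        for (int x = 0; x < 5; x++) {
            c[x] = state[x] ^ state[x + 5] ^ state[x + 10] ^ state[x + 15] ^ state[x + 20];
        }
        return c;
    }

    // Manually unrolled form: the five loop bodies are written out by hand.
    // Inside java.base a method like this could additionally be annotated
    // with @ForceInline (assumption: not applied in this standalone sketch).
    long[] parityUnrolled() {
        long[] c = new long[5];
        c[0] = state[0] ^ state[5] ^ state[10] ^ state[15] ^ state[20];
        c[1] = state[1] ^ state[6] ^ state[11] ^ state[16] ^ state[21];
        c[2] = state[2] ^ state[7] ^ state[12] ^ state[17] ^ state[22];
        c[3] = state[3] ^ state[8] ^ state[13] ^ state[18] ^ state[23];
        c[4] = state[4] ^ state[9] ^ state[14] ^ state[19] ^ state[24];
        return c;
    }

    public static void main(String[] args) {
        long[] init = new long[25];
        for (int i = 0; i < 25; i++) {
            init[i] = (i + 1) * 0x9E3779B97F4A7C15L;
        }
        ThetaParitySketch s = new ThetaParitySketch(init);
        System.out.println("same result: " + Arrays.equals(s.parityLoop(), s.parityUnrolled()));
    }
}
```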
>> 
>> Benchmark results
>> 
>> We measured the change on four platforms: ARM32, PPC, AArch64 (with no
>> advanced SHA-3 instructions) and AMD64, using the
>> [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java)
>> benchmark. Here is the result:
>> http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html
>> - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on 
>> ARM32/PPC/AArch64/AMD64
>> - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on 
>> ARM32/PPC/AArch64/AMD64
>> 
>> Testing
>> 
>> Tested with JTREG and SHA-3 Project (NIST) test vectors: I ran the SHA3
>> implementation over the same vectors before and after the change, and
>> checked that the results are the same.
>> The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java
>> The test vectors were compiled from the [Final Algorithm
>> Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project):
>> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt
>> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt
>> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt
>> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt
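
A rough sketch of a known-answer check of the kind described above. It does not reproduce RunKat.java or parse the MsgKAT files; the class and method names are hypothetical, and the two expected digests are the well-known SHA3-256 values for the empty message and for "abc". It relies only on the standard `java.security.MessageDigest` API with the `SHA3-256` algorithm name (available since JDK 9).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha3KatSketch {
    // Hashes the message with the named algorithm and returns lowercase hex.
    static String sha3Hex(String algorithm, byte[] message) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(message);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(Character.forDigit((b >> 4) & 0xF, 16));
                sb.append(Character.forDigit(b & 0xF, 16));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(algorithm + " is not available", e);
        }
    }

    // Compares the computed digest with the expected known answer.
    static boolean check(String algorithm, byte[] message, String expectedHex) {
        return sha3Hex(algorithm, message).equalsIgnoreCase(expectedHex);
    }

    public static void main(String[] args) {
        boolean empty = check("SHA3-256", new byte[0],
                "a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a");
        boolean abc = check("SHA3-256", "abc".getBytes(StandardCharsets.US_ASCII),
                "3a985da74fe225b2045c172d6bd390bd855f086e3e9d525b46bfe24511431532");
        System.out.println((empty && abc) ? "PASS" : "FAIL");
    }
}
```

Running the same check on builds from before and after a change gives a quick signal that the digests are unchanged for the vectors covered.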
>
> Boris Ulasevich has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   copyright fix

Lgtm as well.

-

Marked as reviewed by phh (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6847


Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code [v3]

2022-02-01 Thread Anthony Scarpino
On Tue, 1 Feb 2022 09:52:46 GMT, Boris Ulasevich  wrote:

>
> Boris Ulasevich has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   copyright fix

Looks good to me

-

Marked as reviewed by ascarpino (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6847


Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code [v3]

2022-02-01 Thread Boris Ulasevich

Boris Ulasevich has updated the pull request incrementally with one additional 
commit since the last revision:

  copyright fix

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6847/files
  - new: https://git.openjdk.java.net/jdk/pull/6847/files/4d15fea4..de07332b

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6847&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6847&range=01-02

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6847.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6847/head:pull/6847

PR: https://git.openjdk.java.net/jdk/pull/6847


Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code [v2]

2022-02-01 Thread Boris Ulasevich

Boris Ulasevich has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains one additional commit 
since the last revision:

  Apply manual inline and unroll, plus moving data into local variables to help 
C2 with SHA3 impl

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6847/files
  - new: https://git.openjdk.java.net/jdk/pull/6847/files/dd889460..4d15fea4

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6847&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6847&range=00-01

  Stats: 46135 lines in 1581 files changed: 29727 ins; 9682 del; 6726 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6847.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6847/head:pull/6847

PR: https://git.openjdk.java.net/jdk/pull/6847


Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

2022-01-31 Thread Anthony Scarpino
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich  
wrote:

> [...]

Please update the copyright date from 2020 to 2022.
I'll run regression tests to make sure things are ok.

-

PR: https://git.openjdk.java.net/jdk/pull/6847


Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

2022-01-29 Thread Anthony Scarpino
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich  
wrote:

> [...]

Ah yes.  Sorry, I saw the replied email that left that out

-

PR: https://git.openjdk.java.net/jdk/pull/6847


Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

2022-01-27 Thread Boris Ulasevich
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich  
wrote:

> [...]

Hi,

Thanks for looking at this!

> Have you run the tests to make sure it passes the Known Answer Tests?

In my original post I included the test suite and KAT. I did not find a 
ready-to-use one, so I wrote it from scratch using the test vectors provided in 
the KECCAK Final Algorithm Package.
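
For readers unfamiliar with known-answer testing, the idea can be sketched as below. This is not the RunKat tool linked above. It is a minimal illustration that assumes a JDK 9+ runtime (where `MessageDigest` offers "SHA3-256") and uses two widely published SHA3-256 digests rather than the MsgKAT files. The class name `Sha3KatSketch` and its helpers are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of a known-answer test; not the RunKat tool from the thread.
final class Sha3KatSketch {
    // {message, expected SHA3-256 digest in lowercase hex}
    static final String[][] VECTORS = {
        {"",    "a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a"},
        {"abc", "3a985da74fe225b2045c172d6bd390bd855f086e3e9d525b46bfe24511431532"},
    };

    // Hashes msg with the platform SHA3-256 provider and returns lowercase hex.
    static String sha3_256Hex(byte[] msg) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA3-256").digest(msg);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(Character.forDigit((b >> 4) & 0xf, 16));
                sb.append(Character.forDigit(b & 0xf, 16));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA3-256 needs JDK 9 or later", e);
        }
    }

    // True only if every vector produces its expected digest.
    static boolean allVectorsPass() {
        for (String[] v : VECTORS) {
            byte[] msg = v[0].getBytes(StandardCharsets.US_ASCII);
            if (!sha3_256Hex(msg).equals(v[1])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("all vectors pass: " + allVectorsPass());
    }
}
```

Running the same check on builds with and without the patch gives a before/after comparison in the spirit of the testing described in the original post.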

> Can you also include the before and after JMH benchmarks for SHA3 in the PR? 
> They are under test/micro/org/openjdk/bench/java/security/MessageDigests

Before/after MessageDigests benchmark results for option 1 and option 2 on 
ARM32/PPC/AArch64/AMD64 are here: 
http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html - is it OK?

-

PR: https://git.openjdk.java.net/jdk/pull/6847


Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

2022-01-27 Thread Anthony Scarpino
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich  
wrote:

Sorry, this fell off my stack with the holidays and everything else going on.
I am fine with proceeding with the change in this PR, which I believe is option 1. 
Have you run the tests to make sure it passes the Known Answer Tests?
Can you also include the before and after JMH benchmarks for SHA3 in the PR? 
They are under test/micro/org/openjdk/bench/java/security/MessageDigests

Thanks

-

PR: https://git.openjdk.java.net/jdk/pull/6847


Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

2022-01-27 Thread Paul Hohensee
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich  
wrote:

Also, going with option 2 first leaves a progression record that might be 
useful to future compiler optimizer implementors.

The option 1 patch looks good too.

-

PR: https://git.openjdk.java.net/jdk/pull/6847


Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

2022-01-27 Thread Paul Hohensee
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich  
wrote:

Hi, Boris,

I'd go with option 2 for this patch and option 1 in a separate JBS issue. Safer 
and more straightforward that way. The only change I'd make to the [option 2 
patch](https://github.com/bulasevich/jdk/commit/095fc9db) is to try replacing 
raw constants with derivations of DM. E.g., in smChi() replace 0, 5, 10, 15, 20, 
and 25 with 0*DM, 1*DM, 2*DM, 3*DM, 4*DM, and 5*DM. In smTheta(), replace a[0] 
with a[0*DM], a[1] with a[0*DM+1], etc. But if doing so causes C2 to fail to 
constant fold, revert to raw constants.
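
To make the suggestion concrete, here is a rough sketch of a chi step written both as a loop and in manually unrolled form with row offsets expressed as multiples of DM. It is not the code from the option 2 patch: the class `ChiUnrollSketch`, the helper `chiRow`, and the declaration `static final int DM = 5` are assumptions for illustration. The JDK-internal @ForceInline annotation is omitted because it is not available to ordinary application code. If DM is declared as a `static final int` with a constant initializer, as assumed here, expressions such as `2 * DM` are compile-time constant expressions in Java, so javac should already emit them as plain constants before C2 is involved.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch; names and structure are not taken from the actual patch.
final class ChiUnrollSketch {
    // Lanes per row of the 5x5 Keccak state. The name DM comes from the review
    // comment; this declaration and value are assumptions for the sketch.
    static final int DM = 5;

    // Chi on one row starting at index y: each lane is XORed with
    // (~next lane & lane after next), indices taken modulo 5 within the row.
    static void chiRow(long[] a, int y) {
        long a0 = a[y], a1 = a[y + 1], a2 = a[y + 2], a3 = a[y + 3], a4 = a[y + 4];
        a[y]     = a0 ^ (~a1 & a2);
        a[y + 1] = a1 ^ (~a2 & a3);
        a[y + 2] = a2 ^ (~a3 & a4);
        a[y + 3] = a3 ^ (~a4 & a0);
        a[y + 4] = a4 ^ (~a0 & a1);
    }

    // Loop form: the trip count depends on how the bound is expressed.
    static void chiLoop(long[] a) {
        for (int y = 0; y < 5 * DM; y += DM) {
            chiRow(a, y);
        }
    }

    // Manually unrolled form with offsets written as k*DM instead of 0, 5, 10, ...
    static void chiUnrolled(long[] a) {
        chiRow(a, 0 * DM);
        chiRow(a, 1 * DM);
        chiRow(a, 2 * DM);
        chiRow(a, 3 * DM);
        chiRow(a, 4 * DM);
    }

    // Applies both forms to copies of the same pseudo-random state and compares.
    static boolean formsAgree(long seed) {
        long[] s1 = new Random(seed).longs(DM * DM).toArray();
        long[] s2 = s1.clone();
        chiLoop(s1);
        chiUnrolled(s2);
        return Arrays.equals(s1, s2);
    }

    public static void main(String[] args) {
        System.out.println("loop and unrolled forms agree: " + formsAgree(42L));
    }
}
```

Whether C2 treats the two forms identically in the real SHA3 class would still need to be confirmed with the benchmarks, as the comment above suggests.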

-

PR: https://git.openjdk.java.net/jdk/pull/6847


Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

2021-12-15 Thread Sergey Kuksenko
On Wed, 15 Dec 2021 14:22:34 GMT, Dmitry Chuyko  wrote:

> @kuksenko You have improved this code before; could you share your opinion?

I think the benchmark results table tells more. :)

-

PR: https://git.openjdk.java.net/jdk/pull/6847


Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

2021-12-15 Thread Dmitry Chuyko
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich  
wrote:

@kuksenko You have improved this code before; could you share your opinion?

-

PR: https://git.openjdk.java.net/jdk/pull/6847