Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code [v3]
On Tue, 1 Feb 2022 09:52:46 GMT, Boris Ulasevich wrote: >> Background >> >> The goal is to improve SHA3 implementation performance as it runs up to two >> times slower than native (OpenSSL, measured on AMD64 and AArch6464) >> implementation. Some hardware provides SHA3 accelerators, but most (AMD64 >> and most AArch64) do not. >> >> For AArch64 hardware that does support SHA3 accelerators, an intrinsic was >> implemented for ARMv8.2 with SHA3 instructions support: >> - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: >> Implement SHA3 accelerator/intrinsic >> >> SHA3 java implementation have already been reworked to eliminate memory >> consumption and make it work faster: >> - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 >> Hash algorithm performance improvements (~12x speedup) >> The latter issue addressed this previously, but there is still some room to >> improve SHA3 performance with current C2 implementation, which is proposed >> in this PR. >> >> Option 1 (this PR) >> >> With this change I unroll the internal cycles manually, inline it manually, >> and use local variables (not array) for data processing. Such an approach >> gives the best performance (see benchmark results). Without this change >> (with current array data processing) we observe a lot of load/store >> operations in comparison to processing in local variables, both on AMD64 and >> on AArch64. >> Native implementations shows that on AArch64 (32 GPU registers) SHA-3 >> algorithm can hold all 25 data and all temporary variables in registers. C2 >> can't optimize it as well because many regisers are allocated for internal >> usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local >> variables the number of excessive load/stores is much smaller and >> performance result is much better. >> >> Option 2 (alternative) >> >> LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) >> >> This is a more conservative change which minimizes code changes, but it has >> lower performance improvement. Please let me know if you think this change >> is better then main one: in this case I will replace it within this Pull >> Request. >> >> With this change I unroll the internal cycles manually and use @Forceinline >> annotation. Manual unrolling is necessary because C2 does not recognize >> there is a counted cycle that can be completely unrolled. Instead of >> replacing the loop with five loop bodies C2 splits the loop to pre- main- >> and post- loop which is not good for this case. C2 works better when the >> array is created locally, but in our case the array is created on object >> instantiation, so C2 can't prove the array length is constant. The second >> issue affecting performance is that methods with unrolled loops get larger >> and C2 does not inline them. It's addressed here by using @Forceinline >> annotation. >> >> Benchmark results >> >> We measured the change on four platforms: ARM32, PPC, AArch64 (with no >> advanced SHA-3 instructions) and AMD on >> >> [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) >> benchmark. Here is the result: >> http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html >> - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on >> ARM32/PPC/AArch64/AMD64 >> - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on >> ARM32/PPC/AArch64/AMD64 >> >> Testing >> >> Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 >> implementation over the same vectors before and after the change, and >> checking the results are the same. >> The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java >> The test vectors compiled from the [Final Algorithm >> Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt > > Boris Ulasevich has updated the pull request incrementally with one > additional commit since the last revision: > > copyright fix Lgtm as well. - Marked as reviewed by phh (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6847
Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code [v3]
On Tue, 1 Feb 2022 09:52:46 GMT, Boris Ulasevich wrote: >> Background >> >> The goal is to improve SHA3 implementation performance as it runs up to two >> times slower than native (OpenSSL, measured on AMD64 and AArch6464) >> implementation. Some hardware provides SHA3 accelerators, but most (AMD64 >> and most AArch64) do not. >> >> For AArch64 hardware that does support SHA3 accelerators, an intrinsic was >> implemented for ARMv8.2 with SHA3 instructions support: >> - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: >> Implement SHA3 accelerator/intrinsic >> >> SHA3 java implementation have already been reworked to eliminate memory >> consumption and make it work faster: >> - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 >> Hash algorithm performance improvements (~12x speedup) >> The latter issue addressed this previously, but there is still some room to >> improve SHA3 performance with current C2 implementation, which is proposed >> in this PR. >> >> Option 1 (this PR) >> >> With this change I unroll the internal cycles manually, inline it manually, >> and use local variables (not array) for data processing. Such an approach >> gives the best performance (see benchmark results). Without this change >> (with current array data processing) we observe a lot of load/store >> operations in comparison to processing in local variables, both on AMD64 and >> on AArch64. >> Native implementations shows that on AArch64 (32 GPU registers) SHA-3 >> algorithm can hold all 25 data and all temporary variables in registers. C2 >> can't optimize it as well because many regisers are allocated for internal >> usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local >> variables the number of excessive load/stores is much smaller and >> performance result is much better. >> >> Option 2 (alternative) >> >> LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) >> >> This is a more conservative change which minimizes code changes, but it has >> lower performance improvement. Please let me know if you think this change >> is better then main one: in this case I will replace it within this Pull >> Request. >> >> With this change I unroll the internal cycles manually and use @Forceinline >> annotation. Manual unrolling is necessary because C2 does not recognize >> there is a counted cycle that can be completely unrolled. Instead of >> replacing the loop with five loop bodies C2 splits the loop to pre- main- >> and post- loop which is not good for this case. C2 works better when the >> array is created locally, but in our case the array is created on object >> instantiation, so C2 can't prove the array length is constant. The second >> issue affecting performance is that methods with unrolled loops get larger >> and C2 does not inline them. It's addressed here by using @Forceinline >> annotation. >> >> Benchmark results >> >> We measured the change on four platforms: ARM32, PPC, AArch64 (with no >> advanced SHA-3 instructions) and AMD on >> >> [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) >> benchmark. Here is the result: >> http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html >> - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on >> ARM32/PPC/AArch64/AMD64 >> - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on >> ARM32/PPC/AArch64/AMD64 >> >> Testing >> >> Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 >> implementation over the same vectors before and after the change, and >> checking the results are the same. >> The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java >> The test vectors compiled from the [Final Algorithm >> Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt > > Boris Ulasevich has updated the pull request incrementally with one > additional commit since the last revision: > > copyright fix Looks good to me - Marked as reviewed by ascarpino (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6847
Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code [v3]
> Background > > The goal is to improve SHA3 implementation performance as it runs up to two > times slower than native (OpenSSL, measured on AMD64 and AArch6464) > implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and > most AArch64) do not. > > For AArch64 hardware that does support SHA3 accelerators, an intrinsic was > implemented for ARMv8.2 with SHA3 instructions support: > - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: > Implement SHA3 accelerator/intrinsic > > SHA3 java implementation have already been reworked to eliminate memory > consumption and make it work faster: > - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash > algorithm performance improvements (~12x speedup) > The latter issue addressed this previously, but there is still some room to > improve SHA3 performance with current C2 implementation, which is proposed in > this PR. > > Option 1 (this PR) > > With this change I unroll the internal cycles manually, inline it manually, > and use local variables (not array) for data processing. Such an approach > gives the best performance (see benchmark results). Without this change (with > current array data processing) we observe a lot of load/store operations in > comparison to processing in local variables, both on AMD64 and on AArch64. > Native implementations shows that on AArch64 (32 GPU registers) SHA-3 > algorithm can hold all 25 data and all temporary variables in registers. C2 > can't optimize it as well because many regisers are allocated for internal > usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local > variables the number of excessive load/stores is much smaller and performance > result is much better. > > Option 2 (alternative) > > LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) > > This is a more conservative change which minimizes code changes, but it has > lower performance improvement. Please let me know if you think this change is > better then main one: in this case I will replace it within this Pull Request. > > With this change I unroll the internal cycles manually and use @Forceinline > annotation. Manual unrolling is necessary because C2 does not recognize there > is a counted cycle that can be completely unrolled. Instead of replacing the > loop with five loop bodies C2 splits the loop to pre- main- and post- loop > which is not good for this case. C2 works better when the array is created > locally, but in our case the array is created on object instantiation, so C2 > can't prove the array length is constant. The second issue affecting > performance is that methods with unrolled loops get larger and C2 does not > inline them. It's addressed here by using @Forceinline annotation. > > Benchmark results > > We measured the change on four platforms: ARM32, PPC, AArch64 (with no > advanced SHA-3 instructions) and AMD on > > [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) > benchmark. Here is the result: > http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html > - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on > ARM32/PPC/AArch64/AMD64 > - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on > ARM32/PPC/AArch64/AMD64 > > Testing > > Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 > implementation over the same vectors before and after the change, and > checking the results are the same. > The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java > The test vectors compiled from the [Final Algorithm > Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: copyright fix - Changes: - all: https://git.openjdk.java.net/jdk/pull/6847/files - new: https://git.openjdk.java.net/jdk/pull/6847/files/4d15fea4..de07332b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6847&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6847&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/6847.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6847/head:pull/6847 PR: https://git.openjdk.java.net/jdk/pull/6847
Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code [v2]
> Background > > The goal is to improve SHA3 implementation performance as it runs up to two > times slower than native (OpenSSL, measured on AMD64 and AArch6464) > implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and > most AArch64) do not. > > For AArch64 hardware that does support SHA3 accelerators, an intrinsic was > implemented for ARMv8.2 with SHA3 instructions support: > - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: > Implement SHA3 accelerator/intrinsic > > SHA3 java implementation have already been reworked to eliminate memory > consumption and make it work faster: > - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash > algorithm performance improvements (~12x speedup) > The latter issue addressed this previously, but there is still some room to > improve SHA3 performance with current C2 implementation, which is proposed in > this PR. > > Option 1 (this PR) > > With this change I unroll the internal cycles manually, inline it manually, > and use local variables (not array) for data processing. Such an approach > gives the best performance (see benchmark results). Without this change (with > current array data processing) we observe a lot of load/store operations in > comparison to processing in local variables, both on AMD64 and on AArch64. > Native implementations shows that on AArch64 (32 GPU registers) SHA-3 > algorithm can hold all 25 data and all temporary variables in registers. C2 > can't optimize it as well because many regisers are allocated for internal > usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local > variables the number of excessive load/stores is much smaller and performance > result is much better. > > Option 2 (alternative) > > LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) > > This is a more conservative change which minimizes code changes, but it has > lower performance improvement. Please let me know if you think this change is > better then main one: in this case I will replace it within this Pull Request. > > With this change I unroll the internal cycles manually and use @Forceinline > annotation. Manual unrolling is necessary because C2 does not recognize there > is a counted cycle that can be completely unrolled. Instead of replacing the > loop with five loop bodies C2 splits the loop to pre- main- and post- loop > which is not good for this case. C2 works better when the array is created > locally, but in our case the array is created on object instantiation, so C2 > can't prove the array length is constant. The second issue affecting > performance is that methods with unrolled loops get larger and C2 does not > inline them. It's addressed here by using @Forceinline annotation. > > Benchmark results > > We measured the change on four platforms: ARM32, PPC, AArch64 (with no > advanced SHA-3 instructions) and AMD on > > [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) > benchmark. Here is the result: > http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html > - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on > ARM32/PPC/AArch64/AMD64 > - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on > ARM32/PPC/AArch64/AMD64 > > Testing > > Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 > implementation over the same vectors before and after the change, and > checking the results are the same. > The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java > The test vectors compiled from the [Final Algorithm > Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains one additional commit since the last revision: Apply manual inline and unroll, plus moving data into local variables to help C2 with SHA3 impl - Changes: - all: https://git.openjdk.java.net/jdk/pull/6847/files - new: https://git.openjdk.java.net/jdk/pull/6847/files/dd889460..4d15fea4 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6847&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6847&range=00-01 Stats: 46135 lines in 1581 files changed: 29727 ins; 9682 del; 6726 mod Patch: https://git.openjdk.java.net/jdk/pull/6847.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6847/head:pull/6847 PR: https://git.openjdk.java.net/jdk/pull/6847
Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich wrote: > Background > > The goal is to improve SHA3 implementation performance as it runs up to two > times slower than native (OpenSSL, measured on AMD64 and AArch6464) > implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and > most AArch64) do not. > > For AArch64 hardware that does support SHA3 accelerators, an intrinsic was > implemented for ARMv8.2 with SHA3 instructions support: > - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: > Implement SHA3 accelerator/intrinsic > > SHA3 java implementation have already been reworked to eliminate memory > consumption and make it work faster: > - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash > algorithm performance improvements (~12x speedup) > The latter issue addressed this previously, but there is still some room to > improve SHA3 performance with current C2 implementation, which is proposed in > this PR. > > Option 1 (this PR) > > With this change I unroll the internal cycles manually, inline it manually, > and use local variables (not array) for data processing. Such an approach > gives the best performance (see benchmark results). Without this change (with > current array data processing) we observe a lot of load/store operations in > comparison to processing in local variables, both on AMD64 and on AArch64. > Native implementations shows that on AArch64 (32 GPU registers) SHA-3 > algorithm can hold all 25 data and all temporary variables in registers. C2 > can't optimize it as well because many regisers are allocated for internal > usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local > variables the number of excessive load/stores is much smaller and performance > result is much better. > > Option 2 (alternative) > > LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) > > This is a more conservative change which minimizes code changes, but it has > lower performance improvement. Please let me know if you think this change is > better then main one: in this case I will replace it within this Pull Request. > > With this change I unroll the internal cycles manually and use @Forceinline > annotation. Manual unrolling is necessary because C2 does not recognize there > is a counted cycle that can be completely unrolled. Instead of replacing the > loop with five loop bodies C2 splits the loop to pre- main- and post- loop > which is not good for this case. C2 works better when the array is created > locally, but in our case the array is created on object instantiation, so C2 > can't prove the array length is constant. The second issue affecting > performance is that methods with unrolled loops get larger and C2 does not > inline them. It's addressed here by using @Forceinline annotation. > > Benchmark results > > We measured the change on four platforms: ARM32, PPC, AArch64 (with no > advanced SHA-3 instructions) and AMD on > > [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) > benchmark. Here is the result: > http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html > - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on > ARM32/PPC/AArch64/AMD64 > - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on > ARM32/PPC/AArch64/AMD64 > > Testing > > Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 > implementation over the same vectors before and after the change, and > checking the results are the same. > The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java > The test vectors compiled from the [Final Algorithm > Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt Please update the copyright date from 2020 to 2022. I'll run regression tests to make sure things are ok. - PR: https://git.openjdk.java.net/jdk/pull/6847
Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich wrote: > Background > > The goal is to improve SHA3 implementation performance as it runs up to two > times slower than native (OpenSSL, measured on AMD64 and AArch6464) > implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and > most AArch64) do not. > > For AArch64 hardware that does support SHA3 accelerators, an intrinsic was > implemented for ARMv8.2 with SHA3 instructions support: > - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: > Implement SHA3 accelerator/intrinsic > > SHA3 java implementation have already been reworked to eliminate memory > consumption and make it work faster: > - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash > algorithm performance improvements (~12x speedup) > The latter issue addressed this previously, but there is still some room to > improve SHA3 performance with current C2 implementation, which is proposed in > this PR. > > Option 1 (this PR) > > With this change I unroll the internal cycles manually, inline it manually, > and use local variables (not array) for data processing. Such an approach > gives the best performance (see benchmark results). Without this change (with > current array data processing) we observe a lot of load/store operations in > comparison to processing in local variables, both on AMD64 and on AArch64. > Native implementations shows that on AArch64 (32 GPU registers) SHA-3 > algorithm can hold all 25 data and all temporary variables in registers. C2 > can't optimize it as well because many regisers are allocated for internal > usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local > variables the number of excessive load/stores is much smaller and performance > result is much better. > > Option 2 (alternative) > > LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) > > This is a more conservative change which minimizes code changes, but it has > lower performance improvement. Please let me know if you think this change is > better then main one: in this case I will replace it within this Pull Request. > > With this change I unroll the internal cycles manually and use @Forceinline > annotation. Manual unrolling is necessary because C2 does not recognize there > is a counted cycle that can be completely unrolled. Instead of replacing the > loop with five loop bodies C2 splits the loop to pre- main- and post- loop > which is not good for this case. C2 works better when the array is created > locally, but in our case the array is created on object instantiation, so C2 > can't prove the array length is constant. The second issue affecting > performance is that methods with unrolled loops get larger and C2 does not > inline them. It's addressed here by using @Forceinline annotation. > > Benchmark results > > We measured the change on four platforms: ARM32, PPC, AArch64 (with no > advanced SHA-3 instructions) and AMD on > > [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) > benchmark. Here is the result: > http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html > - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on > ARM32/PPC/AArch64/AMD64 > - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on > ARM32/PPC/AArch64/AMD64 > > Testing > > Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 > implementation over the same vectors before and after the change, and > checking the results are the same. > The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java > The test vectors compiled from the [Final Algorithm > Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt Ah yes. Sorry, I saw the replied email that left that out - PR: https://git.openjdk.java.net/jdk/pull/6847
Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich wrote: > Background > > The goal is to improve SHA3 implementation performance as it runs up to two > times slower than native (OpenSSL, measured on AMD64 and AArch6464) > implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and > most AArch64) do not. > > For AArch64 hardware that does support SHA3 accelerators, an intrinsic was > implemented for ARMv8.2 with SHA3 instructions support: > - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: > Implement SHA3 accelerator/intrinsic > > SHA3 java implementation have already been reworked to eliminate memory > consumption and make it work faster: > - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash > algorithm performance improvements (~12x speedup) > The latter issue addressed this previously, but there is still some room to > improve SHA3 performance with current C2 implementation, which is proposed in > this PR. > > Option 1 (this PR) > > With this change I unroll the internal cycles manually, inline it manually, > and use local variables (not array) for data processing. Such an approach > gives the best performance (see benchmark results). Without this change (with > current array data processing) we observe a lot of load/store operations in > comparison to processing in local variables, both on AMD64 and on AArch64. > Native implementations shows that on AArch64 (32 GPU registers) SHA-3 > algorithm can hold all 25 data and all temporary variables in registers. C2 > can't optimize it as well because many regisers are allocated for internal > usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local > variables the number of excessive load/stores is much smaller and performance > result is much better. > > Option 2 (alternative) > > LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) > > This is a more conservative change which minimizes code changes, but it has > lower performance improvement. Please let me know if you think this change is > better then main one: in this case I will replace it within this Pull Request. > > With this change I unroll the internal cycles manually and use @Forceinline > annotation. Manual unrolling is necessary because C2 does not recognize there > is a counted cycle that can be completely unrolled. Instead of replacing the > loop with five loop bodies C2 splits the loop to pre- main- and post- loop > which is not good for this case. C2 works better when the array is created > locally, but in our case the array is created on object instantiation, so C2 > can't prove the array length is constant. The second issue affecting > performance is that methods with unrolled loops get larger and C2 does not > inline them. It's addressed here by using @Forceinline annotation. > > Benchmark results > > We measured the change on four platforms: ARM32, PPC, AArch64 (with no > advanced SHA-3 instructions) and AMD on > > [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) > benchmark. Here is the result: > http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html > - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on > ARM32/PPC/AArch64/AMD64 > - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on > ARM32/PPC/AArch64/AMD64 > > Testing > > Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 > implementation over the same vectors before and after the change, and > checking the results are the same. > The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java > The test vectors compiled from the [Final Algorithm > Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt Hi, Thanks for looking at this! > Have you run the tests to make sure it passes the Known Answer Tests? In my original post I included the test suite and KAT - I have not found ready-to use one, so I did it from scratch using the test vectors provided for KECCAK Final Algorithm Package. > Can you also include the before and after JMH benchmarks for SHA3 in the PR? > they are under test/micro/org/openjdk/bench/java/security/MessageDigests There are before/after MessageDigests benchmark results for option1/options2 on ARM32/PPC/ARM64/AMD: http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html - is it OK? - PR: https://git.openjdk.java.net/jdk/pull/6847
Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich wrote: > Background > > The goal is to improve SHA3 implementation performance as it runs up to two > times slower than native (OpenSSL, measured on AMD64 and AArch6464) > implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and > most AArch64) do not. > > For AArch64 hardware that does support SHA3 accelerators, an intrinsic was > implemented for ARMv8.2 with SHA3 instructions support: > - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: > Implement SHA3 accelerator/intrinsic > > SHA3 java implementation have already been reworked to eliminate memory > consumption and make it work faster: > - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash > algorithm performance improvements (~12x speedup) > The latter issue addressed this previously, but there is still some room to > improve SHA3 performance with current C2 implementation, which is proposed in > this PR. > > Option 1 (this PR) > > With this change I unroll the internal cycles manually, inline it manually, > and use local variables (not array) for data processing. Such an approach > gives the best performance (see benchmark results). Without this change (with > current array data processing) we observe a lot of load/store operations in > comparison to processing in local variables, both on AMD64 and on AArch64. > Native implementations shows that on AArch64 (32 GPU registers) SHA-3 > algorithm can hold all 25 data and all temporary variables in registers. C2 > can't optimize it as well because many regisers are allocated for internal > usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local > variables the number of excessive load/stores is much smaller and performance > result is much better. > > Option 2 (alternative) > > LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) > > This is a more conservative change which minimizes code changes, but it has > lower performance improvement. Please let me know if you think this change is > better then main one: in this case I will replace it within this Pull Request. > > With this change I unroll the internal cycles manually and use @Forceinline > annotation. Manual unrolling is necessary because C2 does not recognize there > is a counted cycle that can be completely unrolled. Instead of replacing the > loop with five loop bodies C2 splits the loop to pre- main- and post- loop > which is not good for this case. C2 works better when the array is created > locally, but in our case the array is created on object instantiation, so C2 > can't prove the array length is constant. The second issue affecting > performance is that methods with unrolled loops get larger and C2 does not > inline them. It's addressed here by using @Forceinline annotation. > > Benchmark results > > We measured the change on four platforms: ARM32, PPC, AArch64 (with no > advanced SHA-3 instructions) and AMD on > > [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) > benchmark. Here is the result: > http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html > - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on > ARM32/PPC/AArch64/AMD64 > - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on > ARM32/PPC/AArch64/AMD64 > > Testing > > Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 > implementation over the same vectors before and after the change, and > checking the results are the same. > The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java > The test vectors compiled from the [Final Algorithm > Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt Sorry this feel off my stack with the holidays and everything else going on. I am fine proceeding change in this PR, option 1 I believe. Have you run the tests to make sure it passes the Known Answer Tests? Can you also include the before and after JMH benchmarks for SHA3 in the PR? they are under test/micro/org/openjdk/bench/java/security/MessageDigests Thanks - PR: https://git.openjdk.java.net/jdk/pull/6847
Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich wrote: > Background > > The goal is to improve SHA3 implementation performance as it runs up to two > times slower than native (OpenSSL, measured on AMD64 and AArch6464) > implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and > most AArch64) do not. > > For AArch64 hardware that does support SHA3 accelerators, an intrinsic was > implemented for ARMv8.2 with SHA3 instructions support: > - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: > Implement SHA3 accelerator/intrinsic > > SHA3 java implementation have already been reworked to eliminate memory > consumption and make it work faster: > - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash > algorithm performance improvements (~12x speedup) > The latter issue addressed this previously, but there is still some room to > improve SHA3 performance with current C2 implementation, which is proposed in > this PR. > > Option 1 (this PR) > > With this change I unroll the internal cycles manually, inline it manually, > and use local variables (not array) for data processing. Such an approach > gives the best performance (see benchmark results). Without this change (with > current array data processing) we observe a lot of load/store operations in > comparison to processing in local variables, both on AMD64 and on AArch64. > Native implementations shows that on AArch64 (32 GPU registers) SHA-3 > algorithm can hold all 25 data and all temporary variables in registers. C2 > can't optimize it as well because many regisers are allocated for internal > usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local > variables the number of excessive load/stores is much smaller and performance > result is much better. > > Option 2 (alternative) > > LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) > > This is a more conservative change which minimizes code changes, but it has > lower performance improvement. Please let me know if you think this change is > better then main one: in this case I will replace it within this Pull Request. > > With this change I unroll the internal cycles manually and use @Forceinline > annotation. Manual unrolling is necessary because C2 does not recognize there > is a counted cycle that can be completely unrolled. Instead of replacing the > loop with five loop bodies C2 splits the loop to pre- main- and post- loop > which is not good for this case. C2 works better when the array is created > locally, but in our case the array is created on object instantiation, so C2 > can't prove the array length is constant. The second issue affecting > performance is that methods with unrolled loops get larger and C2 does not > inline them. It's addressed here by using @Forceinline annotation. > > Benchmark results > > We measured the change on four platforms: ARM32, PPC, AArch64 (with no > advanced SHA-3 instructions) and AMD on > > [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) > benchmark. Here is the result: > http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html > - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on > ARM32/PPC/AArch64/AMD64 > - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on > ARM32/PPC/AArch64/AMD64 > > Testing > > Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 > implementation over the same vectors before and after the change, and > checking the results are the same. > The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java > The test vectors compiled from the [Final Algorithm > Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt Also, going with option 2 first leaves a progression record that might be useful to future compiler optimizer implementors. The option 1 patch looks good too. - PR: https://git.openjdk.java.net/jdk/pull/6847
Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich wrote: > Background > > The goal is to improve SHA3 implementation performance as it runs up to two > times slower than native (OpenSSL, measured on AMD64 and AArch6464) > implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and > most AArch64) do not. > > For AArch64 hardware that does support SHA3 accelerators, an intrinsic was > implemented for ARMv8.2 with SHA3 instructions support: > - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: > Implement SHA3 accelerator/intrinsic > > SHA3 java implementation have already been reworked to eliminate memory > consumption and make it work faster: > - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash > algorithm performance improvements (~12x speedup) > The latter issue addressed this previously, but there is still some room to > improve SHA3 performance with current C2 implementation, which is proposed in > this PR. > > Option 1 (this PR) > > With this change I unroll the internal cycles manually, inline it manually, > and use local variables (not array) for data processing. Such an approach > gives the best performance (see benchmark results). Without this change (with > current array data processing) we observe a lot of load/store operations in > comparison to processing in local variables, both on AMD64 and on AArch64. > Native implementations shows that on AArch64 (32 GPU registers) SHA-3 > algorithm can hold all 25 data and all temporary variables in registers. C2 > can't optimize it as well because many regisers are allocated for internal > usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local > variables the number of excessive load/stores is much smaller and performance > result is much better. > > Option 2 (alternative) > > LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) > > This is a more conservative change which minimizes code changes, but it has > lower performance improvement. Please let me know if you think this change is > better then main one: in this case I will replace it within this Pull Request. > > With this change I unroll the internal cycles manually and use @Forceinline > annotation. Manual unrolling is necessary because C2 does not recognize there > is a counted cycle that can be completely unrolled. Instead of replacing the > loop with five loop bodies C2 splits the loop to pre- main- and post- loop > which is not good for this case. C2 works better when the array is created > locally, but in our case the array is created on object instantiation, so C2 > can't prove the array length is constant. The second issue affecting > performance is that methods with unrolled loops get larger and C2 does not > inline them. It's addressed here by using @Forceinline annotation. > > Benchmark results > > We measured the change on four platforms: ARM32, PPC, AArch64 (with no > advanced SHA-3 instructions) and AMD on > > [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) > benchmark. Here is the result: > http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html > - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on > ARM32/PPC/AArch64/AMD64 > - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on > ARM32/PPC/AArch64/AMD64 > > Testing > > Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 > implementation over the same vectors before and after the change, and > checking the results are the same. > The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java > The test vectors compiled from the [Final Algorithm > Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt Hi, Boris, I'd go with option 2 for this patch and option 1 in a separate JBS issue. Safer and more straightforward that way. The only change I'd make to the [option 2 patch](https://github.com/bulasevich/jdk/commit/095fc9db) is to try replace raw constants with derivations of DM. E.g. in smChi() replace 0, 5, 10, 15, 20, and 25 with 0*DM, 1*DM, 2*DM, 3*DM, 4*DM, and 5*DM. In smTheta, replace a[0] with a[0*DM], a[1] with a[0*DM+1], etc. But, if doing so causes C2 to fail to constant fold, revert to raw constants. - PR: https://git.openjdk.java.net/jdk/pull/6847
Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code
On Wed, 15 Dec 2021 14:22:34 GMT, Dmitry Chuyko wrote: > @kuksenko You used to improve this code before, could you share your opinion? I think benchmark results table tells more. :) - PR: https://git.openjdk.java.net/jdk/pull/6847
Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code
On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich wrote: > Background > > The goal is to improve SHA3 implementation performance as it runs up to two > times slower than native (OpenSSL, measured on AMD64 and AArch6464) > implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and > most AArch64) do not. > > For AArch64 hardware that does support SHA3 accelerators, an intrinsic was > implemented for ARMv8.2 with SHA3 instructions support: > - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: > Implement SHA3 accelerator/intrinsic > > SHA3 java implementation have already been reworked to eliminate memory > consumption and make it work faster: > - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash > algorithm performance improvements (~12x speedup) > The latter issue addressed this previously, but there is still some room to > improve SHA3 performance with current C2 implementation, which is proposed in > this PR. > > Option 1 (this PR) > > With this change I unroll the internal cycles manually, inline it manually, > and use local variables (not array) for data processing. Such an approach > gives the best performance (see benchmark results). Without this change (with > current array data processing) we observe a lot of load/store operations in > comparison to processing in local variables, both on AMD64 and on AArch64. > Native implementations shows that on AArch64 (32 GPU registers) SHA-3 > algorithm can hold all 25 data and all temporary variables in registers. C2 > can't optimize it as well because many regisers are allocated for internal > usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local > variables the number of excessive load/stores is much smaller and performance > result is much better. > > Option 2 (alternative) > > LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) > > This is a more conservative change which minimizes code changes, but it has > lower performance improvement. Please let me know if you think this change is > better then main one: in this case I will replace it within this Pull Request. > > With this change I unroll the internal cycles manually and use @Forceinline > annotation. Manual unrolling is necessary because C2 does not recognize there > is a counted cycle that can be completely unrolled. Instead of replacing the > loop with five loop bodies C2 splits the loop to pre- main- and post- loop > which is not good for this case. C2 works better when the array is created > locally, but in our case the array is created on object instantiation, so C2 > can't prove the array length is constant. The second issue affecting > performance is that methods with unrolled loops get larger and C2 does not > inline them. It's addressed here by using @Forceinline annotation. > > Benchmark results > > We measured the change on four platforms: ARM32, PPC, AArch64 (with no > advanced SHA-3 instructions) and AMD on > > [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) > benchmark. Here is the result: > http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html > - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on > ARM32/PPC/AArch64/AMD64 > - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on > ARM32/PPC/AArch64/AMD64 > > Testing > > Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 > implementation over the same vectors before and after the change, and > checking the results are the same. > The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java > The test vectors compiled from the [Final Algorithm > Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt @kuksenko You used to improve this code before, could you share your opinion? - PR: https://git.openjdk.java.net/jdk/pull/6847