Background

The goal is to improve SHA3 implementation performance as it runs up to two 
times slower than native (OpenSSL, measured on AMD64 and AArch6464) 
implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and 
most ARM64) do not.

For ARM64 hardware that does support SHA3 accelerators, an intrinsic was 
implemented for ARMv8.2 with SHA3 instructions support:
- [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: 
Implement SHA3 accelerator/intrinsic

SHA3 java implementation have already been reworked to eliminate memory 
consumption and make it work faster:
- [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash 
algorithm performance improvements (~12x speedup)
The latter issue addressed this previously, but there is still some room to 
improve SHA3 performance with current C2 implementation, which is proposed in 
this PR.

Option 1 (this PR)

With this change I unroll the internal cycles manually, inline it manually, and 
use local variables (not array) for data processing. Such an approach gives the 
best performance (see benchmark results). Without this change (with current 
array data processing) we observe a lot of load/store operations in comparison 
to processing in local variables, both on AMD64 and on AArch64.  
Native implementations shows that on AArch64 (32 GPU registers) SHA-3 algorithm 
can hold all 25 data and all temporary variables in registers. C2 can't 
optimize it as well because many regisers are allocated for internal usage: 
rscratch1, rscratch2, rmethod, rthread, etc.  With data in local variables the 
number of excessive load/stores is much smaller and performance result is much 
better.

Option 2 (alternative)

LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db)

This is a more conservative change which minimizes code changes, but it has 
lower performance improvement. Please let me know if you think this change is 
better then main one: in this case I will replace it within this Pull Request.

With this change I unroll the internal cycles manually and use @Forceinline 
annotation. Manual unrolling is necessary because C2 does not recognize there 
is a counted cycle that can be completely unrolled. Instead of replacing the 
loop with five loop bodies C2 splits the loop to pre- main- and post- loop 
which is not good for this case. C2 works better when the array is created 
locally, but in our case the array is created on object instantiation, so C2 
can't prove the array length is constant. The second issue affecting 
performance is that methods with unrolled loops get larger and C2 does not 
inline them. It's addressed here by using @Forceinline annotation.

Benchmark results

We measured the change on four platforms: ARM32, PPC, ARM64 (with no advanced 
SHA-3 instructions) and AMD on
  
[MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java)
 benchmark. Here is the result: 
http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html
- fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on 
ARM32/PPC/AArch64/AMD64
- fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on 
ARM32/PPC/AArch64/AMD64

Testing

Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 
implementation over the same vectors before and after the change, and checking 
the results are the same.
The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java
The test vectors compiled from the [Final Algorithm 
Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project):
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt

-------------

Commit messages:
 - Apply manual inline and unroll, plus moving data into local variables to 
help C2 with SHA3 impl

Changes: https://git.openjdk.java.net/jdk/pull/6847/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6847&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8275914
  Stats: 208 lines in 1 file changed: 114 ins; 93 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6847.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6847/head:pull/6847

PR: https://git.openjdk.java.net/jdk/pull/6847

Reply via email to