On Tue, 20 Sep 2022 22:12:04 GMT, Xue-Lei Andrew Fan <[email protected]> wrote:
> Hi,
>
> In the message digest implementation, for example SHA256, in JDK, two bitwise
> operations could be improved with equivalent arithmetic, and then the number
> of bitwise operations could be reduced accordingly. Specifically,
> "(x and y) xor ((complement x) and z)" could be replaced with the equivalent
> "z xor (x and (y xor z))", and "(x and y) xor (x and z) xor (y and z)" could
> be replaced with the equivalent "(x and y) xor ((x xor y) and z)". Each
> replacement saves one bitwise operation, and thus improves the performance.
>
> Per my testing on my MacOS laptop, the update to SHA256 improves the message
> digest throughput by 0.5%-0.8%. The improvement is not significant, but
> might be worth it, as the update is pretty simple and trivial, for those
> platforms that do not support a CPU intrinsic for a certain hash algorithm.
>
> This patch updates the SHA2 implementation only. Please let me know what you
> think. If you are good with this little bit of performance, I will update more
> message digest implementations. If no one is interested in these little
> benefits, I will close this PR later.
>
> Thanks,
> Xuelei

(Thinking out loud, because I am on the fence with this change)

First (negative): it departs from the SHA function definitions as stated in most papers. This is a minor thing if we can prove the equivalence, but it still adds to the maintenance overhead. While the translation of `maj` (boolean majority) is more or less straightforward, using the `xor`/`and` distributivity, the translation for `ch` is less so. I had to scribble things down to confirm that the new expression's DNF is the same as the old one's.

Second (negative): I suspect the current code is optimized more for instruction-level parallelism than for the number of operations. (This might be the reason why the SHA papers write it this way, to optimize latency in hardware implementations?)
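Since the equivalence is not obvious at a glance, a quick exhaustive truth-table check can confirm it. This is an illustrative sketch only, not JDK code; the class and method names are my own:

```java
// Sanity check that the rewritten expressions match the classic
// SHA-2 ch/maj definitions. Bitwise operators act on each bit
// independently, so checking all eight 0/1 combinations of (x, y, z)
// proves equality for all 32-bit inputs.
public class Sha2BoolCheck {
    static int chClassic(int x, int y, int z)  { return (x & y) ^ (~x & z); }          // 4 ops
    static int chNew(int x, int y, int z)      { return z ^ (x & (y ^ z)); }           // 3 ops
    static int majClassic(int x, int y, int z) { return (x & y) ^ (x & z) ^ (y & z); } // 5 ops
    static int majNew(int x, int y, int z)     { return (x & y) ^ ((x ^ y) & z); }     // 4 ops

    public static void main(String[] args) {
        for (int x = 0; x <= 1; x++)
            for (int y = 0; y <= 1; y++)
                for (int z = 0; z <= 1; z++) {
                    if (chClassic(x, y, z) != chNew(x, y, z))
                        throw new AssertionError("ch mismatch at " + x + y + z);
                    if (majClassic(x, y, z) != majNew(x, y, z))
                        throw new AssertionError("maj mismatch at " + x + y + z);
                }
        System.out.println("ch and maj rewrites verified");
    }
}
```

Note the check only establishes functional equivalence; it says nothing about the latency/ILP trade-off discussed below.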
In the case of `ch`, we have one operation less, but it is probably a wash, since in the original expression that spare operation can run in parallel with the rest of the code. The `maj` case is worse: we now have three operations on the critical path (the expression tree is one level deeper) instead of two, which inhibits the ILP.

Third (positive): I think the real targets are VM modes which cannot exploit any ILP, so the reduction in op count / bytecode size is sensible. For example, Zero, which runs only in the interpreter, and where bytecode-level ILP cannot be exploited because every bytecode instruction is handled by a large chunk of code. Improvements/regressions in this code are usually visible when hashing the JMODs or jlinking the image. I see about 0.3% faster linux-x86_64-zero-release builds with this patch.

Fourth (positive): it is not unprecedented to optimize for interpreters here. I did https://bugs.openjdk.org/browse/JDK-8256523 the last time in the same code.

Are you getting the +0.8% throughput on microbenchmarks?

-------------

PR: https://git.openjdk.org/jdk/pull/10365
