On Tue, 20 Sep 2022 22:12:04 GMT, Xue-Lei Andrew Fan <[email protected]> wrote:

> Hi,
> 
> In the message digest implementations in the JDK, for example SHA-256, two 
> bitwise expressions could be replaced with equivalent ones, reducing the 
> number of bitwise operations accordingly.  Specifically, 
> "(x and y) xor ((complement x) and z)" could be replaced with the equivalent 
> "z xor (x and (y xor z))", and "(x and y) xor (x and z) xor (y and z)" could 
> be replaced with the equivalent "(x and y) xor ((x xor y) and z)".  Each 
> replacement removes one bitwise operation, and thus improves performance.
> 
> Per my testing on my macOS laptop, the update to SHA-256 improves the message 
> digest throughput by 0.5%-0.8%.  The improvement is not significant, but 
> might be worth it, as the update is simple and trivial, for those 
> platforms that do not have CPU intrinsics for a given hash algorithm.
> 
> This patch updates the SHA-2 implementation only.  Please let me know what 
> you think.  If you are good with this bit of performance, I will update more 
> message digest implementations.  If no one is interested in these small 
> benefits, I will close this PR later.
> 
> Thanks,
> Xuelei

(Thinking out loud, because I am on the fence with this change)

First (negative), it departs from the SHA function definitions as stated in 
most papers. This is a minor thing if we can prove the equivalence, but it 
still adds to maintenance overhead. While the translation of `maj` (boolean 
majority) is more or less straightforward, using `xor`/`and` distributivity, 
the translation of `ch` is less so. I had to scribble things down to verify 
that the new expressions' DNF matches the old one's.
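For concreteness, the equivalences can also be checked exhaustively: since `&`, `^`, and `~` act independently on each bit, verifying the identities for all eight single-bit combinations of (x, y, z) proves them for arbitrary 32-bit words. A quick sketch (not part of the patch; class and variable names are mine):

```java
public class ShaIdentityCheck {
    public static void main(String[] args) {
        for (int x = 0; x <= 1; x++) {
            for (int y = 0; y <= 1; y++) {
                for (int z = 0; z <= 1; z++) {
                    // ch: (x & y) ^ (~x & z)  ==  z ^ (x & (y ^ z))
                    int chOld = ((x & y) ^ (~x & z)) & 1; // mask: ~x sets high bits
                    int chNew = (z ^ (x & (y ^ z))) & 1;
                    // maj: (x & y) ^ (x & z) ^ (y & z)  ==  (x & y) ^ ((x ^ y) & z)
                    int majOld = (x & y) ^ (x & z) ^ (y & z);
                    int majNew = (x & y) ^ ((x ^ y) & z);
                    if (chOld != chNew || majOld != majNew) {
                        throw new AssertionError("mismatch at " + x + y + z);
                    }
                }
            }
        }
        System.out.println("all 8 cases match");
    }
}
```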

Second (negative), I suspect the current code is optimized more for 
instruction-level parallelism than for the number of operations. (This might be 
why the SHA papers do it this way, to optimize latency in hardware 
implementations?) In the case of `ch`, we have one operation less, but it is 
probably a wash, since in the original expression that spare operation can run 
in parallel with the rest of the code. In the `maj` case it is worse: we now 
have three operations on the critical path (the expression tree is one level 
deeper) instead of two, which inhibits the ILP.
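To make the dependency-chain point concrete, here is a sketch of how the two `maj` forms schedule (my own method and variable names, not from the patch):

```java
public class MajDepth {
    // Original form: the three ANDs have no mutual dependencies, so a
    // superscalar core can issue them in the same cycle; only the two
    // XORs are serialized behind them.
    static int majOld(int x, int y, int z) {
        int a = x & y;       // independent
        int b = x & z;       // independent
        int c = y & z;       // independent
        return (a ^ b) ^ c;  // serialized behind the ANDs
    }

    // Rewritten form: one operation fewer, but (t & z) cannot start
    // until t = x ^ y is available, so less of the work can overlap.
    static int majNew(int x, int y, int z) {
        int a = x & y;       // independent of t
        int t = x ^ y;       // feeds the AND below
        return a ^ (t & z);  // t -> AND -> XOR chain
    }

    public static void main(String[] args) {
        // Sanity check: both forms agree on a few sample words.
        int[] samples = {0, -1, 0x5A5A5A5A, 0xDEADBEEF, 123456789};
        for (int x : samples)
            for (int y : samples)
                for (int z : samples)
                    if (majOld(x, y, z) != majNew(x, y, z))
                        throw new AssertionError("mismatch");
        System.out.println("forms agree");
    }
}
```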

Third (positive), I think the real targets are VM modes that cannot exploit any 
ILP, where a reduction in op count / bytecode size is sensible. For example, 
Zero runs only in the interpreter, where bytecode-level ILP cannot be exploited, 
as every bytecode instruction is handled by a large chunk of code. 
Improvements/regressions in this code are usually visible when hashing the 
JMODs or jlinking the image. I see about 0.3% faster linux-x86_64-zero-release 
builds with this patch.

Fourth (positive), it is not unprecedented to optimize for interpreters here. I 
did https://bugs.openjdk.org/browse/JDK-8256523 in the same code last time.

Are you getting +0.8% throughput on microbenchmarks?

-------------

PR: https://git.openjdk.org/jdk/pull/10365
