Re: RFR: 8259498: Reduce overhead of MD5 and SHA digests

Claes Redestad Fri, 15 Jan 2021 15:24:57 -0800

On Fri, 15 Jan 2021 22:54:32 GMT, Valerie Peng <valer...@openjdk.org> wrote:


>> - The MD5 intrinsics added by 
>> [JDK-8250902](https://bugs.openjdk.java.net/browse/JDK-8250902) show that 
>> the `int[] x` isn't actually needed. This also applies to the SHA intrinsics 
>> from which the MD5 intrinsic takes inspiration.
>> - Using VarHandles we can simplify the code in `ByteArrayAccess` enough to 
>> make it acceptable to use inline and replace the array in MD5 wholesale. 
>> This improves performance both in the presence and the absence of the 
>> intrinsic optimization.
>> - Doing the exact same thing in the SHA impls would be unwieldy (64+ element 
>> arrays), but allocating the array lazily gets most of the speed-up in the 
>> presence of an intrinsic while being neutral in its absence.
>> 
>> Baseline:
>> Benchmark                    (digesterName)  (length)  Cnt     Score     Error   Units
>> MessageDigests.digest                   MD5        16   15  2714.307 ±  21.133  ops/ms
>> MessageDigests.digest                   MD5      1024   15   318.087 ±   0.637  ops/ms
>> MessageDigests.digest                 SHA-1        16   15  1387.266 ±  40.932  ops/ms
>> MessageDigests.digest                 SHA-1      1024   15   109.273 ±   0.149  ops/ms
>> MessageDigests.digest               SHA-256        16   15   995.566 ±  21.186  ops/ms
>> MessageDigests.digest               SHA-256      1024   15    89.104 ±   0.079  ops/ms
>> MessageDigests.digest               SHA-512        16   15   803.030 ±  15.722  ops/ms
>> MessageDigests.digest               SHA-512      1024   15   115.611 ±   0.234  ops/ms
>> MessageDigests.getAndDigest             MD5        16   15  2190.367 ±  97.037  ops/ms
>> MessageDigests.getAndDigest             MD5      1024   15   302.903 ±   1.809  ops/ms
>> MessageDigests.getAndDigest           SHA-1        16   15  1262.656 ±  43.751  ops/ms
>> MessageDigests.getAndDigest           SHA-1      1024   15   104.889 ±   3.554  ops/ms
>> MessageDigests.getAndDigest         SHA-256        16   15   914.541 ±  55.621  ops/ms
>> MessageDigests.getAndDigest         SHA-256      1024   15    85.708 ±   1.394  ops/ms
>> MessageDigests.getAndDigest         SHA-512        16   15   737.719 ±  53.671  ops/ms
>> MessageDigests.getAndDigest         SHA-512      1024   15   112.307 ±   1.950  ops/ms
>> 
>> GC:
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm      MD5        16   15   312.011 ±   0.005  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm    SHA-1        16   15   584.020 ±   0.006  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256        16   15   544.019 ±   0.016  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512        16   15  1056.037 ±   0.003  B/op
>> 
>> Target:
>> Benchmark                    (digesterName)  (length)  Cnt     Score     Error   Units
>> MessageDigests.digest                   MD5        16   15  3134.462 ±  43.685  ops/ms
>> MessageDigests.digest                   MD5      1024   15   323.667 ±   0.633  ops/ms
>> MessageDigests.digest                 SHA-1        16   15  1418.742 ±  38.223  ops/ms
>> MessageDigests.digest                 SHA-1      1024   15   110.178 ±   0.788  ops/ms
>> MessageDigests.digest               SHA-256        16   15  1037.949 ±  21.214  ops/ms
>> MessageDigests.digest               SHA-256      1024   15    89.671 ±   0.228  ops/ms
>> MessageDigests.digest               SHA-512        16   15   812.028 ±  39.489  ops/ms
>> MessageDigests.digest               SHA-512      1024   15   116.738 ±   0.249  ops/ms
>> MessageDigests.getAndDigest             MD5        16   15  2314.379 ± 229.294  ops/ms
>> MessageDigests.getAndDigest             MD5      1024   15   307.835 ±   5.730  ops/ms
>> MessageDigests.getAndDigest           SHA-1        16   15  1326.887 ±  63.263  ops/ms
>> MessageDigests.getAndDigest           SHA-1      1024   15   106.611 ±   2.292  ops/ms
>> MessageDigests.getAndDigest         SHA-256        16   15   961.589 ±  82.052  ops/ms
>> MessageDigests.getAndDigest         SHA-256      1024   15    88.646 ±   0.194  ops/ms
>> MessageDigests.getAndDigest         SHA-512        16   15   775.417 ±  56.775  ops/ms
>> MessageDigests.getAndDigest         SHA-512      1024   15   112.904 ±   2.014  ops/ms
>> 
>> GC:
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm      MD5        16   15   232.009 ±   0.006  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm    SHA-1        16   15   584.021 ±   0.001  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256        16   15   272.012 ±   0.015  B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512        16   15   400.017 ±   0.019  B/op
>> 
>> For the `digest` micro, digesting small inputs is faster with all algorithms, 
>> ranging from ~1% for SHA-512 up to ~15% for MD5. The gain stems from not 
>> allocating and reading into a temporary buffer once outside of the 
>> intrinsic. SHA-1 does not see a statistically significant gain because the 
>> intrinsic is disabled by default on my HW.
>> 
>> For the `getAndDigest` micro, which tests 
>> `MessageDigest.getInstance(..).digest(..)`, there are similar gains with this 
>> patch. The interesting aspect here is verifying the reduction in allocations 
>> per operation when there's an active intrinsic (again, not for SHA-1). 
>> JDK-8259065 (#1933) reduced allocations on each of these by 144B/op, which 
>> means allocation pressure for SHA-512 is down two thirds from 1200B/op to 
>> 400B/op in this contrived test.
>> 
>> I've verified there are no regressions in the absence of the intrinsic - 
>> which the SHA-1 numbers here help show.
>
> src/java.base/share/classes/sun/security/provider/ByteArrayAccess.java line 
> 214:
> 
> 
> Why do we remove the index checking from all methods? Isn't it safer to check 
> here in case the caller didn't? Or is such checking already implemented 
> inside the various methods of VarHandle?

Yes, the VarHandle methods do the IOOBE checking themselves, while the Unsafe 
API is unchecked and needs careful precondition checking by the caller. Removing 
the explicit checks doesn't seem to matter for performance (interpreted code 
sees some benefit from the removal).
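
For illustration only, here is a minimal standalone sketch of that behaviour. 
It is not the code from the patch: the class name `VarHandleBoundsDemo` and the 
helpers `readIntLE` and `throwsIoobe` are made up for this example. It only 
assumes the public `MethodHandles.byteArrayViewVarHandle` API, which checks the 
index against the array length on every access.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical demo class, not part of the JDK or of this patch.
public class VarHandleBoundsDemo {
    // Views a byte[] as little-endian ints (the byte order MD5 uses for its input).
    private static final VarHandle LE_INT =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Reads 4 bytes starting at offset; no explicit bounds check is written here.
    public static int readIntLE(byte[] buf, int offset) {
        return (int) LE_INT.get(buf, offset);
    }

    // Returns true if the read at offset is rejected with an IOOBE.
    public static boolean throwsIoobe(byte[] buf, int offset) {
        try {
            readIntLE(buf, offset);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] buf = {0x78, 0x56, 0x34, 0x12, 0, 0};
        System.out.println(Integer.toHexString(readIntLE(buf, 0))); // 12345678
        System.out.println(throwsIoobe(buf, 2));  // false: bytes 2..5 are in range
        System.out.println(throwsIoobe(buf, 3));  // true: would read past the end
        System.out.println(throwsIoobe(buf, -1)); // true: negative index
    }
}
```

An equivalent read through Unsafe would perform no such check, which is why 
the old code had to validate offsets up front.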

With the current usage an IOOBE is probably not observable, but there's a test 
that reflects into ByteArrayAccess and verifies exceptions are thrown as 
expected on faulty inputs.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855
