Hi Ludovic,

On 8/3/20 9:07 PM, Ludovic Henry wrote:
Updated webrev: http://cr.openjdk.java.net/~luhenry/8250902/webrev.02

Next code in inline_digestBase_implCompressMB should be reversed (get_long_*() 
should be called for long_state):

    if (long_state) {
      state = get_state_from_digestBase_object(digestBase_obj);
    } else {
      state = get_long_state_from_digestBase_object(digestBase_obj);
    }

Thanks for pointing that out. I tested everything with `hotspot:tier1` and 
`jdk:tier1` in fastdebug on Windows-x86, Windows-x64 and Linux-x64.

Code in library_call.cpp is good now.


It seems that the algorithm can be optimized further using SSE/AVX 
instructions. I am not aware of any specific SSE/AVX implementation which 
leverages those instructions in the best possible way. Sandhya can chime in 
more on that.

I have done some research prior to implementing this intrinsic and the only 
pointers I could find to vectorized MD5 is on computing _multiple_ MD5 hashes 
in parallel but not a _single_ MD5 hash. Using vectors effectively parallelize 
the computation of many MD5 hash, but it does not accelerate the computation of 
a single MD5 hash. And looking at the algorithm, every step depends on the 
previous step's result, which make it particularly hard to 
parallelize/vectorize.

As far as I know, I came across this which points to MD5 SSE/AVX 
implementation. 
https://software.intel.com/content/www/us/en/develop/articles/intel-isa-l-cryptographic-hashes-for-cloud-storage.html

That library points to computing many MD5 hashes in parallel. Quoting: "Intel® ISA-L 
uses a novel technique called multi-buffer hashing, which [...] compute several hashes at 
once within a single core." That is similar to what I found in researching how to 
vectorize MD5. I also did not find any reference of an ISA-level implementation of MD5, 
neither in x86 nor ARM.

If you can point me to a document describing how to vectorize MD5, I would be 
more than happy to take a look and implement the algorithm. However, my 
understanding is that MD5 is not vectorizable by-design.

I would leave this investigation to Intel's Java group. They are expert in this 
area!

For now, lets put current implementation into JDK.


Add tests to verify intrinsic implementation. You can use 
test/hotspot/jtreg/compiler/intrinsics/sha/ as examples.

I looked at these tests and they already cover MD5. I am not sure what's the 
best way to add tests here: 1. should I rename ` compiler/intrinsics/sha` to ` 
compiler/intrinsics/digest` and add the md5 tests there, 2. should I just add ` 
compiler/intrinsics/md5`, or 3. the name doesn't matter and I can just add it 
in ` compiler/intrinsics/sha`?

3. Just add MD5 tests into existing SHA directory.

Note, compiler/intrinsics/sha testing is done in tier2. I ran it and it passed but it does not test MD5 a lot as I understand.


In vm_version_x86.cpp move UseMD5Intrinsics flag setting near UseSHA flag 
setting.

Fixed.

It is not moved in webrev.02


In new file macroAssembler_x86_md5.cpp no need empty line after copyright line. 
There is also typo 'rrdistribute':

   * This code is free software; you can rrdistribute it and/or modify it

Our validate-headers check failed. See GPL header template: 
./make/templates/gpl-header

I updated the header, and added the license for the original code for the MD5 
core algorithm.

You don't need to use Oracle copyright line. Using original Microsoft's 
copyright line is fine since you are author.
Thank you for adding license for original code.


Did you test it on 32-bit x86?

I did run `hotspot:tier1` and `jdk:tier1` on Windows-x86, Windows-x64 and 
Linux-x64.

Would be interesting to see result of artificially switching off AVX and SSE:
'-XX:UseSSE=0 -XX:UseAVX=0'. It will make sure that only general instructions 
are needed.

The results are below:

Very good. Thank you for testing it.

Regards,
Vladimir


-XX:-UseMD5Intrinsics
Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     
Score    Error   Units
MessageDigests.digest             md5        64     DEFAULT  thrpt   10  
3512.618 ± 9.384  ops/ms
MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   
450.037 ± 1.213  ops/ms
MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    
29.887 ± 0.057  ops/ms
MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     
0.485 ± 0.002  ops/ms

-XX:+UseMD5Intrinsics
Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     
Score   Error   Units
MessageDigests.digest             md5        64     DEFAULT  thrpt   10  4212.156 
± 7.781  ops/ ms => 19% speedup
MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   548.609 
± 1.374  ops/ ms => 22% speedup
MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    37.961 
± 0.079  ops/ ms => 27% speedup
MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.596 
± 0.006  ops/ ms => 23% speedup

-XX:-UseMD5Intrinsics -XX:UseSSE=0 -XX:UseAVX=0
Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     
Score    Error   Units
MessageDigests.digest             md5        64     DEFAULT  thrpt   10  
3462.769 ± 4.992  ops/ms
MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   
443.858 ± 0.576  ops/ms
MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    
29.723 ± 0.480  ops/ms
MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     
0.470 ± 0.001  ops/ms

-XX:+UseMD5Intrinsics -XX:UseSSE=0 -XX:UseAVX=0
Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     
Score   Error   Units
MessageDigests.digest             md5        64     DEFAULT  thrpt   10  4237.219 
± 15.627  ops/ms => 22% speedup
MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   564.625 
±  1.510  ops/ms => 27% speedup
MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    38.004 
±  0.078  ops/ms => 28% speedup
MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.597 
±  0.002  ops/ms => 27% speedup

Thank you,
Ludovic

Reply via email to