Thanks, Ludovic, for the detailed explanation, and Sandhya for the clarification on the vectorization.
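
To make the serial-dependency point concrete, here is a minimal sketch of the MD5 block function in plain Java, following RFC 1321. It is not the HotSpot intrinsic and not the JDK's own MD5 code; the names `Md5DependencySketch`, `compress` and `md5Hex` are made up for illustration. Each of the 64 steps reads the `b` produced by the step before it, so the steps of a single hash cannot run side by side in vector lanes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of the webrev or the JDK.
class Md5DependencySketch {
    // Per-step left-rotation amounts from RFC 1321, four per round.
    private static final int[] SHIFTS = {
        7, 12, 17, 22,   5, 9, 14, 20,   4, 11, 16, 23,   6, 10, 15, 21
    };

    // Processes one 64-byte block starting at buf[off]; state holds {a, b, c, d}.
    static void compress(int[] state, byte[] buf, int off) {
        int[] x = new int[16];
        for (int i = 0; i < 16; i++) {
            int p = off + 4 * i;
            // MD5 reads the block as little-endian 32-bit words.
            x[i] = (buf[p] & 0xff) | (buf[p + 1] & 0xff) << 8
                 | (buf[p + 2] & 0xff) << 16 | (buf[p + 3] & 0xff) << 24;
        }
        int a = state[0], b = state[1], c = state[2], d = state[3];
        for (int i = 0; i < 64; i++) {
            int f, k;
            switch (i >>> 4) {                       // one round per 16 steps
                case 0:  f = (b & c) | (~b & d); k = i;                break;
                case 1:  f = (b & d) | (c & ~d); k = (5 * i + 1) & 15; break;
                case 2:  f = b ^ c ^ d;          k = (3 * i + 5) & 15; break;
                default: f = c ^ (b | ~d);       k = (7 * i) & 15;     break;
            }
            // Step constant: floor(2^32 * |sin(i + 1)|), as defined in RFC 1321.
            int t = (int) (long) (Math.abs(Math.sin(i + 1)) * 4294967296.0);
            int s = SHIFTS[(i >>> 4) * 4 + (i & 3)];
            // The new b needs a, b, c and d from the previous step:
            // this is the dependency chain that blocks vectorizing one hash.
            int newB = b + Integer.rotateLeft(a + f + x[k] + t, s);
            a = d;
            d = c;
            c = b;
            b = newB;
        }
        state[0] += a;
        state[1] += b;
        state[2] += c;
        state[3] += d;
    }

    // Pads the message per RFC 1321 and returns the digest as lowercase hex.
    static String md5Hex(byte[] msg) {
        int padded = ((msg.length + 8) / 64 + 1) * 64;
        byte[] buf = new byte[padded];
        System.arraycopy(msg, 0, buf, 0, msg.length);
        buf[msg.length] = (byte) 0x80;               // single 1 bit, then zeros
        long bits = (long) msg.length * 8;
        for (int i = 0; i < 8; i++) {                // 64-bit length, little-endian
            buf[padded - 8 + i] = (byte) (bits >>> (8 * i));
        }
        int[] state = {0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476};
        for (int off = 0; off < padded; off += 64) {
            compress(state, buf, off);               // blocks are chained as well
        }
        StringBuilder sb = new StringBuilder();
        for (int w : state) {
            for (int i = 0; i < 4; i++) {
                sb.append(String.format("%02x", (w >>> (8 * i)) & 0xff));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The output can be checked against `java.security.MessageDigest.getInstance("MD5")`. The multi-buffer approach quoted below gets its parallelism from running this same loop on several independent messages at once, one per vector lane, which is why it helps throughput across many hashes but not the latency of one.
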

Regards,
Vivek

On Mon, Aug 3, 2020 at 9:07 PM Ludovic Henry <luhe...@microsoft.com> wrote:

> Updated webrev: http://cr.openjdk.java.net/~luhenry/8250902/webrev.02
>
> > Next code in inline_digestBase_implCompressMB should be reversed
> > (get_long_*() should be called for long_state):
> >
> >   if (long_state) {
> >     state = get_state_from_digestBase_object(digestBase_obj);
> >   } else {
> >     state = get_long_state_from_digestBase_object(digestBase_obj);
> >   }
>
> Thanks for pointing that out. I tested everything with `hotspot:tier1` and
> `jdk:tier1` in fastdebug on Windows-x86, Windows-x64 and Linux-x64.
>
> > It seems that the algorithm can be optimized further using SSE/AVX
> > instructions. I am not aware of any specific SSE/AVX implementation which
> > leverages those instructions in the best possible way. Sandhya can chime in
> > more on that.
>
> I did some research before implementing this intrinsic, and the only
> pointers I could find to vectorized MD5 are about computing _multiple_ MD5
> hashes in parallel, not a _single_ MD5 hash. Using vectors effectively
> parallelizes the computation of many MD5 hashes, but it does not accelerate
> the computation of a single MD5 hash. And looking at the algorithm, every
> step depends on the previous step's result, which makes it particularly hard
> to parallelize/vectorize.
>
> > As far as I know, I came across this which points to MD5 SSE/AVX
> > implementation.
> > https://software.intel.com/content/www/us/en/develop/articles/intel-isa-l-cryptographic-hashes-for-cloud-storage.html
>
> That library points to computing many MD5 hashes in parallel. Quoting:
> "Intel® ISA-L uses a novel technique called multi-buffer hashing, which
> [...] compute several hashes at once within a single core." That is similar
> to what I found in researching how to vectorize MD5. I also did not find
> any reference to an ISA-level implementation of MD5, neither in x86 nor ARM.
>
> If you can point me to a document describing how to vectorize MD5, I would
> be more than happy to take a look and implement the algorithm. However, my
> understanding is that MD5 is not vectorizable by design.
>
> > Add tests to verify intrinsic implementation. You can use
> > test/hotspot/jtreg/compiler/intrinsics/sha/ as examples.
>
> I looked at these tests and they already cover MD5. I am not sure what's
> the best way to add tests here: 1. should I rename
> `compiler/intrinsics/sha` to `compiler/intrinsics/digest` and add the md5
> tests there, 2. should I just add `compiler/intrinsics/md5`, or 3. the
> name doesn't matter and I can just add it in `compiler/intrinsics/sha`?
>
> > In vm_version_x86.cpp move UseMD5Intrinsics flag setting near UseSHA
> > flag setting.
>
> Fixed.
>
> > In new file macroAssembler_x86_md5.cpp no need empty line after
> > copyright line. There is also typo 'rrdistribute':
> >
> >  * This code is free software; you can rrdistribute it and/or modify it
> >
> > Our validate-headers check failed. See GPL header template:
> > ./make/templates/gpl-header
>
> I updated the header, and added the license for the original code for the
> MD5 core algorithm.
>
> > Did you test it on 32-bit x86?
>
> I did run `hotspot:tier1` and `jdk:tier1` on Windows-x86, Windows-x64 and
> Linux-x64.
>
> > Would be interesting to see result of artificially switching off AVX and
> > SSE:
> > '-XX:UseSSE=0 -XX:UseAVX=0'. It will make sure that only general
> > instructions are needed.
>
> The results are below:
>
> -XX:-UseMD5Intrinsics
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error  Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  3512.618 ±  9.384  ops/ms
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   450.037 ±  1.213  ops/ms
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    29.887 ±  0.057  ops/ms
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.485 ±  0.002  ops/ms
>
> -XX:+UseMD5Intrinsics
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error  Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  4212.156 ±  7.781  ops/ms  => 19% speedup
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   548.609 ±  1.374  ops/ms  => 22% speedup
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    37.961 ±  0.079  ops/ms  => 27% speedup
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.596 ±  0.006  ops/ms  => 23% speedup
>
> -XX:-UseMD5Intrinsics -XX:UseSSE=0 -XX:UseAVX=0
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error  Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  3462.769 ±  4.992  ops/ms
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   443.858 ±  0.576  ops/ms
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    29.723 ±  0.480  ops/ms
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.470 ±  0.001  ops/ms
>
> -XX:+UseMD5Intrinsics -XX:UseSSE=0 -XX:UseAVX=0
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error  Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  4237.219 ± 15.627  ops/ms  => 22% speedup
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   564.625 ±  1.510  ops/ms  => 27% speedup
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    38.004 ±  0.078  ops/ms  => 28% speedup
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.597 ±  0.002  ops/ms  => 27% speedup
>
> Thank you,
> Ludovic
>

--
Thanks and Regards,
Vivek Deshpande
viv.d...@gmail.com