Thanks, Ludovic, for the detailed explanation, and Sandhya for the
clarification on the vectorization.

Regards,
Vivek


On Mon, Aug 3, 2020 at 9:07 PM Ludovic Henry <luhe...@microsoft.com> wrote:

> Updated webrev: http://cr.openjdk.java.net/~luhenry/8250902/webrev.02
>
> > The following code in inline_digestBase_implCompressMB should be reversed
> > (get_long_*() should be called for long_state):
> >
> >    if (long_state) {
> >      state = get_state_from_digestBase_object(digestBase_obj);
> >    } else {
> >      state = get_long_state_from_digestBase_object(digestBase_obj);
> >    }
>
> Thanks for pointing that out. I tested everything with `hotspot:tier1` and
> `jdk:tier1` in fastdebug on Windows-x86, Windows-x64 and Linux-x64.
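>
> For reference, the swap the review asks for would make that check read as
> follows (same names as the snippet above; just the two calls exchanged):
>
>    if (long_state) {
>      state = get_long_state_from_digestBase_object(digestBase_obj);
>    } else {
>      state = get_state_from_digestBase_object(digestBase_obj);
>    }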
>
> > It seems that the algorithm can be optimized further using SSE/AVX
> instructions. I am not aware of any specific SSE/AVX implementation which
> leverages those instructions in the best possible way. Sandhya can chime in
> more on that.
>
> I did some research prior to implementing this intrinsic, and the only
> pointers I could find to vectorized MD5 are about computing _multiple_ MD5
> hashes in parallel, not a _single_ MD5 hash. Using vectors effectively
> parallelizes the computation of many MD5 hashes, but it does not accelerate
> the computation of a single MD5 hash. And looking at the algorithm, every
> step depends on the previous step's result, which makes it particularly
> hard to parallelize/vectorize.
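>
> To make the dependency concrete, here is one MD5 step written out in plain
> C (my own illustrative sketch, not code from this patch): the freshly
> computed `a` becomes the `b` input of the very next step, so step i+1
> cannot start before step i has finished, and there is no independent work
> within a single hash to spread across vector lanes.
>
>    #include <stdint.h>
>
>    static inline uint32_t rotl32(uint32_t x, int s) {
>      return (x << s) | (x >> (32 - s));
>    }
>
>    // One round-1 MD5 step; the return value feeds the next step as 'b'.
>    static uint32_t md5_step(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
>                             uint32_t m, uint32_t k, int s) {
>      uint32_t f = (b & c) | (~b & d);      // F(b, c, d)
>      return b + rotl32(a + f + m + k, s);  // depends on all of a, b, c, d
>    }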
>
> > As far as I know, I came across this, which points to an MD5 SSE/AVX
> > implementation:
> > https://software.intel.com/content/www/us/en/develop/articles/intel-isa-l-cryptographic-hashes-for-cloud-storage.html
>
> That library is about computing many MD5 hashes in parallel. Quoting:
> "Intel® ISA-L uses a novel technique called multi-buffer hashing, which
> [...] compute several hashes at once within a single core." That is similar
> to what I found when researching how to vectorize MD5. I also did not find
> any reference to an ISA-level implementation of MD5, in either x86 or ARM.
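>
> To make the multi-buffer idea concrete, here is a rough SSE2 sketch (my own
> illustration, not ISA-L's actual code): the four 32-bit lanes of one
> register hold the state of four *different* messages, so one vector step
> advances four independent hashes at once, while a single hash gains nothing.
>
>    #include <immintrin.h>
>    #include <stdint.h>
>
>    // Rotate each of the four 32-bit lanes left by s.
>    static inline __m128i rotl32x4(__m128i x, int s) {
>      return _mm_or_si128(_mm_slli_epi32(x, s), _mm_srli_epi32(x, 32 - s));
>    }
>
>    // One round-1 MD5 step applied lane-wise: lane j of a/b/c/d/m holds the
>    // state and message word of message j, so four hashes advance per step.
>    static __m128i md5_step_x4(__m128i a, __m128i b, __m128i c, __m128i d,
>                               __m128i m, uint32_t k, int s) {
>      __m128i f = _mm_or_si128(_mm_and_si128(b, c), _mm_andnot_si128(b, d));
>      __m128i t = _mm_add_epi32(_mm_add_epi32(a, f),
>                                _mm_add_epi32(m, _mm_set1_epi32((int)k)));
>      return _mm_add_epi32(b, rotl32x4(t, s));
>    }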
>
> If you can point me to a document describing how to vectorize MD5, I would
> be more than happy to take a look and implement the algorithm. However, my
> understanding is that MD5 is not vectorizable by design.
>
> > Add tests to verify intrinsic implementation. You can use
> test/hotspot/jtreg/compiler/intrinsics/sha/ as examples.
>
> I looked at these tests and they already cover MD5. I am not sure what the
> best way to add tests here is: 1. should I rename `compiler/intrinsics/sha`
> to `compiler/intrinsics/digest` and add the MD5 tests there, 2. should I
> just add `compiler/intrinsics/md5`, or 3. does the name not matter, so I
> can just add them under `compiler/intrinsics/sha`?
>
> > In vm_version_x86.cpp, move the UseMD5Intrinsics flag setting near the
> > UseSHA flag setting.
>
> Fixed.
>
> > In the new file macroAssembler_x86_md5.cpp there is no need for an empty
> > line after the copyright line. There is also a typo, 'rrdistribute':
> >
> >   * This code is free software; you can rrdistribute it and/or modify it
> >
> > Our validate-headers check failed. See GPL header template:
> ./make/templates/gpl-header
>
> I updated the header and added the license for the original code of the
> MD5 core algorithm.
>
> > Did you test it on 32-bit x86?
>
> I did run `hotspot:tier1` and `jdk:tier1` on Windows-x86, Windows-x64 and
> Linux-x64.
>
> > It would be interesting to see the results of artificially switching off
> > AVX and SSE: '-XX:UseSSE=0 -XX:UseAVX=0'. That would confirm that only
> > general-purpose instructions are needed.
>
> The results are below:
>
> -XX:-UseMD5Intrinsics
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error  Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  3512.618 ±  9.384  ops/ms
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   450.037 ±  1.213  ops/ms
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    29.887 ±  0.057  ops/ms
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.485 ±  0.002  ops/ms
>
> -XX:+UseMD5Intrinsics
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error  Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  4212.156 ±  7.781  ops/ms  => 19% speedup
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   548.609 ±  1.374  ops/ms  => 22% speedup
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    37.961 ±  0.079  ops/ms  => 27% speedup
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.596 ±  0.006  ops/ms  => 23% speedup
>
> -XX:-UseMD5Intrinsics -XX:UseSSE=0 -XX:UseAVX=0
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error  Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  3462.769 ±  4.992  ops/ms
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   443.858 ±  0.576  ops/ms
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    29.723 ±  0.480  ops/ms
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.470 ±  0.001  ops/ms
>
> -XX:+UseMD5Intrinsics -XX:UseSSE=0 -XX:UseAVX=0
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error  Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  4237.219 ± 15.627  ops/ms  => 22% speedup
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   564.625 ±  1.510  ops/ms  => 27% speedup
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    38.004 ±  0.078  ops/ms  => 28% speedup
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.597 ±  0.002  ops/ms  => 27% speedup
>
> Thank you,
> Ludovic
>
-- 
Thanks and Regards,

Vivek Deshpande
viv.d...@gmail.com
