Hi David,

The intrinsic that can be implemented for CRC32C will probably be even faster,
as it can make use of the dedicated crc32c instruction instead of carryless
multiplication.
It might not hurt to mock up a quick benchmark using the CRC32C
instruction against JDK 8’s java.util.zip.CRC32 intrinsic.  CRC32
seems to consume 64 bits per instruction, and I think each instruction
depends on its predecessor (naively coded, at least; there’s probably
a hack to run them in parallel, just as there is for fork/join).  So
maybe we should think about that hack, too: simply computing the two
half-CRCs in parallel and then combining them, fork/join style, might
break the data dependence and allow higher speed.
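
For what it’s worth, a quick JMH harness for that comparison could look
something like the sketch below.  It assumes the new class ends up as
java.util.zip.CRC32C (the name is a placeholder for whatever the API
becomes) and that JMH is on the classpath; both benchmark methods
checksum the same random buffer.

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.zip.CRC32;
    import java.util.zip.CRC32C;   // assumed name of the proposed class
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public class CrcBench {
        @Param({"64", "4096", "1048576"})   // small, medium, large buffers
        int size;
        byte[] data;

        @Setup
        public void setup() {
            data = new byte[size];
            ThreadLocalRandom.current().nextBytes(data);
        }

        @Benchmark
        public long crc32() {               // existing intrinsified CRC-32
            CRC32 c = new CRC32();
            c.update(data, 0, data.length);
            return c.getValue();
        }

        @Benchmark
        public long crc32c() {              // the new CRC-32C under test
            CRC32C c = new CRC32C();
            c.update(data, 0, data.length);
            return c.getValue();
        }
    }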

The carryless multiply inner loop also consumes 64 bits per carryless
multiply, but the instructions are not data dependent among themselves,
so it might go faster.  Haswell chips execute clmul especially quickly.

The assembly language I wrote to make that go fast is agnostic about P;
it is parameterized by a structure containing the small set of constants
that make it all go (six 64-bit numbers), derived from powers of x
(regarded as a polynomial) modulo the CRC polynomial P.  If the
endianness conventions match, and if the zip CRC runs faster than the
CRC32 instruction, I could revive it, and modern tools might let us
avoid coding in hex.

Yes, when an intrinsic is added it will definitely need to be compared
to the current crc32.  The calculation speed of the two should at least
be the same, and hopefully we will be able to get crc32c even faster.
Here is a paper describing an algorithm that uses the Intel crc32
instruction, http://dl.acm.org/citation.cfm?id=2108885; it exploits the
fact that crc32 is a 3-cycle instruction to keep three independent
crc32 streams in flight.  (I presume it merges the results with an
algorithm similar to the one in your fork/join implementation.)
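
For reference, zlib already solves that merge step with its
crc32_combine() function: it advances crc1 past len2 zero bytes by
repeated squaring of a GF(2) operator matrix, then xors in crc2.  A
rough Java port could look like the sketch below (class and method
names are made up; 0xEDB88320 is the reflected CRC-32 polynomial, and
0x82F63B78 would be its CRC-32C counterpart).  The main method checks
the combine against a straight CRC32 over the whole buffer.

    import java.util.zip.CRC32;

    public class CrcCombine {
        // Multiply a 32x32 matrix over GF(2) (one int per row) by a vector.
        static int gf2MatrixTimes(int[] mat, int vec) {
            int sum = 0;
            for (int i = 0; vec != 0; vec >>>= 1, i++)
                if ((vec & 1) != 0) sum ^= mat[i];
            return sum;
        }

        // square = mat * mat over GF(2).
        static void gf2MatrixSquare(int[] square, int[] mat) {
            for (int n = 0; n < 32; n++)
                square[n] = gf2MatrixTimes(mat, mat[n]);
        }

        // CRC of A||B from crc(A), crc(B) and len(B) in bytes (zlib-style).
        static int crcCombine(int crc1, int crc2, long len2, int poly) {
            if (len2 <= 0) return crc1;
            int[] even = new int[32];
            int[] odd = new int[32];
            odd[0] = poly;                 // operator for one zero bit
            for (int n = 1, row = 1; n < 32; n++, row <<= 1)
                odd[n] = row;
            gf2MatrixSquare(even, odd);    // two zero bits
            gf2MatrixSquare(odd, even);    // four zero bits
            do {                           // apply len2 zero bytes to crc1
                gf2MatrixSquare(even, odd);    // 8, 32, 128, ... zero bits
                if ((len2 & 1) != 0) crc1 = gf2MatrixTimes(even, crc1);
                len2 >>= 1;
                if (len2 == 0) break;
                gf2MatrixSquare(odd, even);    // 16, 64, 256, ... zero bits
                if ((len2 & 1) != 0) crc1 = gf2MatrixTimes(odd, crc1);
                len2 >>= 1;
            } while (len2 != 0);
            return crc1 ^ crc2;
        }

        public static void main(String[] args) {
            byte[] data = "two halves make a whole".getBytes();
            int half = data.length / 2;
            CRC32 a = new CRC32(); a.update(data, 0, half);
            CRC32 b = new CRC32(); b.update(data, half, data.length - half);
            CRC32 w = new CRC32(); w.update(data, 0, data.length);
            int c = crcCombine((int) a.getValue(), (int) b.getValue(),
                               data.length - half, 0xEDB88320);
            System.out.println(c == (int) w.getValue());   // prints true
        }
    }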

I have also been thinking that using fork/join could help for large
arrays.  I decided not to go after it for this first implementation of
the new API.  The other question is whether that should be something
more explicit, similar to parallel streams, as not everyone might want
or expect the API to suddenly start using multiple threads to calculate
the result.

That’s why we have a system fork/join pool; if you don’t want lots of
concurrency, limit the number of threads in that pool.  Soon enough
people are going to yell at us for not using concurrency when it is
available, and forcing them to use a different API just to get
concurrency will hinder the ability to write portable code.

We definitely should try to do a parallel version that can further
speed up Checksums on very large arrays/buffers, but we probably need
to discuss further and get input on how it should be enabled.  Do we
need to add an optional parallel method, or do we simply make the
current method parallel?  I believe that needs to be discussed and
decided at a more general level, as we might want to parallelize other
APIs as well when possible, and preferably we should follow the same
principle across the whole Java standard library.
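
To make that discussion concrete, a fork/join version over the common
pool could look roughly like the sketch below.  It is only meant to
show the shape of the code, not a proposal for the API itself: it
reuses the zlib-style crcCombine() helper sketched earlier in the
thread, and the threshold is a guess that would have to be tuned by
benchmarking.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;
    import java.util.zip.CRC32;

    class ParallelCrc32 extends RecursiveTask<Integer> {
        static final int THRESHOLD = 512 * 1024;  // guess; tune by benchmark
        final byte[] buf;
        final int off, len;

        ParallelCrc32(byte[] buf, int off, int len) {
            this.buf = buf; this.off = off; this.len = len;
        }

        @Override protected Integer compute() {
            if (len <= THRESHOLD) {                // small: stay sequential
                CRC32 c = new CRC32();
                c.update(buf, off, len);
                return (int) c.getValue();
            }
            int half = len / 2;
            ParallelCrc32 left  = new ParallelCrc32(buf, off, half);
            ParallelCrc32 right = new ParallelCrc32(buf, off + half,
                                                    len - half);
            right.fork();                          // right half runs async
            int leftCrc  = left.compute();         // left half in this thread
            int rightCrc = right.join();
            // merge the half-CRCs with the zlib-style combine shown earlier
            return CrcCombine.crcCombine(leftCrc, rightCrc,
                                         len - half, 0xEDB88320);
        }
    }

    // usage:
    //   int crc = ForkJoinPool.commonPool()
    //                 .invoke(new ParallelCrc32(data, 0, data.length));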

//Staffan
