Hi David,

The intrinsic that can be implemented for CRC32C will probably be even faster,
as it can make use of the dedicated crc32c instruction instead of carryless
multiplication.
It might not hurt to mock up a quick benchmark using the CRC32C
instruction against JDK 8’s java.util.zip.CRC32 intrinsic.  CRC32
seems to consume 64 bits per instruction, and I think each instruction
depends on its predecessor (naively coded, at least; there’s probably
a hack to run them in parallel, just as there is for fork/join).  So
maybe we should think about that hack, too: simply computing the two
half-CRCs in parallel and then combining them, fork/join style, might
break the data dependence and allow higher speed.
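
For what it’s worth, a quick JMH harness for that comparison could look
something like the sketch below.  It assumes the new class ends up as
java.util.zip.CRC32C (the name is a placeholder for whatever the API
becomes) and that JMH is on the classpath; both benchmark methods
checksum the same random buffer.

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.zip.CRC32;
    import java.util.zip.CRC32C;   // assumed name of the proposed class
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public class CrcBench {
        @Param({"64", "4096", "1048576"})   // small, medium, large buffers
        int size;
        byte[] data;

        @Setup
        public void setup() {
            data = new byte[size];
            ThreadLocalRandom.current().nextBytes(data);
        }

        @Benchmark
        public long crc32() {               // existing intrinsified CRC-32
            CRC32 c = new CRC32();
            c.update(data, 0, data.length);
            return c.getValue();
        }

        @Benchmark
        public long crc32c() {              // the new CRC-32C under test
            CRC32C c = new CRC32C();
            c.update(data, 0, data.length);
            return c.getValue();
        }
    }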

The carryless multiply inner loop also consumes 64 bits per carryless
multiply, but the instructions are not data dependent among themselves,
so it might go faster.  Haswell chips execute clmul especially quickly.

The assembly language I wrote to make that go fast is agnostic about P;
it is parameterized by a structure containing the small set of constants
that make it all go (six 64-bit numbers), derived from powers of x
(regarded as a polynomial) modulo the CRC polynomial P.  If the
endianness conventions match, and if the zip CRC runs faster than the
CRC32 instruction, I could revive it, and modern tools might let us
avoid coding in hex.

Yes, when an intrinsic is added it will definitely need to be compared
to the current crc32.  The calculation speed of the two should at least
be the same, and hopefully we will be able to get crc32c even faster.
Here is a paper describing an algorithm that uses the Intel crc32
instruction, http://dl.acm.org/citation.cfm?id=2108885; it exploits the
fact that crc32 is a 3-cycle instruction to keep three independent
crc32 streams in flight.  (I presume it merges the results with an
algorithm similar to the one in your fork/join implementation.)
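
For reference, zlib already solves that merge step with its
crc32_combine() function: it advances crc1 past len2 zero bytes by
repeated squaring of a GF(2) operator matrix, then xors in crc2.  A
rough Java port could look like the sketch below (class and method
names are made up; 0xEDB88320 is the reflected CRC-32 polynomial, and
0x82F63B78 would be its CRC-32C counterpart).  The main method checks
the combine against a straight CRC32 over the whole buffer.

    import java.util.zip.CRC32;

    public class CrcCombine {
        // Multiply a 32x32 matrix over GF(2) (one int per row) by a vector.
        static int gf2MatrixTimes(int[] mat, int vec) {
            int sum = 0;
            for (int i = 0; vec != 0; vec >>>= 1, i++)
                if ((vec & 1) != 0) sum ^= mat[i];
            return sum;
        }

        // square = mat * mat over GF(2).
        static void gf2MatrixSquare(int[] square, int[] mat) {
            for (int n = 0; n < 32; n++)
                square[n] = gf2MatrixTimes(mat, mat[n]);
        }

        // CRC of A||B from crc(A), crc(B) and len(B) in bytes (zlib-style).
        static int crcCombine(int crc1, int crc2, long len2, int poly) {
            if (len2 <= 0) return crc1;
            int[] even = new int[32];
            int[] odd = new int[32];
            odd[0] = poly;                 // operator for one zero bit
            for (int n = 1, row = 1; n < 32; n++, row <<= 1)
                odd[n] = row;
            gf2MatrixSquare(even, odd);    // two zero bits
            gf2MatrixSquare(odd, even);    // four zero bits
            do {                           // apply len2 zero bytes to crc1
                gf2MatrixSquare(even, odd);    // 8, 32, 128, ... zero bits
                if ((len2 & 1) != 0) crc1 = gf2MatrixTimes(even, crc1);
                len2 >>= 1;
                if (len2 == 0) break;
                gf2MatrixSquare(odd, even);    // 16, 64, 256, ... zero bits
                if ((len2 & 1) != 0) crc1 = gf2MatrixTimes(odd, crc1);
                len2 >>= 1;
            } while (len2 != 0);
            return crc1 ^ crc2;
        }

        public static void main(String[] args) {
            byte[] data = "two halves make a whole".getBytes();
            int half = data.length / 2;
            CRC32 a = new CRC32(); a.update(data, 0, half);
            CRC32 b = new CRC32(); b.update(data, half, data.length - half);
            CRC32 w = new CRC32(); w.update(data, 0, data.length);
            int c = crcCombine((int) a.getValue(), (int) b.getValue(),
                               data.length - half, 0xEDB88320);
            System.out.println(c == (int) w.getValue());   // prints true
        }
    }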

I have also been thinking that using fork/join could help for large
arrays.  I decided not to go after it for this first implementation of
the new API.  The other question is whether that should be something
more explicit, similar to parallel streams, as not everyone might want
or expect the API to suddenly start using multiple threads to calculate
the result.

That’s why we have a system fork/join pool; if you don’t want lots of
concurrency, limit the number of threads in that pool.  Soon enough
people are going to yell at us for not using concurrency when it is
available, and forcing them to use a different API just to get
concurrency will hinder the ability to write portable code.

We definitely should try to do a parallel version that can further
speed up Checksums on very large arrays/buffers, but we probably need
to discuss further and get input on how it should be enabled.  Do we
need to add an optional parallel method, or do we simply make the
current method parallel?  I believe that needs to be discussed and
decided at a more general level, as we might want to parallelize other
APIs as well when possible, and preferably we should follow the same
principle across the whole Java standard library.
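
To make that discussion concrete, a fork/join version over the common
pool could look roughly like the sketch below.  It is only meant to
show the shape of the code, not a proposal for the API itself: it
reuses the zlib-style crcCombine() helper sketched earlier in the
thread, and the threshold is a guess that would have to be tuned by
benchmarking.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;
    import java.util.zip.CRC32;

    class ParallelCrc32 extends RecursiveTask<Integer> {
        static final int THRESHOLD = 512 * 1024;  // guess; tune by benchmark
        final byte[] buf;
        final int off, len;

        ParallelCrc32(byte[] buf, int off, int len) {
            this.buf = buf; this.off = off; this.len = len;
        }

        @Override protected Integer compute() {
            if (len <= THRESHOLD) {                // small: stay sequential
                CRC32 c = new CRC32();
                c.update(buf, off, len);
                return (int) c.getValue();
            }
            int half = len / 2;
            ParallelCrc32 left  = new ParallelCrc32(buf, off, half);
            ParallelCrc32 right = new ParallelCrc32(buf, off + half,
                                                    len - half);
            right.fork();                          // right half runs async
            int leftCrc  = left.compute();         // left half in this thread
            int rightCrc = right.join();
            // merge the half-CRCs with the zlib-style combine shown earlier
            return CrcCombine.crcCombine(leftCrc, rightCrc,
                                         len - half, 0xEDB88320);
        }
    }

    // usage:
    //   int crc = ForkJoinPool.commonPool()
    //                 .invoke(new ParallelCrc32(data, 0, data.length));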

//Staffan
