webrev: http://cr.openjdk.java.net/~drchase/7088419/webrev.01/

problem: Some applications spend a surprising amount of time computing CRC32s
(Not sure I'm supposed to be precise on an open mailing list).  Recent Intel
architectures provide instructions that might be useful in addressing this.

See https://jbs.oracle.com/bugs/browse/JDK-7088419

I turned this into a general attack on performance in Adler32 and CRC32, partly 
because the bug report was not clear on the size of the problematic inputs.  
The general approach turned out to be useful, because it's tricky to get the 
small-case overheads down for the accelerated-instruction version of the code.


fix: 
1) For CRC32 and Adler32, break out the "small" case (single bytes, and buffers 
up to around 60-80 bytes) to be computed on the Java side, avoiding JNI 
overheads.
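For anyone curious what the Java-side small case looks like, it is essentially
the classic table-driven loop.  A minimal sketch (my illustration, not the code
from the webrev; class and method names are made up), cross-checked against
java.util.zip.CRC32:

```java
// Sketch of a table-driven CRC-32 (reflected polynomial 0xEDB88320),
// the kind of loop that can handle the "small" case on the Java side
// without paying JNI overheads.  Illustrative only, not the patch code.
public class SmallCrc32 {
    private static final int[] TABLE = new int[256];
    static {
        for (int n = 0; n < 256; n++) {
            int c = n;
            for (int k = 0; k < 8; k++) {
                c = (c & 1) != 0 ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
            }
            TABLE[n] = c;
        }
    }

    /** Update a running CRC; pass ~0 initially, complement at the end. */
    public static int update(int crc, byte[] buf, int off, int len) {
        for (int i = off; i < off + len; i++) {
            crc = (crc >>> 8) ^ TABLE[(crc ^ buf[i]) & 0xff];
        }
        return crc;
    }

    /** Convenience: CRC-32 of a whole array, as an unsigned 32-bit value. */
    public static long crc32(byte[] buf) {
        return (~update(~0, buf, 0, buf.length)) & 0xffffffffL;
    }
}
```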

2) For CRC32 and Adler32, figure out the "combine" operations for the checksum 
of a concatenated pair of byte sequences, and add fork-join parallelism for 
large-enough inputs.  Tuning this is hard, so large "small" buffer sizes were 
chosen (< 512K for unaccelerated CRC32, 1MB for Adler32 and accelerated CRC32) 
to make this a safe optimization.  This can be disabled by setting the 
(not-yet-documented, perhaps wrongly named) property "sun.zip.serialOnly=true". 
I just now noticed that this property is not yet consulted in Adler32; assuming 
we agree on the existence of this flag and its name, it needs to be added 
there, too.
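For Adler32 the combine step can be worked out in closed form: if a1,b1 are the 
component sums for the first sequence and a2,b2 for the second (len2 bytes), 
then for the concatenation a = a1 + a2 - 1 and b = b1 + b2 + len2*(a1 - 1), 
both mod 65521.  A hedged sketch of that arithmetic (my own derivation for 
illustration, not the patch code), checked against java.util.zip.Adler32:

```java
// Sketch of combining the Adler-32 checksums of two concatenated byte
// sequences, so each half can be checksummed by a separate fork-join task.
// Derivation, names, and shape are illustrative, not the webrev's code.
public class AdlerCombine {
    private static final long MOD = 65521;

    /** adler1 = checksum of seq1, adler2 = checksum of seq2, len2 = seq2's length. */
    public static long combine(long adler1, long adler2, long len2) {
        long rem = len2 % MOD;
        long a1 = adler1 & 0xffff, b1 = (adler1 >>> 16) & 0xffff;
        long a2 = adler2 & 0xffff, b2 = (adler2 >>> 16) & 0xffff;
        long a = (a1 + a2 + MOD - 1) % MOD;                     // a1 + a2 - 1
        long b = (b1 + b2 + rem * ((a1 + MOD - 1) % MOD)) % MOD; // b1 + b2 + len2*(a1-1)
        return (b << 16) | a;
    }
}
```

Given a combine like this, the fork-join part is just a RecursiveTask that 
splits, recurses, and combines.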

3) For Adler32, defer calculation of the actual "Adler" checksum until it is 
requested.  This is an optimization for byte-at-a-time use.
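The deferral amounts to keeping the raw component sums and doing the modulo 
reduction and packing only when the value is asked for (or when the sums risk 
overflow, every NMAX bytes, as in zlib).  A sketch, with the class shape and 
names being my own for illustration:

```java
// Sketch of deferring the Adler-32 reduction for byte-at-a-time use.
// update() only accumulates; the modulo arithmetic and the packing into
// the 32-bit "Adler" form are deferred to getValue().  Illustrative only.
public class LazyAdler32 {
    private static final int MOD = 65521;
    private static final int NMAX = 5552; // zlib's bound before sums must be reduced

    private long a = 1, b = 0;
    private int count = 0;

    public void update(int byteVal) {
        a += byteVal & 0xff;
        b += a;
        if (++count == NMAX) {  // occasional reduction keeps the sums bounded
            a %= MOD;
            b %= MOD;
            count = 0;
        }
    }

    /** The actual "Adler" checksum is computed only here. */
    public long getValue() {
        return ((b % MOD) << 16) | (a % MOD);
    }
}
```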

4) For CRC32, use the new-ish PCLMULQDQ instruction (64-bit by 64-bit carryless 
multiply, yielding a 128-bit result) in the style described in Intel's white 
paper on using this instruction to compute CRCs.  All the constants are 
different because the CRC32 is bit-reversed from the checksums computed in 
Intel's paper, but the unrolling is the same, and the fill and drain code is 
also similar.  This is by default enabled whenever the CLMUL and AVX features 
are both present, and can be disabled at the command line with -XX:-UseCLMUL 
(or -XX:-UseAVX).
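For readers who have not met the instruction: PCLMULQDQ multiplies two 64-bit 
operands as polynomials over GF(2) (shift-and-XOR, no carries), yielding a 
128-bit product.  A pure-Java model of just the operation's semantics (purely 
illustrative; obviously nothing like how the kernel uses it):

```java
// Software model of what PCLMULQDQ computes: a 64x64 -> 128-bit carryless
// multiply, i.e. polynomial multiplication over GF(2) where addition is XOR.
// This only illustrates the semantics; the real kernel issues the instruction
// inside an unrolled loop feeding 128-bit accumulators.
public class Clmul {
    /** Returns { high 64 bits, low 64 bits } of the carryless product a*b. */
    public static long[] clmul(long a, long b) {
        long lo = 0, hi = 0;
        for (int i = 0; i < 64; i++) {
            if (((b >>> i) & 1) != 0) {
                lo ^= a << i;                     // partial product, low half
                if (i != 0) hi ^= a >>> (64 - i); // bits shifted past bit 63
            }
        }
        return new long[] { hi, lo };
    }
}
```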

There is a companion webrev that puts information about the availability of the 
PCLMULQDQ instruction in 3-operand form into a hidden property:

http://cr.openjdk.java.net/~drchase/8014362/webrev.02/


testing:

The CRC benchmark test was adjusted to also warm-up the small-stuff case.

A new test was written to exercise all the corner cases of the new Adler32 and 
CRC32 implementations; byte-at-a-time updates, byte-at-a-time updates without 
intermediate checksum queries, byte arrays, and DirectBuffers are all compared 
against each other, at a variety of sizes and alignments designed to exercise 
combinations of non-empty fill and drain for the Intel-accelerated CRC32, and 
corner cases for the fork-join code.

The new test takes just under 2 minutes on a T1; that seems like it is cutting 
it close, so I should either trim the test or give it a little more time.

Separate C unit tests (not checked in, not sure where to put them) were written 
to exhaustively test the accelerated code and research its portability to 
various compilers.


problems and justifications:

Using our currently specified compilers for building the JDK, the C access to 
the new Intel intrinsics is not available.  This is unfortunately also true of 
the assemblers paired with our currently specified compilers.
(What IS the specified compiler for building on Linux?  Gcc 4.2 on the Mac, if 
asked for a 16-byte alignment, helpfully tells me that is too big and that it 
will use 15 instead.  This might be an issue on Linux, if 4.2 is the build 
compiler version there.)  Therefore, the native code for crc32 includes:

1) the C code with intrinsics for the accelerated CRC32 kernel (compilable with 
gcc 4.6 or clang, and almost compilable with Sun Studio 12.3; it parses and 
optimizes, but triggers a crash in the back end, which needs to be reported).

2) the 64-bit version of the assembly language, including commented byte 
sequences for the instructions that the build-specified assemblers do not 
recognize.

3) the 32-bit version of the assembly language, including commented byte 
sequences for the instructions that the build-specified assemblers do not 
recognize.

My hope is that the assembly language is temporary, though it is only a modest 
step higher in abstraction from the C code, which is almost entirely intrinsic 
calls.  If we have to wait for the C compilers, perhaps the assemblers will all 
become modern enough to handle the instructions by their proper names.  I have 
a small tool that I wrote to automatically generate the asm statements from 
the output of lldb's x/i.  (Where does such a weird little tool belong?)

I tried to use the instruction just as a stand-alone intrinsic (that might be 
wrapped in unsafe code and called from Java) but that did not yield a 
performance gain; it only seems to work well if embedded in an unrolled loop 
with several independent pipelines of computation running, all feeding 128-bit 
accumulator registers.  That is to say, Intel wrote that white paper for a 
reason, which was to help guide people like me towards happy results.  To my 
knowledge, we don't support compilation to 128-bit-wide xmm registers in 
HotSpot, so this approach was unlikely to work.

Another possibility is to wrap the entire kernel up as a single 80-instruction 
intrinsic.  This didn't seem like a win to me, and it would require the use of 
Unsafe code to interface to it.  Even then it's not clear it's possible, 
because the kernel requires that its input be 16-byte aligned.  I don't know if 
the alignment is guaranteed by the garbage collector, and I could not find a 
way in Unsafe to even get the (temporarily valid) address of a GC-allocated 
object.  It would be helpful to know the address even if it were only probably 
valid, because for fork-join it helps to arrange splits so that they land on a 
16-byte boundary, so that the eventual serial case will (probably) run as 
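The split arithmetic itself is trivial, if a base address were somehow 
available.  A sketch under that (currently impossible, as noted) assumption, 
with a wholly hypothetical helper:

```java
// Hypothetical: choose a fork-join split point so the right-hand piece
// starts on a 16-byte boundary, ASSUMING the buffer's base address were
// knowable -- which, as discussed above, Unsafe does not let us get for a
// GC-allocated array.  Illustration only.
public class AlignedSplit {
    /** Split [from, to) near the middle so baseAddr + result is 16-aligned. */
    public static int split(long baseAddr, int from, int to) {
        int mid = from + (to - from) / 2;
        // round mid down so (baseAddr + mid) is a multiple of 16
        int aligned = mid - (int) ((baseAddr + mid) & 15);
        return aligned > from ? aligned : mid; // avoid an empty left half
    }
}
```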

Yet another possibility is to move all of "fastcrc32" (the outer fill and drain 
code written in C) into a single intrinsic, but that's 208 instructions total 
(I just looked).

Oh yeah, performance improvements.  On an Ivy Bridge laptop (2 cores, 2 
threads/core), the accelerated CRC without workstealing is 2.5x faster on large 
inputs, 2x faster at about 512 bytes.  Modest amounts of fork-join parallelism 
(only fork for work 1M or larger if CLMUL present) provide about 1.5x on top of 
that, for a 3.75x speedup. 

Fork-join benchmarks pretty well on the Sun T1, T2, T4 series of processors 
down to somewhat smaller block sizes, but for now I am conservative and use a 
serial-if-smaller-than setting of 512K.  I've observed speedups as high as 13x 
on a T1 (1(32), dr-evil), 24x on a T2+ ( 4(256), mrspock), and 8x on a T4 
(8(64), sc11d1901).

Parallel performance is a little harder to reason about on big x86 boxes (both 
Intel and AMD), so I am leaving the threshold high.  Dave Dice thought this 
might be an artifact of cores being put into a power-saving mode and being slow 
to wake (the particular benchmark I wrote would have been pessimal for this, 
since it alternated between serial and parallel phases).  The eventual speedups 
were often impressive (6x-12x) but it was unclear how many hardware threads 
(out of the 32-64 available) I was using to obtain this.  Yes, I need to plug 
this into JMH for fine-tuning.  I'm using the system fork-join pool because 
that initially seemed like the good-citizen thing to do (balance CRC/Adler 
needs against those of anyone else who might be doing work) but I am starting 
to wonder if it would make more sense to establish a small private pool with a 
bounded number of threads, so that I don't need to worry about being a good 
citizen so much.  It occurs to me, late in the game, that using big-ish units 
of work is another, different way to be a bad citizen.  (I would prefer to get 
this checked in if it represents a net improvement, and then work on the tuning 
afterwards.)

Thanks for your consideration, I know that this is large and somewhat late.  It 
took a while to get the "last" bug out.

David
