On 2013-05-19, at 12:50 PM, Alan Bateman <alan.bate...@oracle.com> wrote:

> On 16/05/2013 15:50, David Chase wrote:
>> webrev: http://cr.openjdk.java.net/~drchase/7088419/webrev.01/
>> 
>> problem: Some applications spend a surprising amount of time computing CRC32s
>> (Not sure I'm supposed to be precise on an open mailing list).  Recent Intel
>> architectures provide instructions that might be useful in addressing this.
>> 
>> See https://jbs.oracle.com/bugs/browse/JDK-7088419
>> 
>> I turned this into a general attack on performance in Adler32 and CRC32, 
>> partly because the bug report was not clear on the size of the problematic 
>> inputs.  The general approach turned out to be useful, because it's tricky 
>> to get the small-case overheads down for the accelerated-instruction version 
>> of the code.
>> 
> I took a second pass over the webrev. If we're going to have assembly in 
> CRC32.c then it makes me wonder whether this should be the place to probe if 
> CLMUL is supported. Clearly there may be other cases where the VM needs to 
> probe too, but I'm just wondering if the capability really needs to be 
> communicated via a property (at least for now). The other thing that comes to 
> mind is whether we could re-write it in Java in a way that allows C2 to 
> generate code that uses CLMUL. I realize that is probably a much bigger 
> effort, but I can imagine other cases (like crypto) where it could be useful too.

The reason for passing it through as a property is two-fold.  First, it's a 
mess of code to probe for that capability, and I'd rather not duplicate it.  
Second, for better or worse, right now we have command-line flags to disable 
the use of these hardware features, for example -XX:-UseCLMUL and -XX:-UseAES.  
Routing this through the JVM and a property means that it all goes together.
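
For concreteness, the Java side of that could look roughly like the sketch 
below; the property name and both helper methods are hypothetical, not the 
actual webrev code.

    import java.util.zip.CRC32;

    // Minimal sketch only: the property name and both helpers are
    // hypothetical.  It illustrates gating an accelerated path on a
    // capability the JVM probes once and publishes as a system property.
    public class Crc32Dispatch {
        private static final boolean USE_CLMUL =
            Boolean.getBoolean("jdk.zip.useCLMUL");   // hypothetical name

        public static int update(int crc, byte[] b, int off, int len) {
            return USE_CLMUL ? updateClmul(crc, b, off, len)
                             : updateSerial(crc, b, off, len);
        }

        // Stand-in for the native, CLMUL-based path; falls back to the
        // serial path so the sketch still runs on any machine.
        private static int updateClmul(int crc, byte[] b, int off, int len) {
            return updateSerial(crc, b, off, len);
        }

        private static int updateSerial(int crc, byte[] b, int off, int len) {
            CRC32 c = new CRC32();
            c.update(b, off, len);      // does not chain from 'crc'; a real
            return (int) c.getValue();  // version would continue the running CRC
        }
    }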

I think it would be a Big Deal to get this working through the compiler in any 
quick way, but it's being discussed for the future.  I tried using the Magic 
Instruction in simpler idioms, and it was slower, not faster.  We also need to 
be certain of alignment, and that's not possible in the current Unsafe 
interface (the code that you see there is making a best-effort guess).
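
For what it's worth, the best-effort guess can only reason about offsets 
relative to the array base, because pure Java offers no way to learn an 
array's absolute address.  A rough sketch of that idea (not the webrev code):

    import sun.misc.Unsafe;

    // Rough sketch, not the webrev code: guess how many leading bytes to
    // feed to the scalar loop so that the remainder starts on a 16-byte
    // boundary.  The guess is relative to the array base offset only.
    class AlignGuess {
        private static final Unsafe U = Unsafe.getUnsafe(); // boot class path only
        private static final int BASE = U.arrayBaseOffset(byte[].class);

        static int scalarPrefix(int off) {
            int start = BASE + off;
            return (16 - (start & 15)) & 15;  // bytes to the next 16-byte boundary
        }
    }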

Also, the reason the assembly language is there is not because it is 
technically necessary, but because we get our build compilers (at least for 
now; I think that will change soon) from the software museum.  I compiled the 
C version of the algorithm with clang, with gcc 4.6, and (almost) with Solaris 
Studio 12.3.  Visual Studio 12, the one we're not using yet, also claims to 
support those intrinsics, though I have not tested it.

> On using Unsafe to get the alignment: you can use Unsafe.getUnsafe() rather 
> than reflection.

Okay, I'll get to that.
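
(For the record, the change being suggested is roughly the difference below; 
illustrative only, not the actual patch.)

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    class UnsafeAccess {
        // Current form: reflective lookup of the singleton, works from any class.
        static Unsafe viaReflection() throws ReflectiveOperationException {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        }

        // Suggested form: a direct call, permitted because java.util.zip.CRC32
        // is loaded by the bootstrap class loader.
        static Unsafe direct() {
            return Unsafe.getUnsafe();
        }
    }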

> On my comment about whether a parallelUpdate method has been considered: my 
> personal view is that this would make it obvious when looking at the code. 
> That is, have the long-standing update methods be serial by default and 
> introduce new parallelUpdate methods. For existing code you could have the 
> system property to opt in to being parallel when the input length is over the 
> threshold (and that should be okay to backport to jdk7u if needed).
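
For concreteness, as I read it the proposal amounts to something like this 
sketch (method names and structure illustrative only):

    import java.util.zip.CRC32;

    // Illustration of the proposed API split, not anybody's actual patch.
    class ChecksumProposalSketch {
        private final CRC32 crc = new CRC32();

        // Long-standing method: stays serial by default.
        void update(byte[] b, int off, int len) {
            crc.update(b, off, len);
        }

        // New method that callers opt into, in source code, for large inputs.
        void parallelUpdate(byte[] b, int off, int len) {
            // would hand large inputs to a fork-join computation,
            // falling back to the serial path below some threshold
            crc.update(b, off, len);
        }
    }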

I don't like this approach for several reasons.

First, we're not done finding places where fork-join parallelism can make 
things go faster.  If, each time we find a new one, we must split it into 
parallel and serial versions, we're going to get a tiresome proliferation of 
interfaces.  We'll need to document and test each pair, and programmers will 
need to spend time choosing "the right one" for each application.  This will 
be another opportunity to split application source bases into "old" and "new" 
-- these chances are always there, but why add more?

Second, this doesn't actually address the bug.  The work was done in response 
to a bug report: people want CRC32 to go faster in various places, some of 
them open source.  They were not looking for a different, faster CRC; they 
were looking for the same CRC to go faster.  They don't want to edit their 
source code or split their source base, and as we all know, Java doesn't have 
#ifdef.

Third, I've done a fair amount of benchmarking: one set of runs with 
"unlimited" fork-join recursing down to relatively small task sizes, the other 
with fork-join capped at 4 threads (or in one case, 2 threads) of parallelism.  
Across a range of inputs and block sizes I checked the efficiency, meaning how 
close the measured speedup comes to the ideal speedup (2x or 4x); for example, 
a 3.2x speedup on 4 threads is 80% efficiency.  For 4M or larger inputs, 
across a range of machines, with parallelism capped at 2 (laptop, and 
single-split fork-joins) or 4, efficiency never dropped below 75%.  The 
machines ranged from a Core i5 laptop, to an old T1000, to various Intel 
boxes, to a good-sized T4.

Out of 216 runs (9 machines, inputs 16M/8M/4M, task sizes 32K to 8M):
- 10 runs had efficiency 75% <= eff < 80%
- 52 runs had 80% <= eff < 90%
- 139 runs had 90% <= eff < 110%
- 15 runs had superlinear speedup of 110% or better "efficiency" (I checked
  for noisy data; it was not noisy).

We can pick a minimum-parallel size that will pretty much assure no 
inefficient surprises (I think it is 4 megabytes, though once committed to 
fork-join a good minimum task size looks like 128K), and there's a knob for 
controlling fork-join parallelism for people in an environment where they 
notice these momentary surprises and care (a T1000/Niagara does about 9 serial 
16M CRC32s per second, so it's not a long-lived blip).  If necessary/tasteful, 
we can add a knob for people who want more parallelism than that.
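
As a structural sketch (only) of what the thresholds and the split look like, 
assuming a crc32_combine-style operation that java.util.zip does not currently 
expose:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;
    import java.util.zip.CRC32;

    // Structural sketch only: the combine step (zlib's crc32_combine) is not
    // in the public API, so it appears here as a placeholder.  The thresholds
    // match the numbers above: stay serial below ~4 megabytes, and don't fork
    // chunks smaller than ~128K.
    class ParallelCrc32Sketch {
        static final int MIN_PARALLEL = 4 << 20;    // 4 megabytes
        static final int MIN_TASK     = 128 << 10;  // 128K

        static long checksum(byte[] buf) {
            if (buf.length < MIN_PARALLEL) {
                CRC32 c = new CRC32();
                c.update(buf, 0, buf.length);
                return c.getValue();
            }
            return ForkJoinPool.commonPool().invoke(new Task(buf, 0, buf.length));
        }

        static final class Task extends RecursiveTask<Long> {
            final byte[] buf; final int off, len;
            Task(byte[] buf, int off, int len) { this.buf = buf; this.off = off; this.len = len; }

            @Override protected Long compute() {
                if (len <= MIN_TASK) {
                    CRC32 c = new CRC32();
                    c.update(buf, off, len);
                    return c.getValue();
                }
                int half = len / 2;
                Task right = new Task(buf, off + half, len - half);
                right.fork();                                // tail half runs asynchronously
                long leftCrc = new Task(buf, off, half).compute();
                return combine(leftCrc, right.join(), len - half);
            }
        }

        // Placeholder for a crc32_combine-style operation; intentionally unimplemented.
        static long combine(long crc1, long crc2, int len2) {
            throw new UnsupportedOperationException("crc32_combine not shown");
        }
    }

If the common pool is used, its size can be capped from the command line with 
-Djava.util.concurrent.ForkJoinPool.common.parallelism=N, which is one form 
such a knob could take.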

If it's appropriate to put the benchmarks (PDF) in a public place, I can do 
that.

Fourth, I think there's also an element of needing to lead by example.  If we 
treat fork-join parallelism as something so risky and potentially 
destabilizing that parallelized algorithms deserve their own interface, what 
will other people think?  I've got plenty of concerns about efficient use of 
processors, but I also checked what happens if the fork-join pool is 
throttled, and it works pretty well.

David
