On 20/05/2013 02:24, David Chase wrote:

I don't like this approach for several reasons.

First, we're not done finding places where fork-join parallelism can make things go faster.  If, each time
we find a new one, we must split it into parallel and serial versions, we're going to get a tiresome
proliferation of interfaces.  We'll need to document, test, etc., and programmers will need to spend time
choosing "the right one" for each application.  This would also be another opportunity to split
application source bases into "old" and "new" -- such chances are always there, but why
add more?

Second, this doesn't actually address the bug.  This work was done in response to a bug report:
people want CRC32 to go faster in various places, some of them open source.  They were
not looking for a different, faster CRC; they were looking for the same CRC to
go faster.  They don't want to edit their source code or split their source
base, and as we all know, Java doesn't have #ifdef.

Third, I've done a fair amount of benchmarking, one set with "unlimited" fork-join
running down to relatively small task sizes, the other with fork-join capped at 4 threads
(or in one case, 2 threads) of parallelism.  Across a range of inputs and block sizes I
checked the efficiency, meaning the fraction of ideal speedup (2x or 4x) actually achieved.  For 4M or
larger inputs, across a range of machines, with parallelism capped at 2 (laptop, and
single-split fork-joins) or 4, efficiency never dropped below 75%.  The machines ranged
from a Core i5 laptop, to an old T1000, to various Intel boxes, to a good-sized T4.
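
To make those percentages concrete (reading efficiency as achieved speedup divided by the
ideal speedup): with parallelism capped at 4, a run that finishes 3.2x faster than the serial
version scores 3.2/4 = 80%, a full 4x scores 100%, and anything beyond the cap shows up as
superlinear.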

Out of 216 runs (9 machines, inputs 16M/8M/4M, task sizes 32K to 8M):
  10 runs had 75% <= eff < 80%
  52 runs had 80% <= eff < 90%
 139 runs had 90% <= eff < 110%
  15 runs had superlinear speedup, i.e. "efficiency" of 110% or better (I checked for
noisy data; it was not noisy).

We can pick a minimum-parallel size that will pretty much assure no inefficient 
surprises (I think it is 4 megabytes, but once committed to FJ, it looks like a 
good minimum task size is 128k), and there's a knob for controlling fork-join 
parallelism if people are in an environment where they noticed these momentary 
surprises and care (a T-1000/Niagara does about 9 serial 16M CRC32s per second, 
so it's not a long-lived blip).  If necessary/tasteful, we can add a knob for 
people who want more parallelism than that.
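
To make the shape of that concrete, here is a rough sketch of how those thresholds might be
wired; the property names, the class, and the combineCrc helper are illustrative only (notably,
merging two independently computed CRC32 values needs a zlib-style crc32_combine step, which
java.util.zip does not expose publicly, so it is stubbed out here):

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;
    import java.util.zip.CRC32;

    class ParallelCrcSketch {
        // Hypothetical knobs; names and defaults are illustrative only.
        static final int MIN_PARALLEL_SIZE =
            Integer.getInteger("crc32.minParallelSize", 4 * 1024 * 1024);
        static final int MIN_TASK_SIZE =
            Integer.getInteger("crc32.minTaskSize", 128 * 1024);

        static long crc32(byte[] b, int off, int len) {
            if (len < MIN_PARALLEL_SIZE) {
                CRC32 crc = new CRC32();       // small inputs stay serial
                crc.update(b, off, len);
                return crc.getValue();
            }
            return ForkJoinPool.commonPool().invoke(new CrcTask(b, off, len));
        }

        static class CrcTask extends RecursiveTask<Long> {
            final byte[] b; final int off, len;
            CrcTask(byte[] b, int off, int len) { this.b = b; this.off = off; this.len = len; }
            protected Long compute() {
                if (len <= MIN_TASK_SIZE) {
                    CRC32 crc = new CRC32();   // leaf task: plain serial CRC32
                    crc.update(b, off, len);
                    return crc.getValue();
                }
                int half = len / 2;
                CrcTask left  = new CrcTask(b, off, half);
                CrcTask right = new CrcTask(b, off + half, len - half);
                right.fork();                  // run the right half asynchronously
                long leftCrc  = left.compute();
                long rightCrc = right.join();
                return combineCrc(leftCrc, rightCrc, len - half);
            }
        }

        // Placeholder for a zlib-style crc32_combine (carry the first CRC across
        // the length of the second region); omitted here for brevity.
        static long combineCrc(long crc1, long crc2, int len2) {
            throw new UnsupportedOperationException("illustrative stub");
        }
    }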

If it's appropriate to put the benchmarks (PDF) in a public place, I can do 
that.

Fourth, I think there's an element of needing to lead by example here.  If we
treat fork/join parallelism as something so risky and potentially
destabilizing that parallelized algorithms deserve their own interface, then
what will other people think?  I've got plenty of concerns about efficient use
of processors, but I also checked what happens if the fork-join pool is
throttled, and it works pretty well.
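
For anyone who wants to reproduce the throttling experiment, capping fork-join parallelism
only needs the standard JDK machinery (the trivial task below is just a stand-in for the
real CRC work):

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    class ThrottleDemo {
        public static void main(String[] args) {
            // Option 1, whole JVM: cap the common pool that FJ tasks use by default:
            //   java -Djava.util.concurrent.ForkJoinPool.common.parallelism=4 ...

            // Option 2: run the work in an explicitly sized pool.
            ForkJoinPool pool = new ForkJoinPool(4);        // at most 4 worker threads
            int answer = pool.invoke(new RecursiveTask<Integer>() {
                protected Integer compute() { return 42; }  // stand-in for the real CRC task
            });
            pool.shutdown();
            System.out.println(answer);
        }
    }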

David

I think we need to get more experience with parallel operations before considering changing the default behavior of long-standing methods. This is why I am suggesting this should be opt-in, meaning you run with something like -Djdk.enableParallelCRC32Update=true to have the existing methods use FJ. Having it opt-in rather than opt-out would also reduce concerns if this is proposed for back-porting to jdk7u. I don't have an opinion as to whether other tuning knobs are required.
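
A minimal sketch of what that opt-in check might look like inside update(); only the property
name comes from the suggestion above, and the field, helper methods and threshold are
illustrative:

    // Sketch only; updateSerial/updateParallel and MIN_PARALLEL_SIZE are hypothetical.
    private static final boolean PARALLEL_UPDATE =
        Boolean.getBoolean("jdk.enableParallelCRC32Update");   // defaults to false: opt-in

    public void update(byte[] b, int off, int len) {
        if (PARALLEL_UPDATE && len >= MIN_PARALLEL_SIZE) {
            updateParallel(b, off, len);   // fork-join path, only when explicitly enabled
        } else {
            updateSerial(b, off, len);     // existing single-threaded behavior
        }
    }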

At this point, we have Arrays.parallelSort, and the Streams API defines the parallel() method to get a stream that is parallel. Having the word "parallel" in the code means it is clear and obvious when reading the code (no surprises). Maybe going forward this will be unnecessary, meaning it will be transparent. For now though, I think we should at least consider adding parallelUpdate methods.
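
For comparison, the existing explicitly-parallel spellings already read like this at the call
site, and a parallelUpdate method would read the same way (the CRC32 call shown in the last
comment is the proposed shape, not an existing API):

    import java.util.Arrays;

    class ParallelSpellings {
        public static void main(String[] args) {
            int[] a = { 5, 3, 1, 4, 2 };
            Arrays.parallelSort(a);                       // "parallel" is visible in the name

            int sum = Arrays.asList(1, 2, 3, 4, 5).stream()
                            .parallel()                   // explicit opt-in on the stream
                            .mapToInt(Integer::intValue)
                            .sum();
            System.out.println(Arrays.toString(a) + " " + sum);
        }
    }

    // Proposed analogue (hypothetical signature):  crc.parallelUpdate(buf, 0, buf.length);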

-Alan.

