On 20/05/2013 02:24, David Chase wrote:
:
I don't like this approach for several reasons.
First, we're not done finding places where fork-join parallelism can make things go faster. If, each time we
find a new one, we must split it into parallel and serial versions, we're going to get a tiresome
proliferation of interfaces. We'll need to document and test each of them, and programmers will need to spend time
choosing "the right one" for each application. This will be another opportunity to split
application source bases into "old" and "new" -- these chances are always there, but why
add more?
Second, this doesn't actually address the bug. This was done for a bug: they
want CRC32 to go faster in various places, some of them open source. They were
not looking for a different, faster CRC; they were looking for the same CRC to
go faster. They don't want to edit their source code or split their source
base, and as we all know, Java doesn't have #ifdef.
Third, I've done a fair amount of benchmarking in two configurations: one with "unlimited" fork-join
running down to relatively small task sizes, the other with fork-join capped at 4 threads
(or in one case, 2 threads) of parallelism. Across a range of inputs and block sizes I
checked the efficiency, meaning how close the measured speedup came to the ideal (2x or 4x). For inputs of 4M or
larger, across a range of machines, with parallelism capped at 2 (laptop, and
single-split fork-joins) or 4, efficiency never dropped below 75%. The machines ranged
from a Core i5 laptop, to an old T1000, to various Intel boxes, to a good-sized T4.
Out of 216 runs (9 machines, inputs 16M/8M/4M, task sizes 32K to 8M):
   10 runs had efficiency 75% <= eff < 80%
   52 runs had efficiency 80% <= eff < 90%
  139 runs had efficiency 90% <= eff < 110%
   15 runs had superlinear speedup, 110% or better "efficiency" (I checked for
      noisy data; it was not noisy).
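As a minimal sketch of how such figures can be computed, assuming "efficiency" here means measured speedup divided by the thread cap (2 or 4): the class and method names below are hypothetical, and the bucket boundaries are simply the ones used in the tally above.

```java
// Sketch only: names are hypothetical, not taken from the benchmark harness.
final class EfficiencySketch {

    // Efficiency = (serial time / parallel time) / number of threads.
    // 1.0 means ideal speedup; above 1.0 is superlinear.
    static double efficiency(double serialSeconds, double parallelSeconds, int threads) {
        double speedup = serialSeconds / parallelSeconds;
        return speedup / threads;
    }

    // Maps an efficiency value onto the ranges used in the tally.
    static String bucket(double eff) {
        if (eff < 0.75) return "below 75%";
        if (eff < 0.80) return "75% <= eff < 80%";
        if (eff < 0.90) return "80% <= eff < 90%";
        if (eff < 1.10) return "90% <= eff < 110%";
        return "110% or better";
    }
}
```

For example, a made-up run that takes 1.0 s serially and 0.3 s on 4 threads has a speedup of about 3.33x, so its efficiency is about 83% and it lands in the 80% to 90% range.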
We can pick a minimum-parallel size that will pretty much assure no inefficient
surprises (I think it is 4 megabytes, but once committed to FJ, it looks like a
good minimum task size is 128k), and there's a knob for controlling fork-join
parallelism if people are in an environment where they notice these momentary
surprises and care (a T1000/Niagara does about 9 serial 16M CRC32s per second,
so it's not a long-lived blip). If necessary/tasteful, we can add a knob for
people who want more parallelism than that.
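The two thresholds can be illustrated with a rough sketch. This is not the patch under discussion: the class, method, and constant names are hypothetical, the 4 MB and 128 KB values are just the tentative numbers from the paragraph above, and the way partial CRCs are merged (a port of the GF(2) matrix technique used by zlib's crc32_combine) is an assumption, since the message does not say how the real change combines them.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.zip.CRC32;

// Sketch only: names and thresholds are illustrative assumptions.
final class ParallelCrcSketch {
    static final int MIN_PARALLEL = 4 << 20; // 4 MB: below this, stay serial
    static final int MIN_TASK = 128 << 10;   // 128 KB: smallest piece worth forking

    static long serial(byte[] b, int off, int len) {
        CRC32 crc = new CRC32();
        crc.update(b, off, len);
        return crc.getValue();
    }

    // Multiplies a 32x32 GF(2) matrix (one int per column) by a 32-bit vector.
    static int times(int[] mat, int vec) {
        int sum = 0;
        for (int i = 0; vec != 0; vec >>>= 1, i++) {
            if ((vec & 1) != 0) sum ^= mat[i];
        }
        return sum;
    }

    static void square(int[] dst, int[] src) {
        for (int n = 0; n < 32; n++) dst[n] = times(src, src[n]);
    }

    // CRC of A followed by B, given crc(A), crc(B), and B's length in bytes.
    static long combine(long crcA, long crcB, long lenB) {
        if (lenB <= 0) return crcA;
        int[] even = new int[32];
        int[] odd = new int[32];
        odd[0] = 0xEDB88320;                 // reflected CRC-32 polynomial
        for (int n = 1; n < 32; n++) odd[n] = 1 << (n - 1);
        square(even, odd);                   // operator for two zero bits
        square(odd, even);                   // operator for four zero bits
        int crc = (int) crcA;
        do {
            square(even, odd);               // first pass: one zero byte
            if ((lenB & 1) != 0) crc = times(even, crc);
            lenB >>= 1;
            if (lenB == 0) break;
            square(odd, even);
            if ((lenB & 1) != 0) crc = times(odd, crc);
            lenB >>= 1;
        } while (lenB != 0);
        return (crc ^ (int) crcB) & 0xFFFFFFFFL;
    }

    static final class CrcTask extends RecursiveTask<Long> {
        private final byte[] b;
        private final int off, len;

        CrcTask(byte[] b, int off, int len) { this.b = b; this.off = off; this.len = len; }

        @Override protected Long compute() {
            // Do not split if either half would fall below the minimum task size.
            if (len < 2 * MIN_TASK) return serial(b, off, len);
            int half = len / 2;
            CrcTask right = new CrcTask(b, off + half, len - half);
            right.fork();
            long left = new CrcTask(b, off, half).compute();
            return combine(left, right.join(), len - half);
        }
    }

    // Same result as a serial CRC32; fork-join is used only for large inputs.
    static long crc32(byte[] b) {
        if (b.length < MIN_PARALLEL) return serial(b, 0, b.length);
        return ForkJoinPool.commonPool().invoke(new CrcTask(b, 0, b.length));
    }
}
```

The sketch runs on the common pool (Java 8); the message does not name the knob it refers to, so none is modelled here. The point is only that callers see the same CRC value either way, with the size check deciding which path runs.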
If it's appropriate to put the benchmarks (PDF) in a public place, I can do
that.
Fourth, I think we actually need to lead by example a bit here. If we
treat fork/join parallelism as something that is so risky and potentially
destabilizing that parallelized algorithms deserve their own interface, then
what will other people think? I've got plenty of concerns about efficient use
of processors, but I also checked what happens if the fork-join pool is
throttled, and it works pretty well.
David
I think we need to get more experience with parallel operations before
considering changing the default behavior of long-standing methods. This
is why I am suggesting this should be opt-in, meaning you run with
something like -Djdk.enableParallelCRC32Update=true to have the existing
methods use FJ. Having it opt-in rather than opt-out would also reduce
concerns if this is proposed to be back-ported to jdk7u. I don't have an
opinion as to whether other tuning knobs are required.
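A minimal sketch of what such an opt-in check might look like, assuming the property name suggested above. The class and method names are hypothetical, and a real implementation would more likely read the property once at class initialization than on every call.

```java
// Sketch only: class and method names are hypothetical.
final class ParallelCrcOptIn {
    static final String PROPERTY = "jdk.enableParallelCRC32Update";

    // True only when the system property is set to "true" (case-insensitive);
    // an unset property means the existing serial behavior is kept.
    static boolean enabled() {
        return Boolean.getBoolean(PROPERTY);
    }

    // Picks a code path for an update of the given length.
    static String chosenPath(int length, int minParallel) {
        return (enabled() && length >= minParallel) ? "fork-join" : "serial";
    }
}
```

With this shape, running without the flag leaves existing callers on the serial path, which is the opt-in behavior being argued for.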
At this point, we have Arrays.parallelSort and the Streams API defines
the parallel() method to get a stream that is parallel. Having the word
"parallel" in the code means it is clear and obvious when reading the
code (no surprises). Maybe going forward this will be unnecessary,
meaning parallelism will be transparent. For now though, I think we should at
least consider adding parallelUpdate methods.
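For reference, this is roughly how the two existing APIs mentioned above read at a call site (Java 8). The wrapper class and method names are hypothetical; a parallelUpdate method would presumably follow the same naming pattern, but no such method exists today.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Sketch only: the wrapper class and its methods are hypothetical.
final class ExplicitParallelNames {

    // Arrays.parallelSort: the method name itself says the sort may use fork-join.
    static int[] sortedCopy(int[] in) {
        int[] out = in.clone();
        Arrays.parallelSort(out);
        return out;
    }

    // Streams: the parallel() call marks the pipeline as parallel.
    static long sumOfSquares(int n) {
        return IntStream.rangeClosed(1, n)
                .parallel()
                .mapToLong(i -> (long) i * i)
                .sum();
    }
}
```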
-Alan.