On Wed, 24 Nov 2021 15:20:42 GMT, kabutz <[email protected]> wrote:
>> BigInteger currently uses three different algorithms for multiply. The
>> simple quadratic algorithm, then the slightly better Karatsuba if we exceed
>> a bit count and then Toom Cook 3 once we go into the several thousands of
>> bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to
>> parallelize it. I have demonstrated this several times in conference talks.
>> In order to be consistent with other classes such as Arrays and Collection,
>> I have added a parallelMultiply() method. Internally we have added a
>> parameter to the private multiply method to indicate whether the calculation
>> should be done in parallel.
>>
>> The performance improvements are as should be expected. Fibonacci of 100
>> million (using a single-threaded Dijkstra's sum of squares version)
>> completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with
>> the sequential multiply() method. This is on my 1-8-2 laptop. The final
>> multiplications are with very large numbers, which then benefit from the
>> parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084-bit number.
>>
>> We have also parallelized the private square() method. Internally, the
>> square() method defaults to sequential execution.
>>
>> Some benchmark results, run on my 1-6-2 server:
>>
>>
>> Benchmark                                          (n)  Mode  Cnt      Score       Error  Units
>> BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ± 1899.444  ms/op
>>
>>
>> We can see that for larger calculations (fib 100m), the execution is 2.7x
>> faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for
>> small (fib 1m) it is roughly the same. Considering that the Fibonacci
>> algorithm that we used was itself sequential, and that the last three
>> calculations would dominate, 2.7x faster should probably be considered
>> quite good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit
> since the last revision:
>
> Made forkOrInvoke() method protected to avoid strange compiler error
IIRC you are not overly concerned about the additional object creation of
`RecursiveOp` instances?
If that changes, the operation method could return `Object` and choose either
to perform the operation directly, returning the `BigInteger` result, or to
return the particular `RecursiveOp`:

    private static Object multiply(BigInteger a, BigInteger b, boolean parallel, int depth) {
        if (isParallel(parallel, depth)) {
            return new RecursiveMultiply(a, b, parallel, depth).fork();
        } else {
            // Also called by RecursiveMultiply.compute()
            return RecursiveMultiply.compute(a, b, false, depth);
        }
    }
Then we could have another method on `RecursiveOp` that pattern matches, e.g.:

    static BigInteger join(Object o) {
        // replace with a pattern-matching switch when it exits preview
        if (o instanceof BigInteger b) {
            return b;
        } else if (o instanceof RecursiveOp r) {
            return r.join();
        } else {
            throw new InternalError("Cannot reach here");
        }
    }
That seems simple enough that it might be worth doing anyway.
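For illustration, a Toom-Cook call site could then look roughly like this (just a sketch; the operands come from the existing code and the local names are placeholders):

    // Each call either computes inline (returning a BigInteger) or forks
    // (returning a RecursiveOp), depending on isParallel(parallel, depth).
    var vm1 = RecursiveOp.multiply(da1.subtract(a1), db1.subtract(b1), parallel, depth);
    var v2  = RecursiveOp.multiply(a2, b2, parallel, depth);

    // join(...) flattens both cases back to a BigInteger before combining.
    BigInteger vm1r = RecursiveOp.join(vm1);
    BigInteger v2r  = RecursiveOp.join(v2);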
src/java.base/share/classes/java/math/BigInteger.java line 1874:
> 1872: */
> 1873: private static final int PARALLEL_FORK_THRESHOLD =
> Integer.getInteger(
> 1874: "java.math.BigInteger.parallelForkThreshold",
I suspect we don't need this system property; there is no such property for
streams. We use `ForkJoinPool.getCommonPoolParallelism()`, which is
configurable as an escape hatch.
src/java.base/share/classes/java/math/BigInteger.java line 1875:
> 1873: private static final int PARALLEL_FORK_THRESHOLD =
> Integer.getInteger(
> 1874: "java.math.BigInteger.parallelForkThreshold",
> 1875: (int)
> Math.ceil(Math.log(Runtime.getRuntime().availableProcessors()) /
> Math.log(2)));
We can simplify this to `32 -
Integer.numberOfLeadingZeros(ForkJoinPool.getCommonPoolParallelism() - 1)`.
`ForkJoinPool.getCommonPoolParallelism()` is guaranteed to return a value `> 0`.
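Put together, the declaration could then be reduced to something like this (a sketch, assuming the system property is dropped):

    // ceil(log2(parallelism)) of the common pool; 0 when parallelism is 1,
    // since getCommonPoolParallelism() always returns a value >= 1.
    private static final int PARALLEL_FORK_THRESHOLD =
            32 - Integer.numberOfLeadingZeros(
                    ForkJoinPool.getCommonPoolParallelism() - 1);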
src/java.base/share/classes/java/math/BigInteger.java line 1878:
> 1876:
> 1877: @Serial
> 1878: private static final long serialVersionUID = 0L;
I don't think you need to declare these; the instances are never intended to
support serialization. For example, in the stream implementation,
`AbstractTask` (which extends `CountedCompleter`) states:
* <p>Serialization is not supported as there is no intention to serialize
* tasks managed by stream ops.
src/java.base/share/classes/java/math/BigInteger.java line 1909:
> 1907: @Override
> 1908: public BigInteger compute() {
> 1909: return a.multiply(b, true, super.parallel, super.depth);
Using `super.` is a little unusual; it's not wrong :-) but it's not really
idiomatic in the source code base.
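In other words, plain field references (they are inherited from `RecursiveOp`) would read a little more naturally, e.g.:

    @Override
    public BigInteger compute() {
        return a.multiply(b, true, parallel, depth);
    }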
src/java.base/share/classes/java/math/BigInteger.java line 1930:
> 1928: }
> 1929:
> 1930: private static RecursiveTask<BigInteger> exec(RecursiveOp op) {
Unused.
src/java.base/share/classes/java/math/BigInteger.java line 2000:
> 1998: da1 = a2.add(a0);
> 1999: db1 = b2.add(b0);
> 2000: var vm1_task = RecursiveOp.multiply(da1.subtract(a1),
> db1.subtract(b1), parallel, depth + 1);
I recommend incrementing the depth in the `RecursiveOp` constructor, thereby
reducing the repetition.
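For example, the base constructor could take care of it (a sketch, assuming `parallel` and `depth` are fields of `RecursiveOp`):

    private RecursiveOp(boolean parallel, int depth) {
        this.parallel = parallel;
        // Each RecursiveOp is one level deeper in the recursion, so bump
        // the depth here rather than at every call site.
        this.depth = depth + 1;
    }

The call sites can then pass the current `depth` unchanged, e.g. `RecursiveOp.multiply(da1.subtract(a1), db1.subtract(b1), parallel, depth)`.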
-------------
PR: https://git.openjdk.java.net/jdk/pull/6409