On Wed, 24 Nov 2021 15:20:42 GMT, kabutz <d...@openjdk.java.net> wrote:
>> BigInteger currently uses three different algorithms for multiply: the
>> simple quadratic algorithm, then the slightly better Karatsuba if we exceed
>> a bit count, and then Toom Cook 3 once we go into the several thousands of
>> bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to
>> parallelize it. I have demonstrated this several times in conference talks.
>> In order to be consistent with other classes such as Arrays and Collection,
>> I have added a parallelMultiply() method. Internally we have added a
>> parameter to the private multiply method to indicate whether the calculation
>> should be done in parallel.
>>
>> The performance improvements are as should be expected. Fibonacci of 100
>> million (using a single-threaded Dijkstra's sum of squares version)
>> completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with
>> the sequential multiply() method. This is on my 1-8-2 laptop. The final
>> multiplications are with very large numbers, which then benefit from the
>> parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>>
>> We have also parallelized the private square() method. Internally, the
>> square() method defaults to be sequential.
>>
>> Some benchmark results, run on my 1-6-2 server:
>>
>> Benchmark                                          (n)  Mode  Cnt      Score       Error  Units
>> BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±    11.194  ms/op
>> BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±   235.977  ms/op
>> BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ±  1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±    26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±   268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ±  1899.444  ms/op
>>
>> We can see that for larger calculations (fib 100m), the execution is 2.7x
>> faster in parallel. For medium size (fib 10m) it is 1.873x faster.
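(Aside: the "Dijkstra's sum of squares" Fibonacci mentioned above can be sketched with a fast-doubling recursion like the one below. This is a hypothetical standalone sketch, not the exact harness used for the timings; it calls the existing `multiply()`, and the PR's proposed `parallelMultiply()` would slot into the same call sites, where the last few huge multiplications dominate the run time.)

```java
import java.math.BigInteger;

public class Fib {
    // Fast-doubling Fibonacci: returns { F(n), F(n+1) }.
    // The final multiplications involve the largest operands, which is
    // why parallelizing BigInteger multiply pays off for huge n.
    static BigInteger[] fib(long n) {
        if (n == 0) {
            return new BigInteger[] { BigInteger.ZERO, BigInteger.ONE };
        }
        BigInteger[] ab = fib(n / 2);
        BigInteger a = ab[0], b = ab[1];                        // F(k), F(k+1)
        // F(2k)   = F(k) * (2*F(k+1) - F(k))
        BigInteger c = a.multiply(b.shiftLeft(1).subtract(a));
        // F(2k+1) = F(k)^2 + F(k+1)^2
        BigInteger d = a.multiply(a).add(b.multiply(b));
        return (n % 2 == 0) ? new BigInteger[] { c, d }
                            : new BigInteger[] { d, c.add(d) };
    }

    public static void main(String[] args) {
        System.out.println(fib(100)[0]); // F(100) = 354224848179261915075
    }
}
```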
>> And for small (fib 1m) it is roughly the same. Considering that the
>> fibonacci algorithm that we used was in itself sequential, and that the
>> last 3 calculations would dominate, 2.7x faster should probably be
>> considered quite good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit
> since the last revision:
>
>   Made forkOrInvoke() method protected to avoid strange compiler error

IIRC you are not overly concerned about the additional object creation of `RecursiveOp` instances? If that changes, the operation method could return `Object` and choose either to perform the operation directly, returning the `BigInteger` result, or to return the particular `RecursiveOp`:

    private static Object multiply(BigInteger a, BigInteger b, boolean parallel, int depth) {
        if (isParallel(parallel, depth)) {
            return new RecursiveMultiply(a, b, parallel, depth).fork();
        } else {
            // Also called by RecursiveMultiply.compute()
            return RecursiveMultiply.compute(a, b, false, depth);
        }
    }

Then we could have another method on `RecursiveOp` that pattern matches, e.g.:

    static BigInteger join(Object o) {
        // replace with pattern match switch when it exits preview
        if (o instanceof BigInteger b) {
            return b;
        } else if (o instanceof RecursiveOp r) {
            return r.join();
        } else {
            throw new InternalError("Cannot reach here");
        }
    }

That seems simple enough it might be worth doing anyway.

src/java.base/share/classes/java/math/BigInteger.java line 1874:

> 1872: */
> 1873: private static final int PARALLEL_FORK_THRESHOLD = Integer.getInteger(
> 1874:         "java.math.BigInteger.parallelForkThreshold",

I suspect we don't need this system property; there is no such property for streams. We use `ForkJoinPool.getCommonPoolParallelism()`, which is configurable as an escape hatch.
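(Aside: the `Object`-returning trick sketched above can be exercised outside `BigInteger`. Below is a hedged, self-contained toy — the class and method names `OpDemo`, `RangeProduct`, `maybeFork` are made up, not from the PR — that forks a `RecursiveTask` only while the depth is below a threshold, and joins via `instanceof`, mirroring the suggested `join(Object)` shape.)

```java
import java.math.BigInteger;
import java.util.concurrent.RecursiveTask;

public class OpDemo {
    static final int FORK_DEPTH = 3; // hypothetical threshold, cf. PARALLEL_FORK_THRESHOLD

    static final class RangeProduct extends RecursiveTask<BigInteger> {
        final long lo, hi;
        final int depth;
        RangeProduct(long lo, long hi, int depth) {
            this.lo = lo; this.hi = hi; this.depth = depth;
        }
        @Override protected BigInteger compute() {
            return product(lo, hi, depth);
        }
    }

    // Returns either a BigInteger (computed inline) or a forked RangeProduct,
    // avoiding task allocation once the depth threshold is reached.
    static Object maybeFork(long lo, long hi, int depth) {
        if (depth < FORK_DEPTH && hi - lo > 8) {
            return new RangeProduct(lo, hi, depth).fork();
        }
        return product(lo, hi, depth);
    }

    // Product of [lo, hi), split recursively; left half may run in parallel.
    static BigInteger product(long lo, long hi, int depth) {
        if (hi - lo <= 8) {
            BigInteger r = BigInteger.ONE;
            for (long i = lo; i < hi; i++) {
                r = r.multiply(BigInteger.valueOf(i));
            }
            return r;
        }
        long mid = (lo + hi) >>> 1;
        Object left = maybeFork(lo, mid, depth + 1);
        BigInteger right = product(mid, hi, depth + 1);
        return join(left).multiply(right);
    }

    // The pattern-matching join from the comment above.
    static BigInteger join(Object o) {
        if (o instanceof BigInteger b) return b;
        if (o instanceof RangeProduct r) return r.join();
        throw new InternalError("Cannot reach here");
    }

    public static void main(String[] args) {
        System.out.println(product(1, 21, 0)); // 20! = 2432902008176640000
    }
}
```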
src/java.base/share/classes/java/math/BigInteger.java line 1875:

> 1873: private static final int PARALLEL_FORK_THRESHOLD = Integer.getInteger(
> 1874:         "java.math.BigInteger.parallelForkThreshold",
> 1875:         (int) Math.ceil(Math.log(Runtime.getRuntime().availableProcessors()) / Math.log(2)));

We can simplify to `32 - Integer.numberOfLeadingZeros(ForkJoinPool.getCommonPoolParallelism() - 1)`. `ForkJoinPool.getCommonPoolParallelism()` is guaranteed to return a value `> 0`.

src/java.base/share/classes/java/math/BigInteger.java line 1878:

> 1876:
> 1877: @Serial
> 1878: private static final long serialVersionUID = 0L;

I don't think you need to declare these; the instances are never intended to support serialization. For example, in the stream implementation for `AbstractTask`, which extends `CountedCompleter`, we state:

    * <p>Serialization is not supported as there is no intention to serialize
    * tasks managed by stream ops.

src/java.base/share/classes/java/math/BigInteger.java line 1909:

> 1907: @Override
> 1908: public BigInteger compute() {
> 1909:     return a.multiply(b, true, super.parallel, super.depth);

Using `super.` is a little unusual. It's not wrong :-) but not really idiomatic in the source code base.

src/java.base/share/classes/java/math/BigInteger.java line 1930:

> 1928: }
> 1929:
> 1930: private static RecursiveTask<BigInteger> exec(RecursiveOp op) {

Unused.

src/java.base/share/classes/java/math/BigInteger.java line 2000:

> 1998: da1 = a2.add(a0);
> 1999: db1 = b2.add(b0);
> 2000: var vm1_task = RecursiveOp.multiply(da1.subtract(a1), db1.subtract(b1), parallel, depth + 1);

I recommend incrementing the depth in the `RecursiveOp` constructor, thereby reducing the repetition.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6409
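(Postscript, not part of the review thread: a quick sanity check that the `numberOfLeadingZeros` expression suggested for line 1875 computes ceil(log2(p)) for parallelism values `p > 0`. The class name `Log2Check` is made up for illustration.)

```java
public class Log2Check {
    // The suggested replacement for the Math.ceil(Math.log(...)) expression:
    // for p > 0 this yields ceil(log2(p)), with ceilLog2(1) == 0.
    static int ceilLog2(int p) {
        return 32 - Integer.numberOfLeadingZeros(p - 1);
    }

    public static void main(String[] args) {
        // p = common-pool parallelism; sample a few values around powers of two
        for (int p : new int[] {1, 2, 3, 4, 5, 8, 16, 17}) {
            System.out.println(p + " -> " + ceilLog2(p));
        }
    }
}
```

Using the bit trick also avoids any floating-point rounding concerns near exact powers of two that the `Math.log` quotient could in principle hit.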