On Wed, 24 Nov 2021 15:20:42 GMT, kabutz <d...@openjdk.java.net> wrote:
>> BigInteger currently uses three different algorithms for multiply: the simple quadratic algorithm, then the slightly better Karatsuba once we exceed a bit count, and then Toom-Cook 3 once we go into the several thousands of bits. Since Toom-Cook 3 is a recursive algorithm, it is trivial to parallelize it. I have demonstrated this several times in conference talks. In order to be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
>>
>> The performance improvements are as expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum of squares version) completes in 9.2 seconds with parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 laptop. The final multiplications are with very large numbers, which then benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084-bit number.
>>
>> We have also parallelized the private square() method. Internally, the square() method defaults to sequential.
>>
>> Some benchmark results, run on my 1-6-2 server:
>>
>> Benchmark                                          (n)  Mode  Cnt      Score       Error  Units
>> BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±    11.194  ms/op
>> BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±   235.977  ms/op
>> BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ±  1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±    26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±   268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ±  1899.444  ms/op
>>
>> We can see that for larger calculations (fib 100m), the execution is 2.7x faster in parallel. For medium size (fib 10m) it is 1.873x faster.
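For reference, the single-threaded Dijkstra sum-of-squares Fibonacci quoted above can be sketched roughly as follows. The class and method names are illustrative, not taken from the actual benchmark; with this patch, the two dominant multiplies per doubling step are the calls that could use parallelMultiply() instead of multiply():

```java
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of Dijkstra's sum-of-squares Fibonacci, which doubles
// the index on each step:
//   F(2k-1) = F(k)^2 + F(k-1)^2
//   F(2k)   = (2*F(k-1) + F(k)) * F(k)
// For huge n the two multiplies per step dominate, which is where a parallel
// multiply pays off. Memoization keeps the recursion from re-deriving values.
public class FibonacciSketch {
    private static final Map<Integer, BigInteger> cache = new HashMap<>();

    public static BigInteger fib(int n) {
        if (n == 0) return BigInteger.ZERO;
        if (n == 1) return BigInteger.ONE;
        BigInteger cached = cache.get(n);
        if (cached != null) return cached;
        int k = (n + 1) / 2;          // k such that n == 2k-1 (odd) or n == 2k (even)
        BigInteger fk = fib(k);
        BigInteger fk1 = fib(k - 1);
        BigInteger result = (n % 2 == 1)
                ? fk.multiply(fk).add(fk1.multiply(fk1))    // F(2k-1) = F(k)^2 + F(k-1)^2
                : fk1.shiftLeft(1).add(fk).multiply(fk);    // F(2k)   = (2*F(k-1) + F(k)) * F(k)
        cache.put(n, result);
        return result;
    }
}
```

The last few doubling steps multiply numbers close to the final bit length, which is why the parallel speedup concentrates at the end of the computation.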
>> And for small (fib 1m) it is roughly the same. Considering that the Fibonacci algorithm that we used was in itself sequential, and that the last 3 calculations would dominate, 2.7x faster should probably be considered quite good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit since the last revision:
>
>   Made forkOrInvoke() method protected to avoid strange compiler error

Based on the excellent feedback in this thread, I have added a depth threshold for parallelMultiply(). This is based on log2(processors):

    (int) Math.ceil(Math.log(Runtime.getRuntime().availableProcessors()) / Math.log(2))

and it can be overridden with `-Djava.math.BigInteger.parallelForkThreshold=num`. We can also influence this number with `-XX:ActiveProcessorCount=num`, but that would have a wider impact.

Here are the results for running the benchmark `BigIntegerParallelMultiply` out of the box on my 1-6-2 server:

Benchmark                                          (n)  Mode  Cnt      Score       Error  Units
BigIntegerParallelMultiply.multiply            1000000    ss    4     77.314 ±    35.690  ms/op
BigIntegerParallelMultiply.multiply           10000000    ss    4    938.650 ±    69.232  ms/op
BigIntegerParallelMultiply.multiply          100000000    ss    4  24811.339 ±  1457.943  ms/op
BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     67.550 ±    39.059  ms/op
BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    489.738 ±   140.965  ms/op
BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   8814.412 ±   542.567  ms/op

Setting ActiveProcessorCount to 1 means that we do not fork at all:

Benchmark                                          (n)  Mode  Cnt      Score       Error  Units
BigIntegerParallelMultiply.multiply            1000000    ss    4     72.752 ±    26.336  ms/op
BigIntegerParallelMultiply.multiply           10000000    ss    4    949.417 ±    19.686  ms/op
BigIntegerParallelMultiply.multiply          100000000    ss    4  24530.183 ±  1319.775  ms/op
BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     71.492 ±    21.860  ms/op
BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    926.025 ±    41.539  ms/op
BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4  23947.256 ±    90.076  ms/op

Similarly, when we set `-Djava.math.BigInteger.parallelForkThreshold=0` we do not fork either:

Benchmark                                          (n)  Mode  Cnt      Score       Error  Units
BigIntegerParallelMultiply.multiply            1000000    ss    4     73.121 ±    30.315  ms/op
BigIntegerParallelMultiply.multiply           10000000    ss    4    931.979 ±    34.988  ms/op
BigIntegerParallelMultiply.multiply          100000000    ss    4  24779.179 ±  5167.362  ms/op
BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     72.280 ±    32.011  ms/op
BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    957.168 ±    37.957  ms/op
BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4  25032.332 ±   994.664  ms/op

On the 1-6-2 machine, the maximum number of parallel forked tasks that we get with our Fibonacci algorithm is 468 (4 x 9 x 13). The parallel recursion depth is 4 in this case. Note that before this change we could get millions of forks, depending on the size of the numbers being multiplied. Once we get to Toom-Cook 3, we are multiplying very large numbers. Also, for convenience, we create RecursiveTasks whether we fork or invoke directly. This means that object allocation goes up slightly (less than 1%) when doing a sequential multiply() on very large numbers.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6409
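As an addendum, the depth-threshold and forkOrInvoke() ideas discussed above can be sketched on a simple recursive product tree. This is NOT the actual BigInteger patch; ProductTask, its factorial() helper, and the depth-limiting logic are illustrative. The sketch reads the same `java.math.BigInteger.parallelForkThreshold` property and defaults to ceil(log2(availableProcessors)), forking only while the recursion depth is below that threshold:

```java
import java.math.BigInteger;
import java.util.concurrent.RecursiveTask;

// Illustrative sketch (not the actual BigInteger patch): depth-limited forking
// in the style described above. We fork only while depth < ceil(log2(#CPUs)),
// so one processor (or parallelForkThreshold=0) means we never fork at all.
public class ProductTask extends RecursiveTask<BigInteger> {

    static final int PARALLEL_FORK_THRESHOLD = Integer.getInteger(
            "java.math.BigInteger.parallelForkThreshold",
            (int) Math.ceil(
                    Math.log(Runtime.getRuntime().availableProcessors())
                            / Math.log(2)));

    private final long lo, hi;  // product of lo..hi inclusive
    private final int depth;

    ProductTask(long lo, long hi, int depth) {
        this.lo = lo;
        this.hi = hi;
        this.depth = depth;
    }

    @Override
    protected BigInteger compute() {
        if (lo == hi) return BigInteger.valueOf(lo);
        long mid = (lo + hi) >>> 1;
        ProductTask left = new ProductTask(lo, mid, depth + 1);
        ProductTask right = new ProductTask(mid + 1, hi, depth + 1);
        // forkOrInvoke() in spirit: the task objects are always created, but we
        // only fork below the threshold depth; otherwise compute in this thread.
        if (depth < PARALLEL_FORK_THRESHOLD) {
            left.fork();
            BigInteger r = right.compute();
            return left.join().multiply(r);
        }
        return left.compute().multiply(right.compute());
    }

    // Convenience entry point: n! as a product tree rooted at depth 0.
    public static BigInteger factorial(int n) {
        if (n <= 1) return BigInteger.ONE;
        return new ProductTask(2, n, 0).invoke();
    }
}
```

As with the patch, running with `-Djava.math.BigInteger.parallelForkThreshold=0` or `-XX:ActiveProcessorCount=1` degenerates this into a purely sequential computation.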