On Wed, 17 Nov 2021 19:48:25 GMT, kabutz <[email protected]> wrote:
>> BigInteger currently uses three different algorithms for multiply. The
>> simple quadratic algorithm, then the slightly better Karatsuba if we exceed
>> a bit count and then Toom Cook 3 once we go into the several thousands of
>> bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to
>> parallelize it. I have demonstrated this several times in conference talks.
>> In order to be consistent with other classes such as Arrays and Collection,
>> I have added a parallelMultiply() method. Internally we have added a
>> parameter to the private multiply method to indicate whether the calculation
>> should be done in parallel.
>>
>> The performance improvements are as expected. Fibonacci of 100
>> million (using a single-threaded Dijkstra sum-of-squares version)
>> completes in 9.2 seconds with parallelMultiply() vs 25.3 seconds with
>> the sequential multiply() method. This is on my 1-8-2 laptop. The final
>> multiplications are with very large numbers, which then benefit from the
>> parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>>
>> We have also parallelized the private square() method. Internally, the
>> square() method defaults to sequential execution.
>>
>> Some benchmark results, run on my 1-6-2 server:
>>
>>
>> Benchmark                                          (n)  Mode  Cnt      Score       Error  Units
>> BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±    11.194  ms/op
>> BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±   235.977  ms/op
>> BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ±  1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±    26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±   268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ±  1899.444  ms/op
>>
>>
>> We can see that for larger calculations (fib 100m), the execution is 2.7x
>> faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for
>> small (fib 1m) it is roughly the same. Considering that the Fibonacci
>> algorithm we used was in itself sequential, and that the last 3
>> calculations would dominate, 2.7x faster should probably be considered quite
>> good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit
> since the last revision:
>
> Removed JVM flags from benchmark
> I think the functionality in this PR is worth pursuing, but with the JDK 18
> rampdown 1 date fast approaching and this being a non-urgent issue, we
> shouldn't try to rush it into JDK 18.
> I looked more closely and now understand a bit more about the threshold. It
> would be useful to have some implementation notes detailing approximately
> when the parallel execution kicks in. For `a*a` it's `>=
> TOOM_COOK_SQUARE_THRESHOLD(216)` and for `a*b` it's `>=
> TOOM_COOK_THRESHOLD(240)`, so we could refer to certain bit lengths.
>
> The branching factor is 4 (well, 3 I guess, but it's easier to think in powers
> of 2). It might be reasonable to assume the problem gets split equally into 4
> parts. I don't know in practice what the depth of recursion might be; it's hard
> to see this getting completely out of control, but we could always keep track
> of the depth and cut off in proportion to the # runtime processors if need be.
>
> Given the existing degree of specialization, the minimal changes to code, and
> the fact that the creation of recursive tasks is in the noise, this PR looks
> quite reasonable.
I have run some more tests. For my fibonacci algorithm, here are the worst
cases for the various calculations.
n              most_bits    tasks      time_ms
1000           694          0          1
10_000         6_942        0          1
100_000        69_424       18         13
1_000_000      694_241      468        143
10_000_000     6_942_418    11_718     1049
100_000_000    69_424_191   292_968    13695
1_000_000_000  694_241_913  7_324_218  237282
Each data point has 10x the number of bits in the final result, and the number
of tasks in the final calculation is 25x more. Perhaps I can make the threshold
the recursive depth up to which we would run in parallel. And that recursive
depth could be based on availableProcessors() or some multiple thereof.
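As a sketch of that depth-cutoff idea (illustrative only, not the PR's code: the class name `DepthCutoff`, the constant `MAX_DEPTH`, and the toy array-sum task are all mine, standing in for the Toom-Cook recursion), we could fork only while the recursion depth stays within roughly log2 of `availableProcessors()`, so the number of forked tasks remains a small constant factor of the core count:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative sketch: cap parallelism by recursion depth rather than task count.
public class DepthCutoff {
    // With branching factor 2, depth d yields about 2^d tasks, so stopping at
    // roughly log2(#processors) + 1 keeps about 2 tasks per core at the cutoff.
    static final int MAX_DEPTH =
        32 - Integer.numberOfLeadingZeros(Runtime.getRuntime().availableProcessors()) + 1;

    static final class Sum extends RecursiveTask<Long> {
        final long[] a; final int from, to, depth;
        Sum(long[] a, int from, int to, int depth) {
            this.a = a; this.from = from; this.to = to; this.depth = depth;
        }
        @Override protected Long compute() {
            // Past the depth cutoff (or for tiny inputs) we stop forking
            // and fall back to the plain sequential loop.
            if (to - from <= 1024 || depth >= MAX_DEPTH) {
                long s = 0;
                for (int i = from; i < to; i++) s += a[i];
                return s;
            }
            int mid = (from + to) >>> 1;
            Sum left = new Sum(a, from, mid, depth + 1);
            left.fork();                                       // left half runs asynchronously
            long right = new Sum(a, mid, to, depth + 1).compute(); // right half in this thread
            return left.join() + right;
        }
    }

    public static void main(String[] args) {
        long[] a = new long[1_000_000];
        for (int i = 0; i < a.length; i++) a[i] = i;
        long sum = ForkJoinPool.commonPool().invoke(new Sum(a, 0, a.length, 0));
        System.out.println(sum); // prints 499999500000
    }
}
```

The same pattern would carry over to the multiply/square recursion: pass `depth + 1` down each recursive call and fall back to the sequential path once the cutoff is reached.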
>
> I think we can simplify the creation of the recursive tasks (assuming we
> don't track the depth):
>
> ```java
> private static final class RecursiveOp {
>     private static <T> RecursiveTask<T> exec(RecursiveTask<T> op, boolean parallel) {
>         if (parallel) {
>             op.fork();
>         } else {
>             op.invoke();
>         }
>         return op;
>     }
>
>     private static RecursiveTask<BigInteger> multiply(BigInteger a, BigInteger b, boolean parallel) {
>         var op = new RecursiveTask<BigInteger>() {
>             @Override
>             protected BigInteger compute() {
>                 return a.multiply(b, true, parallel);
>             }
>         };
>
>         return exec(op, parallel);
>     }
>
>     private static RecursiveTask<BigInteger> square(BigInteger a, boolean parallel) {
>         var op = new RecursiveTask<BigInteger>() {
>             @Override
>             protected BigInteger compute() {
>                 return a.square(true, parallel);
>             }
>         };
>
>         return exec(op, parallel);
>     }
> }
> ```
Good idea, will change that.
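For reference, here is a standalone variant of that helper that compiles outside of `BigInteger` (the real version calls the private `multiply(b, isRecursion, parallel)` overload, which isn't accessible; this sketch substitutes the public `multiply()`, and the class names `RecursiveOps`/`RecursiveOpDemo` are mine). It shows the fork-then-join-later pattern the helper enables: independent products are forked up front and joined only when their results are needed, which is how the three Toom-Cook sub-products would overlap.

```java
import java.math.BigInteger;
import java.util.concurrent.RecursiveTask;

// Standalone sketch of the proposed RecursiveOp helper, using the public
// BigInteger.multiply() so the pattern compiles on its own.
final class RecursiveOps {
    private static <T> RecursiveTask<T> exec(RecursiveTask<T> op, boolean parallel) {
        if (parallel) {
            op.fork();   // run asynchronously in the common ForkJoinPool
        } else {
            op.invoke(); // run in the calling thread; join() then returns immediately
        }
        return op;
    }

    static RecursiveTask<BigInteger> multiply(BigInteger a, BigInteger b, boolean parallel) {
        var op = new RecursiveTask<BigInteger>() {
            @Override protected BigInteger compute() { return a.multiply(b); }
        };
        return exec(op, parallel);
    }
}

public class RecursiveOpDemo {
    public static void main(String[] args) {
        BigInteger a = BigInteger.TEN.pow(10_000);
        BigInteger b = a.add(BigInteger.ONE);
        // Fork two independent products, then join both afterwards
        // (Toom-Cook 3 would fork three such sub-products).
        var p1 = RecursiveOps.multiply(a, b, true);
        var p2 = RecursiveOps.multiply(b, b, true);
        System.out.println(p1.join().equals(a.multiply(b))
                && p2.join().equals(b.multiply(b))); // prints true
    }
}
```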
-------------
PR: https://git.openjdk.java.net/jdk/pull/6409