On Wed, 17 Nov 2021 19:48:25 GMT, kabutz <d...@openjdk.java.net> wrote:

>> BigInteger currently uses three different algorithms for multiply. The 
>> simple quadratic algorithm, then the slightly better Karatsuba if we exceed 
>> a bit count and then Toom Cook 3 once we go into the several thousands of 
>> bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to 
>> parallelize it. I have demonstrated this several times in conference talks. 
>> In order to be consistent with other classes such as Arrays and Collection, 
>> I have added a parallelMultiply() method. Internally we have added a 
>> parameter to the private multiply method to indicate whether the calculation 
>> should be done in parallel.
>> 
>> The performance improvements are as should be expected. Fibonacci of 100 
>> million (using a single-threaded Dijkstra's sum of squares version) 
>> completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with 
>> the sequential multiply() method. This is on my 1-8-2 laptop. The final 
>> multiplications are with very large numbers, which then benefit from the 
>> parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>> 
>> We have also parallelized the private square() method. Internally, the 
>> square() method defaults to be sequential.
>> 
>> Some benchmark results, run on my 1-6-2 server:
>> 
>> 
>> Benchmark                                          (n)  Mode  Cnt      Score       Error  Units
>> BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ± 1899.444  ms/op
>> 
>> 
>> We can see that for larger calculations (fib 100m), the execution is 2.7x 
>> faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for 
>> small (fib 1m) it is roughly the same. Considering that the fibonacci 
>> algorithm that we used was in itself sequential, and that the last 3 
>> calculations would dominate, 2.7x faster should probably be considered quite 
>> good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit 
> since the last revision:
> 
>   Removed JVM flags from benchmark

> I think the functionality in this PR is worth pursuing, but with the JDK 18 
> rampdown 1 date fast approaching, as a non-urgent issue, I think we shouldn't 
> try to rush it into JDK 18.

> I looked more closely and now understand a bit more about the threshold. It 
> would be useful to have some implementation notes detailing approximately 
> when the parallel execution kicks in. For `a*a` it's `>= 
> TOOM_COOK_SQUARE_THRESHOLD` (216) and for `a*b` it's `>= 
> TOOM_COOK_THRESHOLD` (240), so we could refer to certain bit lengths.
> 
> The branching factor is 4 (well, 3 I guess, but it's easier to think in powers 
> of 2). It might be reasonable to assume the problem gets split equally into 4 
> parts. I don't know in practice what the depth of recursion might be; it's hard 
> to see this getting completely out of control, but we could always keep track 
> of the depth and cut off in proportion to the # runtime processors if need be.
> 
> Given the existing degree of specialization, the minimal changes to code, and 
> the fact that the creation of recursive tasks is in the noise this PR looks 
> quite reasonable.

I have run some more tests. For my fibonacci algorithm, here are the worst 
cases for the various calculations.


            n   most_bits     tasks  time_ms
         1000         694         0      1
       10_000       6_942         0      1
      100_000      69_424        18     13
    1_000_000     694_241       468    143
   10_000_000   6_942_418    11_718   1049
  100_000_000  69_424_191   292_968  13695
1_000_000_000 694_241_913 7_324_218 237282


Each data point has 10x the number of bits in the final result, and the number 
of tasks in the final calculation is 25x higher. Perhaps I can make the threshold 
the recursive depth up to which we would run in parallel, and that recursive 
depth could be availableProcessors() or some multiple thereof.
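
The depth cutoff idea above can be sketched with a plain RecursiveTask (all names here are illustrative, not the actual BigInteger internals; a range sum stands in for the real multiply workload):

```java
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: fork only while the recursion depth is below a cutoff
// derived from the number of runtime processors, then fall back to sequential
// execution in the calling thread.
public class DepthLimitedTask extends RecursiveTask<Long> {
    // e.g. log2(availableProcessors) plus a little slack; purely illustrative
    private static final int MAX_FORK_DEPTH =
            32 - Integer.numberOfLeadingZeros(
                    Runtime.getRuntime().availableProcessors()) + 2;

    private final long from, to; // sums the range [from, to) as a stand-in workload
    private final int depth;

    public DepthLimitedTask(long from, long to, int depth) {
        this.from = from;
        this.to = to;
        this.depth = depth;
    }

    @Override
    protected Long compute() {
        if (to - from <= 1_000 || depth >= MAX_FORK_DEPTH) {
            long sum = 0;                     // sequential base case
            for (long i = from; i < to; i++) sum += i;
            return sum;
        }
        long mid = (from + to) >>> 1;
        DepthLimitedTask left  = new DepthLimitedTask(from, mid, depth + 1);
        DepthLimitedTask right = new DepthLimitedTask(mid, to, depth + 1);
        left.fork();                          // forked only above the depth cutoff
        return right.compute() + left.join();
    }

    public static void main(String[] args) {
        System.out.println(new DepthLimitedTask(0, 1_000_000, 0).invoke());
    }
}
```

Once the cutoff depth is reached, every subtree runs entirely in the thread that owns it, so the total number of forked tasks stays bounded regardless of input size.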


> 
> I think we can simplify the creation of the recursive tasks (assuming we 
> don't track the depth):
> 
> ```java
>     private static final class RecursiveOp {
>         private static <T> RecursiveTask<T> exec(RecursiveTask<T> op, boolean parallel) {
>             if (parallel) {
>                 op.fork();
>             } else {
>                 op.invoke();
>             }
>             return op;
>         }
> 
>         private static RecursiveTask<BigInteger> multiply(BigInteger a, BigInteger b, boolean parallel) {
>             var op = new RecursiveTask<BigInteger>() {
>                 @Override
>                 protected BigInteger compute() {
>                     return a.multiply(b, true, parallel);
>                 }
>             };
> 
>             return exec(op, parallel);
>         }
> 
>         private static RecursiveTask<BigInteger> square(BigInteger a, boolean parallel) {
>             var op = new RecursiveTask<BigInteger>() {
>                 @Override
>                 protected BigInteger compute() {
>                     return a.square(true, parallel);
>                 }
>             };
> 
>             return exec(op, parallel);
>         }
>     }
> ```

Good idea, will change that.
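
For context, a minimal sketch of the kind of caller that benefits: a fast-doubling (sum-of-squares) Fibonacci in which the last few huge top-level multiplications dominate, using the parallelMultiply() proposed in this PR (the class and helper names are illustrative):

```java
import java.math.BigInteger;

// Sketch of a Dijkstra-style fast-doubling Fibonacci. The recursion depth is
// only log2(n); the runtime is dominated by the final, very large multiplies,
// which is where parallelMultiply() pays off.
public class Fib {
    // returns {F(n), F(n+1)}
    static BigInteger[] fib(long n) {
        if (n == 0) return new BigInteger[] { BigInteger.ZERO, BigInteger.ONE };
        BigInteger[] half = fib(n >>> 1);
        BigInteger a = half[0], b = half[1];
        // F(2k)   = F(k) * (2*F(k+1) - F(k))
        // F(2k+1) = F(k)^2 + F(k+1)^2
        BigInteger c = a.parallelMultiply(b.shiftLeft(1).subtract(a));
        BigInteger d = a.parallelMultiply(a).add(b.parallelMultiply(b));
        return (n & 1) == 0 ? new BigInteger[] { c, d }
                            : new BigInteger[] { d, c.add(d) };
    }

    public static void main(String[] args) {
        System.out.println(fib(30)[0]);
    }
}
```

Swapping parallelMultiply for multiply gives the sequential baseline measured above; nothing else in the caller changes.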

-------------

PR: https://git.openjdk.java.net/jdk/pull/6409