On Wed, 24 Nov 2021 15:20:42 GMT, kabutz <d...@openjdk.java.net> wrote:

>> BigInteger currently uses three different algorithms for multiply. The 
>> simple quadratic algorithm, then the slightly better Karatsuba if we exceed 
>> a bit count and then Toom Cook 3 once we go into the several thousands of 
>> bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to 
>> parallelize it. I have demonstrated this several times in conference talks. 
>> In order to be consistent with other classes such as Arrays and Collection, 
>> I have added a parallelMultiply() method. Internally we have added a 
>> parameter to the private multiply method to indicate whether the calculation 
>> should be done in parallel.
>> 
>> The performance improvements are as should be expected. Fibonacci of 100 
>> million (using a single-threaded Dijkstra's sum of squares version) 
>> completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with 
>> the sequential multiply() method. This is on my 1-8-2 laptop. The final 
>> multiplications are with very large numbers, which then benefit from the 
>> parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>> 
>> We have also parallelized the private square() method. Internally, the 
>> square() method defaults to be sequential.
>> 
>> Some benchmark results, run on my 1-6-2 server:
>> 
>> 
>> Benchmark                                          (n)  Mode  Cnt      Score        Error  Units
>> BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ± 1899.444  ms/op
>> 
>> 
>> We can see that for larger calculations (fib 100m), the execution is 2.7x 
>> faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for 
>> small (fib 1m) it is roughly the same. Considering that the fibonacci 
>> algorithm that we used was in itself sequential, and that the last 3 
>> calculations would dominate, 2.7x faster should probably be considered quite 
>> good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit 
> since the last revision:
> 
>   Made forkOrInvoke() method protected to avoid strange compiler error

IIRC you are not overly concerned about the additional object creation of 
`RecursiveOp` instances?

If that changes, the operation method could return `Object` and choose either to 
perform the operation directly, returning the `BigInteger` result, or to return 
the particular `RecursiveOp`.


private static Object multiply(BigInteger a, BigInteger b, boolean parallel, int depth) {
    if (isParallel(parallel, depth)) {
        return new RecursiveMultiply(a, b, parallel, depth).fork();
    } else {
        // Also called by RecursiveMultiply.compute()
        return RecursiveMultiply.compute(a, b, false, depth);
    }
}


Then we could have another method on `RecursiveOp` that pattern matches e.g.:

  static BigInteger join(Object o) {
    // replace with pattern match switch when it exits preview
    if (o instanceof BigInteger b) {
        return b;
    } else if (o instanceof RecursiveOp r) {
        return r.join();
    } else {
        throw new InternalError("Cannot reach here");
    }
  }


That seems simple enough that it might be worth doing anyway.
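To make the shape concrete, here is a self-contained sketch of the idea (the simplified `RecursiveMultiply` stand-in and the bare `parallel` flag in place of the real threshold test are assumptions for illustration, not the actual patch):

```java
import java.math.BigInteger;
import java.util.concurrent.RecursiveTask;

public class JoinSketch {
    // Simplified stand-in for the proposed RecursiveOp hierarchy.
    static final class RecursiveMultiply extends RecursiveTask<BigInteger> {
        private final BigInteger a, b;
        RecursiveMultiply(BigInteger a, BigInteger b) { this.a = a; this.b = b; }
        @Override protected BigInteger compute() { return a.multiply(b); }
    }

    // Returns either a BigInteger (computed directly) or a forked task,
    // so the sequential path allocates no task object at all.
    static Object multiply(BigInteger a, BigInteger b, boolean parallel) {
        if (parallel) {
            return new RecursiveMultiply(a, b).fork();
        }
        return a.multiply(b);
    }

    // Pattern-matching join, as proposed above.
    static BigInteger join(Object o) {
        if (o instanceof BigInteger b) {
            return b;
        } else if (o instanceof RecursiveMultiply r) {
            return r.join();
        }
        throw new InternalError("Cannot reach here");
    }

    public static void main(String[] args) {
        var x = BigInteger.valueOf(1234);
        var y = BigInteger.valueOf(5678);
        System.out.println(join(multiply(x, y, false))); // sequential path
        System.out.println(join(multiply(x, y, true)));  // forked path
    }
}
```

Calling `fork()` from a non-worker thread submits the task to the common pool, so `join(multiply(...))` works from either context.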

src/java.base/share/classes/java/math/BigInteger.java line 1874:

> 1872:          */
> 1873:         private static final int PARALLEL_FORK_THRESHOLD = 
> Integer.getInteger(
> 1874:                 "java.math.BigInteger.parallelForkThreshold",

I suspect we don't need this system property; there is no such property for 
streams. We use `ForkJoinPool.getCommonPoolParallelism()`, which is 
configurable as an escape hatch.

src/java.base/share/classes/java/math/BigInteger.java line 1875:

> 1873:         private static final int PARALLEL_FORK_THRESHOLD = 
> Integer.getInteger(
> 1874:                 "java.math.BigInteger.parallelForkThreshold",
> 1875:                 (int) 
> Math.ceil(Math.log(Runtime.getRuntime().availableProcessors()) / 
> Math.log(2)));

We can simplify to `32 - 
Integer.numberOfLeadingZeros(ForkJoinPool.getCommonPoolParallelism() - 1)`.

`ForkJoinPool.getCommonPoolParallelism()` is guaranteed to return a value `> 0`.
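As a quick illustration (not part of the patch), this expression computes the ceiling of log2 for any positive parallelism value, matching the `Math.ceil(Math.log(...) / Math.log(2))` form it replaces:

```java
public class Log2Ceil {
    // ceil(log2(p)) for p > 0, via a leading-zeros count instead of
    // floating-point logarithms.
    static int log2Ceil(int p) {
        return 32 - Integer.numberOfLeadingZeros(p - 1);
    }

    public static void main(String[] args) {
        // p = 1 yields 0 because numberOfLeadingZeros(0) == 32.
        for (int p : new int[] {1, 2, 3, 4, 8, 12, 16}) {
            System.out.println(p + " -> " + log2Ceil(p));
        }
    }
}
```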

src/java.base/share/classes/java/math/BigInteger.java line 1878:

> 1876: 
> 1877:         @Serial
> 1878:         private static final long serialVersionUID = 0L;

I don't think you need to declare these; the instances are never intended to 
support serialization. For example, in the stream implementation, `AbstractTask` 
(which extends `CountedCompleter`) states:

 * <p>Serialization is not supported as there is no intention to serialize
 * tasks managed by stream ops.

src/java.base/share/classes/java/math/BigInteger.java line 1909:

> 1907:             @Override
> 1908:             public BigInteger compute() {
> 1909:                 return a.multiply(b, true, super.parallel, super.depth);

Using `super.` is a little unusual; it's not wrong :-) but not really idiomatic 
in the source code base.
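For illustration, the more common style is to read inherited fields unqualified, since nothing in the subclass shadows them. A hypothetical reduction (class and field names mirror the patch only loosely):

```java
import java.math.BigInteger;
import java.util.concurrent.RecursiveTask;

public class FieldAccessSketch {
    static abstract class RecursiveOp extends RecursiveTask<BigInteger> {
        final boolean parallel;
        final int depth;
        RecursiveOp(boolean parallel, int depth) {
            this.parallel = parallel;
            this.depth = depth;
        }
    }

    static final class RecursiveMultiply extends RecursiveOp {
        private final BigInteger a, b;
        RecursiveMultiply(BigInteger a, BigInteger b, boolean parallel, int depth) {
            super(parallel, depth);
            this.a = a;
            this.b = b;
        }

        @Override
        public BigInteger compute() {
            // Inherited fields referenced directly, no super. qualifier.
            System.out.println("parallel=" + parallel + ", depth=" + depth);
            return a.multiply(b);
        }
    }

    public static void main(String[] args) {
        var op = new RecursiveMultiply(BigInteger.TEN, BigInteger.TWO, true, 1);
        System.out.println(op.compute());
    }
}
```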

src/java.base/share/classes/java/math/BigInteger.java line 1930:

> 1928:         }
> 1929: 
> 1930:         private static RecursiveTask<BigInteger> exec(RecursiveOp op) {

Unused.

src/java.base/share/classes/java/math/BigInteger.java line 2000:

> 1998:         da1 = a2.add(a0);
> 1999:         db1 = b2.add(b0);
> 2000:         var vm1_task = RecursiveOp.multiply(da1.subtract(a1), 
> db1.subtract(b1), parallel, depth + 1);

I recommend incrementing the depth in the `RecursiveOp` constructor, thereby 
reducing the repetition.
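A hypothetical sketch of that change (names are assumptions): the constructor takes the parent's depth and stores `parentDepth + 1`, so call sites drop the repeated `depth + 1`:

```java
import java.math.BigInteger;
import java.util.concurrent.RecursiveTask;

public class DepthSketch {
    static abstract class RecursiveOp extends RecursiveTask<BigInteger> {
        final boolean parallel;
        final int depth;
        // The constructor records the child's depth, so every call site
        // passes the parent's depth unchanged.
        RecursiveOp(boolean parallel, int parentDepth) {
            this.parallel = parallel;
            this.depth = parentDepth + 1;
        }
    }

    static final class RecursiveMultiply extends RecursiveOp {
        private final BigInteger a, b;
        RecursiveMultiply(BigInteger a, BigInteger b, boolean parallel, int parentDepth) {
            super(parallel, parentDepth);
            this.a = a;
            this.b = b;
        }
        @Override public BigInteger compute() { return a.multiply(b); }
    }

    public static void main(String[] args) {
        // Call site no longer writes "depth + 1".
        var op = new RecursiveMultiply(BigInteger.TEN, BigInteger.TWO, true, 0);
        System.out.println(op.depth);    // child depth computed internally
        System.out.println(op.compute());
    }
}
```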

-------------

PR: https://git.openjdk.java.net/jdk/pull/6409
