On Wed, 24 Nov 2021 15:20:42 GMT, kabutz <d...@openjdk.java.net> wrote:

>> BigInteger currently uses three different algorithms for multiply. The 
>> simple quadratic algorithm, then the slightly better Karatsuba if we exceed 
>> a bit count and then Toom Cook 3 once we go into the several thousands of 
>> bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to 
>> parallelize it. I have demonstrated this several times in conference talks. 
>> In order to be consistent with other classes such as Arrays and Collection, 
>> I have added a parallelMultiply() method. Internally we have added a 
>> parameter to the private multiply method to indicate whether the calculation 
>> should be done in parallel.
>> 
>> The performance improvements are as should be expected. Fibonacci of 100 
>> million (using a single-threaded Dijkstra's sum of squares version) 
>> completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with 
>> the sequential multiply() method. This is on my 1-8-2 laptop. The final 
>> multiplications are with very large numbers, which then benefit from the 
>> parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>> 
>> We have also parallelized the private square() method. Internally, the 
>> square() method defaults to be sequential.
>> 
>> Some benchmark results, run on my 1-6-2 server:
>> 
>> 
>> Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ± 1899.444  ms/op
>> 
>> 
>> We can see that for larger calculations (fib 100m), the execution is 2.7x 
>> faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for 
>> small (fib 1m) it is roughly the same. Considering that the fibonacci 
>> algorithm that we used was in itself sequential, and that the last 3 
>> calculations would dominate, 2.7x faster should probably be considered quite 
>> good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit 
> since the last revision:
> 
>   Made forkOrInvoke() method protected to avoid strange compiler error

Based on the excellent feedback in this thread, I have added a recursion depth 
threshold for parallelMultiply(). The default is log2(processors), computed as


(int) Math.ceil(Math.log(Runtime.getRuntime().availableProcessors()) / Math.log(2))


and can be overridden with `-Djava.math.BigInteger.parallelForkThreshold=num`

We can also influence this number with `-XX:ActiveProcessorCount=num`, but that 
would have a wider impact.
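For illustration, here is a small sketch of how the default depth and the property override could be combined. The class and method names here are hypothetical, chosen for the example; they are not BigInteger's actual internals.


```java
// Illustrative sketch only: recomputes the default depth described above
// and the system-property override. Names are hypothetical.
public class ForkThreshold {

    // ceil(log2(availableProcessors)): e.g. 12 processors -> depth 4,
    // 1 processor -> depth 0 (no forking)
    static int defaultDepth() {
        int procs = Runtime.getRuntime().availableProcessors();
        return (int) Math.ceil(Math.log(procs) / Math.log(2));
    }

    // -Djava.math.BigInteger.parallelForkThreshold=num overrides the default;
    // 0 disables forking entirely
    static int parallelForkThreshold() {
        String prop = System.getProperty("java.math.BigInteger.parallelForkThreshold");
        return prop != null ? Integer.parseInt(prop) : defaultDepth();
    }

    public static void main(String[] args) {
        System.out.println("fork threshold = " + parallelForkThreshold());
    }
}
```

Note that on a single-processor machine the default works out to 0, which matches the "no forking at all" behaviour shown below for `-XX:ActiveProcessorCount=1`.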

Here are results for running the benchmark `BigIntegerParallelMultiply` out of 
the box on my 1-6-2 server:


Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
BigIntegerParallelMultiply.multiply            1000000    ss    4     77.314 ±   35.690  ms/op
BigIntegerParallelMultiply.multiply           10000000    ss    4    938.650 ±   69.232  ms/op
BigIntegerParallelMultiply.multiply          100000000    ss    4  24811.339 ± 1457.943  ms/op
BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     67.550 ±   39.059  ms/op
BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    489.738 ±  140.965  ms/op
BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   8814.412 ±  542.567  ms/op


Setting `-XX:ActiveProcessorCount=1` means that we do not fork at all:


Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
BigIntegerParallelMultiply.multiply            1000000    ss    4     72.752 ±   26.336  ms/op
BigIntegerParallelMultiply.multiply           10000000    ss    4    949.417 ±   19.686  ms/op
BigIntegerParallelMultiply.multiply          100000000    ss    4  24530.183 ± 1319.775  ms/op
BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     71.492 ±   21.860  ms/op
BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    926.025 ±   41.539  ms/op
BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4  23947.256 ±   90.076  ms/op


Similarly, when we set `-Djava.math.BigInteger.parallelForkThreshold=0` we do 
not fork either:


Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
BigIntegerParallelMultiply.multiply            1000000    ss    4     73.121 ±   30.315  ms/op
BigIntegerParallelMultiply.multiply           10000000    ss    4    931.979 ±   34.988  ms/op
BigIntegerParallelMultiply.multiply          100000000    ss    4  24779.179 ± 5167.362  ms/op
BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     72.280 ±   32.011  ms/op
BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    957.168 ±   37.957  ms/op
BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4  25032.332 ±  994.664  ms/op


On the 1-6-2 machine, the maximum number of tasks forked in parallel by our 
Fibonacci algorithm is 468 (4 x 9 x 13). The parallel recursion depth is 4 in 
this case. Note that before this change we could get millions of forks, 
depending on the size of the numbers being multiplied.

Once we get to Toom-Cook 3, we are multiplying very large numbers. Also, for 
convenience we create RecursiveTasks whether we fork or invoke directly. This 
means that object allocation goes up slightly (less than 1%) when doing 
sequential multiply() on very large numbers.
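The depth-limited fork-or-invoke pattern described above can be sketched roughly as follows. This is an assumption-laden illustration, not the actual BigInteger internals: the class name, the `forkOrInvoke()` body, and the use of plain `multiply()` inside `compute()` are all stand-ins.


```java
import java.math.BigInteger;
import java.util.concurrent.RecursiveTask;

// Sketch of a depth-limited fork-or-invoke pattern. A RecursiveTask is
// created either way (hence the small allocation overhead mentioned above);
// we only pay the fork/join cost while shallower than the threshold.
public class ForkOrInvokeDemo {

    // ceil(log2(processors)), the default threshold described in this thread
    static final int PARALLEL_FORK_THRESHOLD = (int) Math.ceil(
            Math.log(Runtime.getRuntime().availableProcessors()) / Math.log(2));

    static class MultiplyTask extends RecursiveTask<BigInteger> {
        final BigInteger a, b;
        final int depth; // recursion depth of this task

        MultiplyTask(BigInteger a, BigInteger b, int depth) {
            this.a = a; this.b = b; this.depth = depth;
        }

        @Override
        protected BigInteger compute() {
            // Stand-in for the recursive Toom-Cook 3 step
            return a.multiply(b);
        }

        BigInteger forkOrInvoke() {
            if (depth < PARALLEL_FORK_THRESHOLD) {
                return fork().join(); // run asynchronously in the common pool
            }
            return invoke(); // past the threshold: compute in the caller
        }
    }

    public static void main(String[] args) {
        BigInteger x = BigInteger.TEN.pow(1000);
        MultiplyTask t = new MultiplyTask(x, x, 0);
        System.out.println(t.forkOrInvoke().equals(BigInteger.TEN.pow(2000)));
        // prints true
    }
}
```

On a machine where the threshold is 0 (one processor, or the property set to 0), `forkOrInvoke()` never forks, matching the sequential numbers in the benchmarks above.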

-------------

PR: https://git.openjdk.java.net/jdk/pull/6409
