Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks comparing the two, along with some other libraries.
I'm using JMH 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmup and 10 test iterations. The suffixes in the benchmark names denote what's being tested:

  MC: Apache Commons Math (AMC) 3.6.1 using Array2DRowRealMatrix
  SB: Scala Breeze 0.12 with native libraries
  OJ: OJAlgo 39.0 (some other benchmarks suggest OJAlgo is an especially speedy pure-Java library)
  BMC: AMC using a BlockRealMatrix

I'm using the same matrix creation/multiplication/inversion code as I mentioned in my previous note. When testing BlockRealMatrix instances, I'm using fixed random matrices and annotating my class with @State(Scope.Benchmark). For the curious, my rationale for building matrices on the spot is that I'm already caching results pretty heavily and rarely perform the same operation on the same matrix twice -- I'm most curious about performance in the absence of caching benefits. (A stripped-down sketch of the benchmark class follows the results, for anyone who wants to reproduce the numbers.)

Test 1a: Creating and multiplying two random/uniform (i.e., all entries drawn from Math.random()) 100x100 matrices:

[info] MatrixBenchmarks.buildMultTestMC100 thrpt 10 836.579 ± 7.120 ops/s
[info] MatrixBenchmarks.buildMultTestSB100 thrpt 10 1649.089 ± 170.649 ops/s
[info] MatrixBenchmarks.buildMultTestOJ100 thrpt 10 1485.014 ± 44.158 ops/s
[info] MatrixBenchmarks.multBMC100 thrpt 10 1051.055 ± 2.290 ops/s

Test 1b: Creating and multiplying two random/uniform 500x500 matrices:

[info] MatrixBenchmarks.buildMultTestMC500 thrpt 10 8.997 ± 0.064 ops/s
[info] MatrixBenchmarks.buildMultTestSB500 thrpt 10 80.558 ± 0.627 ops/s
[info] MatrixBenchmarks.buildMultTestOJ500 thrpt 10 34.626 ± 2.505 ops/s
[info] MatrixBenchmarks.multBMC500 thrpt 10 9.132 ± 0.059 ops/s
[info] MatrixBenchmarks.multCacheBMC500 thrpt 10 13.630 ± 0.107 ops/s [no matrix creation]

---

Test 2a: Creating a single 500x500 matrix (to get a sense of whether the multiplication itself is driving the differences):

[info] MatrixBenchmarks.buildMC500 thrpt 10 155.026 ± 2.456 ops/s
[info] MatrixBenchmarks.buildSB500 thrpt 10 197.257 ± 4.619 ops/s
[info] MatrixBenchmarks.buildOJ500 thrpt 10 176.036 ± 2.121 ops/s

Test 2b: Creating a single 1000x1000 matrix (to show how it scales with N):

[info] MatrixBenchmarks.buildMC1000 thrpt 10 36.112 ± 2.775 ops/s
[info] MatrixBenchmarks.buildSB1000 thrpt 10 45.557 ± 2.893 ops/s
[info] MatrixBenchmarks.buildOJ1000 thrpt 10 47.438 ± 1.370 ops/s
[info] MatrixBenchmarks.buildBMC1000 thrpt 10 37.779 ± 0.871 ops/s

---

Test 3a: Inverting a random 100x100 matrix:

[info] MatrixBenchmarks.invMC100 thrpt 10 1033.011 ± 9.928 ops/s
[info] MatrixBenchmarks.invSB100 thrpt 10 1664.833 ± 15.170 ops/s
[info] MatrixBenchmarks.invOJ100 thrpt 10 1174.044 ± 52.083 ops/s
[info] MatrixBenchmarks.invBMC100 thrpt 10 1043.858 ± 22.144 ops/s

Test 3b: Inverting a random 500x500 matrix:

[info] MatrixBenchmarks.invMC500 thrpt 10 9.089 ± 0.284 ops/s
[info] MatrixBenchmarks.invSB500 thrpt 10 43.857 ± 1.083 ops/s
[info] MatrixBenchmarks.invOJ500 thrpt 10 23.444 ± 1.484 ops/s
[info] MatrixBenchmarks.invBMC500 thrpt 10 9.156 ± 0.052 ops/s
[info] MatrixBenchmarks.invCacheBMC500 thrpt 10 9.627 ± 0.084 ops/s [no matrix creation]

Test 3c: Inverting a random 1000x1000 matrix:

[info] MatrixBenchmarks.invMC1000 thrpt 10 0.704 ± 0.040 ops/s
[info] MatrixBenchmarks.invSB1000 thrpt 10 8.611 ± 0.557 ops/s
[info] MatrixBenchmarks.invOJ1000 thrpt 10 3.539 ± 0.229 ops/s
[info] MatrixBenchmarks.invBMC1000 thrpt 10 0.691 ± 0.095 ops/s

Also, isn't matrix multiplication at least O(N^2.37), rather than O(N^2)? The schoolbook algorithm is O(N^3), and even the asymptotically fastest known algorithms are around O(N^2.37).
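As promised, here is a stripped-down sketch of how the 500x500 Commons Math benchmarks are structured. The method names match the output above; everything else is abbreviated (the full harness also has the Breeze and OJAlgo methods and the other sizes), and randomValues is just an illustrative helper name. The inversion benchmarks follow the same pattern, using new LUDecomposition(m).getSolver().getInverse() as in the standalone example quoted below.

----
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.BlockRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 2)
@Measurement(iterations = 10)
public class MatrixBenchmarks {

    // Fixed random matrices for the "cache" variants, built once per trial.
    private RealMatrix blockA;
    private RealMatrix blockB;

    @Setup(Level.Trial)
    public void setup() {
        blockA = new BlockRealMatrix(randomValues(500, 500));
        blockB = new BlockRealMatrix(randomValues(500, 500));
    }

    // "MC" variant: build the Array2DRowRealMatrix inputs inside the benchmark,
    // so matrix creation is included in the measured time.
    @Benchmark
    public RealMatrix buildMultTestMC500() {
        RealMatrix a = new Array2DRowRealMatrix(randomValues(500, 500));
        RealMatrix b = new Array2DRowRealMatrix(randomValues(500, 500));
        return a.multiply(b); // returned so JMH doesn't dead-code the work
    }

    // "BMC" variant: same, but with BlockRealMatrix inputs.
    @Benchmark
    public RealMatrix multBMC500() {
        RealMatrix a = new BlockRealMatrix(randomValues(500, 500));
        RealMatrix b = new BlockRealMatrix(randomValues(500, 500));
        return a.multiply(b);
    }

    // "cache" variant: multiply the pre-built matrices only, no creation.
    @Benchmark
    public RealMatrix multCacheBMC500() {
        return blockA.multiply(blockB);
    }

    private static double[][] randomValues(int rows, int cols) {
        double[][] v = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                v[i][j] = Math.random();
            }
        }
        return v;
    }
}
----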
On 6 April 2016 at 12:55, Chris Lucas <[email protected]> wrote:

> Thanks for the quick reply!
>
> I've pasted a small self-contained example at the bottom. It creates the
> matrices in advance, but nothing meaningful changes if they're created on a
> per-operation basis.
>
> Results for 50 multiplications of [size]x[size] matrices:
> Size: 10, total time: 0.012 seconds, time per: 0.000 seconds
> Size: 100, total time: 0.062 seconds, time per: 0.001 seconds
> Size: 300, total time: 3.050 seconds, time per: 0.061 seconds
> Size: 500, total time: 15.186 seconds, time per: 0.304 seconds
> Size: 600, total time: 17.532 seconds, time per: 0.351 seconds
>
> For comparison:
>
> Results for 50 additions of [size]x[size] matrices (which should be
> faster, but the extent of the difference is nonetheless striking to me):
> Size: 10, total time: 0.011 seconds, time per: 0.000 seconds
> Size: 100, total time: 0.012 seconds, time per: 0.000 seconds
> Size: 300, total time: 0.020 seconds, time per: 0.000 seconds
> Size: 500, total time: 0.032 seconds, time per: 0.001 seconds
> Size: 600, total time: 0.050 seconds, time per: 0.001 seconds
>
> Results for 50 inversions of a [size]x[size] matrix, which I'd expect to
> be slower than multiplication for larger matrices:
> Size: 10, total time: 0.014 seconds, time per: 0.000 seconds
> Size: 100, total time: 0.074 seconds, time per: 0.001 seconds
> Size: 300, total time: 1.005 seconds, time per: 0.020 seconds
> Size: 500, total time: 5.024 seconds, time per: 0.100 seconds
> Size: 600, total time: 9.297 seconds, time per: 0.186 seconds
>
> I hope this is useful, and if I'm doing something wrong that's leading to
> this performance gap, I'd love to know.
>
> ----
> import org.apache.commons.math3.linear.LUDecomposition;
> import org.apache.commons.math3.linear.Array2DRowRealMatrix;
> import org.apache.commons.math3.linear.RealMatrix;
>
> public class AMCMatrices {
>
>     public static void main(String[] args) {
>         miniTest(0);
>     }
>
>     public static void miniTest(int tType) {
>         int samples = 50;
>
>         int[] sizes = { 10, 100, 300, 500, 600 };
>
>         for (int sI = 0; sI < sizes.length; sI++) {
>             int mSize = sizes[sI];
>
>             RealMatrix m0 = buildM(mSize, mSize);
>             RealMatrix m1 = buildM(mSize, mSize);
>
>             long start = System.nanoTime();
>             for (int n = 0; n < samples; n++) {
>                 switch (tType) {
>                 case 0:
>                     m0.multiply(m1);
>                     break;
>                 case 1:
>                     m0.add(m1);
>                     break;
>                 case 2:
>                     new LUDecomposition(m0).getSolver().getInverse();
>                     break;
>                 }
>             }
>             long end = System.nanoTime();
>
>             double dt = ((double) (end - start)) / 1E9;
>             System.out.println(String.format(
>                 "Size: %d, total time: %3.3f seconds, time per: %3.3f seconds",
>                 mSize, dt, dt / samples));
>         }
>     }
>
>     public static Array2DRowRealMatrix buildM(int r, int c) {
>         double[][] matVals = new double[r][c];
>         for (int i = 0; i < r; i++) {
>             for (int j = 0; j < c; j++) {
>                 matVals[i][j] = Math.random();
>             }
>         }
>         return new Array2DRowRealMatrix(matVals);
>     }
> }
>
> ----
>
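(For completeness: the BMC/BlockRealMatrix numbers at the top come from the JMH harness rather than from this standalone example, but rerunning the example with block matrices should only need an extra import of org.apache.commons.math3.linear.BlockRealMatrix and a variant of buildM along these lines -- buildBlockM is just an illustrative name:)

----
public static BlockRealMatrix buildBlockM(int r, int c) {
    double[][] matVals = new double[r][c];
    for (int i = 0; i < r; i++) {
        for (int j = 0; j < c; j++) {
            matVals[i][j] = Math.random();
        }
    }
    // BlockRealMatrix copies the raw values into its internal block layout.
    return new BlockRealMatrix(matVals);
}
----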
> On 5 April 2016 at 19:36, Gilles <[email protected]> wrote:
>
>> Hi.
>>
>> On Tue, 5 Apr 2016 15:43:04 +0100, Chris Lucas wrote:
>>
>>> I recently ran a benchmark comparing the performance of Commons Math
>>> 3.6.1's linear algebra library to that of Scala Breeze
>>> (https://github.com/scalanlp/breeze).
>>>
>>> I looked at det, inverse, Cholesky factorization, addition, and
>>> multiplication, including matrices of size 10, 100, 500, and 1000,
>>> with symmetric, non-symmetric, and non-square cases where applicable.
>>
>> It would be interesting to add this to the CM documentation:
>> https://issues.apache.org/jira/browse/MATH-1194
>>
>>> In general, I was pleasantly surprised: Commons Math performed about as
>>> well as Breeze, despite the latter relying on native libraries. There was
>>> one exception, however:
>>>
>>> m0.multiply(m1)
>>>
>>> where m0 and m1 are both Array2DRowRealMatrix instances. It scaled very
>>> poorly in Commons Math, being much slower than nominally more expensive
>>> operations like inv, and much slower than the Breeze implementation. Does
>>> anyone have a thought as to what's going on?
>>
>> Could you provide more information, such as a plot of performance vs. size?
>> Self-contained, minimal code to run would be nice too.
>> See the CM micro-benchmarking tool here:
>> https://github.com/apache/commons-math/blob/master/src/test/java/org/apache/commons/math4/PerfTestUtils.java
>> And an example of how to use it:
>> https://github.com/apache/commons-math/blob/master/src/userguide/java/org/apache/commons/math4/userguide/FastMathTestPerformance.java
>>
>>> In case it's useful, one representative test involves multiplying two
>>> instances of
>>>
>>> new Array2DRowRealMatrix(matVals)
>>>
>>> where matVals is 1000x1000 entries of Math.random() and the second
>>> instance is created as part of the loop. This part of the benchmark is
>>> not specific to the expensive multiplication step, and takes very little
>>> time relative to the multiplication itself. I'm using System.nanoTime()
>>> for the timing, and take the average time over several consecutive
>>> iterations, on a 3.5 GHz Intel Core i7, Oracle JRE (build 1.8.0_05-b13).
>>
>> You might want to confirm the behaviour using JMH (becoming the Java
>> standard for benchmarking):
>> http://openjdk.java.net/projects/code-tools/jmh/
>>
>> Best regards,
>> Gilles
>>
>>> Thanks,
>>>
>>> Chris
