On 7 April 2016 at 18:41, Gilles <[email protected]> wrote:
>
> It's certainly worth ascertaining with more data points (to make a nice
> plot).
>
I can run mult on both for sizes {10, 100, 500, 1000, 1500, 2000, 2500,
5000} if that would help.
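
Concretely, something like this sketch is what I have in mind: a plain
System.nanoTime sweep (rather than JMH) that prints (size, seconds) pairs
ready for a log-log plot. buildM is the same helper as in my earlier
message, repeated here so the snippet is self-contained:

----
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;

public class MultSweep {
    public static void main(String[] args) {
        int[] sizes = { 10, 100, 500, 1000, 1500, 2000, 2500, 5000 };
        for (int size : sizes) {
            RealMatrix m0 = buildM(size, size);
            RealMatrix m1 = buildM(size, size);
            long start = System.nanoTime();
            m0.multiply(m1);
            double dt = (System.nanoTime() - start) / 1e9;
            // Tab-separated (size, seconds) pairs, ready for plotting.
            System.out.println(size + "\t" + dt);
        }
    }

    // Build an r x c matrix with entries drawn uniformly from [0, 1).
    public static RealMatrix buildM(int r, int c) {
        double[][] vals = new double[r][c];
        for (int i = 0; i < r; i++)
            for (int j = 0; j < c; j++)
                vals[i][j] = Math.random();
        return new Array2DRowRealMatrix(vals);
    }
}
----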
>
>> Naturally, better asymptotic
>> performance would be nice,
>>
>
> What do you mean?
>
When I wrote that I was thinking it might make sense to use Strassen's
algorithm (or something else with better asymptotic performance; this isn't
my area of expertise) for large square matrices; a sketch of the idea is
below. After looking a bit at the literature, it seems unlikely to be worth
it, given numerical stability concerns etc.
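
For concreteness, here's roughly what I had in mind -- a minimal,
unoptimized sketch (assuming square matrices with power-of-two size; the
cutoff of 64 and the helper names are arbitrary choices of mine):

----
// Strassen's algorithm on raw double[][]: seven half-size multiplications
// instead of eight gives O(N^log2(7)) ~ O(N^2.81).
public class Strassen {

    public static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        if (n <= 64) { // below the cutoff, the plain triple loop wins
            double[][] c = new double[n][n];
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++)
                    for (int j = 0; j < n; j++)
                        c[i][j] += a[i][k] * b[k][j];
            return c;
        }
        int h = n / 2;
        double[][] a11 = block(a, 0, 0, h), a12 = block(a, 0, h, h);
        double[][] a21 = block(a, h, 0, h), a22 = block(a, h, h, h);
        double[][] b11 = block(b, 0, 0, h), b12 = block(b, 0, h, h);
        double[][] b21 = block(b, h, 0, h), b22 = block(b, h, h, h);
        double[][] m1 = multiply(add(a11, a22, 1), add(b11, b22, 1));
        double[][] m2 = multiply(add(a21, a22, 1), b11);
        double[][] m3 = multiply(a11, add(b12, b22, -1));
        double[][] m4 = multiply(a22, add(b21, b11, -1));
        double[][] m5 = multiply(add(a11, a12, 1), b22);
        double[][] m6 = multiply(add(a21, a11, -1), add(b11, b12, 1));
        double[][] m7 = multiply(add(a12, a22, -1), add(b21, b22, 1));
        double[][] c = new double[n][n];
        paste(c, add(add(m1, m4, 1), add(m7, m5, -1), 1), 0, 0); // c11
        paste(c, add(m3, m5, 1), 0, h);                          // c12
        paste(c, add(m2, m4, 1), h, 0);                          // c21
        paste(c, add(add(m1, m2, -1), add(m3, m6, 1), 1), h, h); // c22
        return c;
    }

    // Copy the h-by-h block whose top-left corner is (r, c).
    private static double[][] block(double[][] m, int r, int c, int h) {
        double[][] out = new double[h][h];
        for (int i = 0; i < h; i++)
            System.arraycopy(m[r + i], c, out[i], 0, h);
        return out;
    }

    // Entrywise x + sign * y.
    private static double[][] add(double[][] x, double[][] y, int sign) {
        int n = x.length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                out[i][j] = x[i][j] + sign * y[i][j];
        return out;
    }

    // Write an h-by-h block into c with top-left corner at (r, col).
    private static void paste(double[][] c, double[][] blk, int r, int col) {
        int h = blk.length;
        for (int i = 0; i < h; i++)
            System.arraycopy(blk[i], 0, c[r + i], col, h);
    }
}
----

The extra additions at each level are also where the numerical stability
concerns come in.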
>
>> but I recognize that the AMC devs likely have better things to do.
>>
>> Observation 2: BlockRealMatrix offers little improvement when multiplying
>> newly created matrices, but some improvement when multiplying the same
>> matrices repeatedly.
>>
>
> Why would someone do that?
>
Good question! Maybe if matrices are being selected/changed at runtime and
someone can't be bothered to check/cache?
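
If someone did want that, a memoized conversion is cheap to add. A sketch
(the class and the identity-keyed map are my own choices, and it only pays
off if callers really do reuse the same unmodified instances):

----
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;
import org.apache.commons.math3.linear.BlockRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;

// Memoize the RealMatrix -> BlockRealMatrix conversion so that repeatedly
// multiplying the same instances pays the conversion cost only once.
public class BlockCache {
    private final Map<RealMatrix, BlockRealMatrix> cache =
            Collections.synchronizedMap(new IdentityHashMap<>());

    public BlockRealMatrix toBlock(RealMatrix m) {
        return (m instanceof BlockRealMatrix)
                ? (BlockRealMatrix) m
                : cache.computeIfAbsent(m, k -> new BlockRealMatrix(k.getData()));
    }

    public RealMatrix multiply(RealMatrix m0, RealMatrix m1) {
        return toBlock(m0).multiply(toBlock(m1));
    }
}
----

Identity keys keep lookups cheap; either way the cache assumes matrices are
treated as immutable once built.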
>
>> Observation 3: OJAlgo doesn't scale better asymptotically from what I can
>> tell. That's no great surprise based on
>> https://github.com/optimatika/ojAlgo/wiki/Multiply. Scala Breeze w/native
>> libs scales closer to O(N^2.4).
>>
>
> Details on how you arrive at those numbers would really be a useful
> addition to the CM documentation.
>
I can share code snippets. I'd guess that Breeze actually scales w/N^3 in a
single thread and that the faster-seeming scaling is related to
parallelization, e.g., adding threads for larger matrices. The exponent
estimate itself is nothing fancy: fit a line to log(time) vs. log(N) and
read off the slope.
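
A sketch of that fit, assuming (size, seconds) pairs like the sweep above
produces; the sample numbers in main are made up purely to illustrate a
roughly cubic trend:

----
// If time ~ c * N^k, then log(time) = log(c) + k * log(N), so an ordinary
// least-squares line through the log-log points has slope k.
public class ScalingExponent {
    public static double exponent(double[] sizes, double[] seconds) {
        int n = sizes.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double x = Math.log(sizes[i]);
            double y = Math.log(seconds[i]);
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }

    public static void main(String[] args) {
        // Fabricated, roughly cubic timings: doubling N multiplies time by 8.
        double[] sizes = { 100, 200, 400, 800 };
        double[] secs = { 0.001, 0.008, 0.064, 0.512 };
        System.out.printf("estimated exponent: %.2f%n", exponent(sizes, secs));
    }
}
----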
>
>> Observation 4: Breeze with native libraries does better than one might
>> guess based on some previous benchmarks. People still might want to use
>> math commons, not least because those native libs might be encumbered
>> with undesirable licenses.
>>
>
> Do they use multi-threading?
>
Yes, they do. That's worth noting, since in some applications running
entire separate pipelines in parallel will be more efficient than
parallelizing individual matrix ops.
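
For instance, with many independent medium-sized problems, one pipeline per
core with single-threaded matrix ops inside often wins. A sketch, where
runPipeline is just a stand-in for real per-dataset work:

----
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Run whole independent pipelines in parallel, one task per dataset,
// keeping the linear algebra inside each task single-threaded.
public class PipelineParallelism {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Callable<Double>> tasks = IntStream.range(0, 32)
                .mapToObj(i -> (Callable<Double>) () -> runPipeline(i))
                .collect(Collectors.toList());
        for (Future<Double> f : pool.invokeAll(tasks)) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }

    // Placeholder for a real single-threaded pipeline (build, multiply,
    // invert, ... on one dataset).
    static double runPipeline(int seed) {
        return seed * 2.0;
    }
}
----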
>
> Regards,
> Gilles
>
>
>
>> [^1]:
>>
>>
>> https://git-wip-us.apache.org/repos/asf?p=commons-math.git;a=blob;f=src/main/java/org/apache/commons/math4/linear/Array2DRowRealMatrix.java;h=3778db56ba406beb973e1355234593bc006adb59;hb=HEAD
>>
>>
>> On 7 April 2016 at 16:50, Gilles <[email protected]> wrote:
>>
>> On Thu, 7 Apr 2016 16:17:52 +0100, Chris Lucas wrote:
>>>
>>>> Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
>>>> comparing the two, along with some other libraries.
>>>>
>>>>
>>> The email is not really suited for tables (see your message below).
>>> What are the points which you want to make?
>>>
>>> As said before, "reports" on CM features (a.o. benchmarks) are welcome but
>>> should be formatted for inclusion in the userguide (for now, until we might
>>> create a separate document for data that is subject to change more or less
>>> rapidly).
>>>
>>> If you suspect a problem with the code, then I suggest that you file a JIRA
>>> report, where files can be attached (or tables can be viewed without being
>>> deformed thanks to the "{noformat}" tag).
>>>
>>> Regards,
>>> Gilles
>>>
>>>
>>>> I'm using jmh 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmups and 10
>>>> test iterations.
>>>>
>>>> The suffixes denote what's being tested:
>>>> MC: AMC 3.6.1 using Array2DRowRealMatrix
>>>> SB: Scala Breeze 0.12 with native libraries
>>>> OJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especially
>>>> speedy pure-java library.
>>>> BMC: MC using a BlockRealMatrix.
>>>>
>>>> I'm using the same matrix creation/multiplication/inverse code as I
>>>> mentioned in my previous note. When testing BlockRealMatrices, I'm using
>>>> fixed random matrices and annotating my class with @State(Scope.Benchmark).
>>>> For the curious, my rationale for building matrices on the spot is that I'm
>>>> already caching results pretty heavily and rarely perform the same
>>>> operation on the same matrix twice -- I'm most curious about performance
>>>> in the absence of caching benefits.
>>>>
>>>> Test 1a: Creating and multiplying two random/uniform (i.e., all entries
>>>> drawn from Math.random()) 100x100 matrices:
>>>> [info] MatrixBenchmarks.buildMultTestMC100  thrpt  10   836.579 ±   7.120 ops/s
>>>> [info] MatrixBenchmarks.buildMultTestSB100  thrpt  10  1649.089 ± 170.649 ops/s
>>>> [info] MatrixBenchmarks.buildMultTestOJ100  thrpt  10  1485.014 ±  44.158 ops/s
>>>> [info] MatrixBenchmarks.multBMC100          thrpt  10  1051.055 ±   2.290 ops/s
>>>>
>>>> Test 1b: Creating and multiplying two random/uniform 500x500 matrices:
>>>> [info] MatrixBenchmarks.buildMultTestMC500  thrpt  10   8.997 ± 0.064 ops/s
>>>> [info] MatrixBenchmarks.buildMultTestSB500  thrpt  10  80.558 ± 0.627 ops/s
>>>> [info] MatrixBenchmarks.buildMultTestOJ500  thrpt  10  34.626 ± 2.505 ops/s
>>>> [info] MatrixBenchmarks.multBMC500          thrpt  10   9.132 ± 0.059 ops/s
>>>> [info] MatrixBenchmarks.multCacheBMC500     thrpt  10  13.630 ± 0.107 ops/s  [no matrix creation]
>>>>
>>>> ---
>>>>
>>>> Test 2a: Creating a single 500x500 matrix (to get a sense of whether the
>>>> mult itself is driving differences):
>>>> [info] MatrixBenchmarks.buildMC500  thrpt  10  155.026 ± 2.456 ops/s
>>>> [info] MatrixBenchmarks.buildSB500  thrpt  10  197.257 ± 4.619 ops/s
>>>> [info] MatrixBenchmarks.buildOJ500  thrpt  10  176.036 ± 2.121 ops/s
>>>>
>>>> Test 2b: Creating a single 1000x1000 matrix (to show it scales w/N):
>>>> [info] MatrixBenchmarks.buildMC1000   thrpt  10  36.112 ± 2.775 ops/s
>>>> [info] MatrixBenchmarks.buildSB1000   thrpt  10  45.557 ± 2.893 ops/s
>>>> [info] MatrixBenchmarks.buildOJ1000   thrpt  10  47.438 ± 1.370 ops/s
>>>> [info] MatrixBenchmarks.buildBMC1000  thrpt  10  37.779 ± 0.871 ops/s
>>>>
>>>> ---
>>>>
>>>> Test 3a: Inverting a random 100x100 matrix:
>>>> [info] MatrixBenchmarks.invMC100   thrpt  10  1033.011 ±  9.928 ops/s
>>>> [info] MatrixBenchmarks.invSB100   thrpt  10  1664.833 ± 15.170 ops/s
>>>> [info] MatrixBenchmarks.invOJ100   thrpt  10  1174.044 ± 52.083 ops/s
>>>> [info] MatrixBenchmarks.invBMC100  thrpt  10  1043.858 ± 22.144 ops/s
>>>>
>>>> Test 3b: Inverting a random 500x500 matrix:
>>>> [info] MatrixBenchmarks.invMC500       thrpt  10   9.089 ± 0.284 ops/s
>>>> [info] MatrixBenchmarks.invSB500       thrpt  10  43.857 ± 1.083 ops/s
>>>> [info] MatrixBenchmarks.invOJ500       thrpt  10  23.444 ± 1.484 ops/s
>>>> [info] MatrixBenchmarks.invBMC500      thrpt  10   9.156 ± 0.052 ops/s
>>>> [info] MatrixBenchmarks.invCacheBMC500 thrpt  10   9.627 ± 0.084 ops/s  [no matrix creation]
>>>>
>>>> Test 3c: Inverting a random 1000x1000 matrix:
>>>> [info] MatrixBenchmarks.invMC1000   thrpt  10  0.704 ± 0.040 ops/s
>>>> [info] MatrixBenchmarks.invSB1000   thrpt  10  8.611 ± 0.557 ops/s
>>>> [info] MatrixBenchmarks.invOJ1000   thrpt  10  3.539 ± 0.229 ops/s
>>>> [info] MatrixBenchmarks.invBMC1000  thrpt  10  0.691 ± 0.095 ops/s
>>>>
>>>> Also, isn't matrix multiplication at least O(N^2.37), rather than
>>>> O(N^2)?
>>>>
>>>> On 6 April 2016 at 12:55, Chris Lucas <[email protected]> wrote:
>>>>
>>>>> Thanks for the quick reply!
>>>>>
>>>>> I've pasted a small self-contained example at the bottom. It creates the
>>>>> matrices in advance, but nothing meaningful changes if they're created
>>>>> on a per-operation basis.
>>>>>
>>>>> Results for 50 multiplications of [size]x[size] matrices:
>>>>> Size: 10, total time: 0.012 seconds, time per: 0.000 seconds
>>>>> Size: 100, total time: 0.062 seconds, time per: 0.001 seconds
>>>>> Size: 300, total time: 3.050 seconds, time per: 0.061 seconds
>>>>> Size: 500, total time: 15.186 seconds, time per: 0.304 seconds
>>>>> Size: 600, total time: 17.532 seconds, time per: 0.351 seconds
>>>>>
>>>>> For comparison:
>>>>>
>>>>> Results for 50 additions of [size]x[size] matrices (which should be
>>>>> faster, but the extent of the difference is nonetheless striking to me):
>>>>> Size: 10, total time: 0.011 seconds, time per: 0.000 seconds
>>>>> Size: 100, total time: 0.012 seconds, time per: 0.000 seconds
>>>>> Size: 300, total time: 0.020 seconds, time per: 0.000 seconds
>>>>> Size: 500, total time: 0.032 seconds, time per: 0.001 seconds
>>>>> Size: 600, total time: 0.050 seconds, time per: 0.001 seconds
>>>>>
>>>>> Results for 50 inversions of a [size]x[size] matrix, which I'd expect
>>>>> to be slower than multiplication for larger matrices:
>>>>> Size: 10, total time: 0.014 seconds, time per: 0.000 seconds
>>>>> Size: 100, total time: 0.074 seconds, time per: 0.001 seconds
>>>>> Size: 300, total time: 1.005 seconds, time per: 0.020 seconds
>>>>> Size: 500, total time: 5.024 seconds, time per: 0.100 seconds
>>>>> Size: 600, total time: 9.297 seconds, time per: 0.186 seconds
>>>>>
>>>>> I hope this is useful, and if I'm doing something wrong that's leading
>>>>> to this performance gap, I'd love to know.
>>>>>
>>>>> ----
>>>>> import org.apache.commons.math3.linear.Array2DRowRealMatrix;
>>>>> import org.apache.commons.math3.linear.LUDecomposition;
>>>>> import org.apache.commons.math3.linear.RealMatrix;
>>>>>
>>>>> public class AMCMatrices {
>>>>>
>>>>>     public static void main(String[] args) {
>>>>>         miniTest(0); // 0 = multiply, 1 = add, 2 = invert
>>>>>     }
>>>>>
>>>>>     public static void miniTest(int tType) {
>>>>>         int samples = 50;
>>>>>         int[] sizes = { 10, 100, 300, 500, 600 };
>>>>>
>>>>>         for (int sI = 0; sI < sizes.length; sI++) {
>>>>>             int mSize = sizes[sI];
>>>>>
>>>>>             RealMatrix m0 = buildM(mSize, mSize);
>>>>>             RealMatrix m1 = buildM(mSize, mSize);
>>>>>
>>>>>             long start = System.nanoTime();
>>>>>             for (int n = 0; n < samples; n++) {
>>>>>                 switch (tType) {
>>>>>                 case 0:
>>>>>                     m0.multiply(m1);
>>>>>                     break;
>>>>>                 case 1:
>>>>>                     m0.add(m1);
>>>>>                     break;
>>>>>                 case 2:
>>>>>                     new LUDecomposition(m0).getSolver().getInverse();
>>>>>                     break;
>>>>>                 }
>>>>>             }
>>>>>             long end = System.nanoTime();
>>>>>
>>>>>             double dt = ((double) (end - start)) / 1E9;
>>>>>             System.out.println(String.format(
>>>>>                 "Size: %d, total time: %3.3f seconds, time per: %3.3f seconds",
>>>>>                 mSize, dt, dt / samples));
>>>>>         }
>>>>>     }
>>>>>
>>>>>     // Build an r x c matrix with entries drawn uniformly from [0, 1).
>>>>>     public static Array2DRowRealMatrix buildM(int r, int c) {
>>>>>         double[][] matVals = new double[r][c];
>>>>>         for (int i = 0; i < r; i++) {
>>>>>             for (int j = 0; j < c; j++) {
>>>>>                 matVals[i][j] = Math.random();
>>>>>             }
>>>>>         }
>>>>>         return new Array2DRowRealMatrix(matVals);
>>>>>     }
>>>>> }
>>>>>
>>>>> ----
>>>>>
>>>>> On 5 April 2016 at 19:36, Gilles <[email protected]> wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> On Tue, 5 Apr 2016 15:43:04 +0100, Chris Lucas wrote:
>>>>>>
>>>>>>> I recently ran a benchmark comparing the performance of math commons
>>>>>>> 3.6.1's linear algebra library to that of scala Breeze
>>>>>>> (https://github.com/scalanlp/breeze).
>>>>>>>
>>>>>>> I looked at det, inverse, Cholesky factorization, addition, and
>>>>>>> multiplication, including matrices of size 10, 100, 500, and 1000,
>>>>>>> with symmetric, non-symmetric, and non-square cases where applicable.
>>>>>>>
>>>>>>>
>>>>>> It would be interesting to add this to the CM documentation:
>>>>>> https://issues.apache.org/jira/browse/MATH-1194
>>>>>>
>>>>>>> In general, I was pleasantly surprised: math commons performed about as
>>>>>>> well as Breeze, despite the latter relying on native libraries. There was
>>>>>>> one exception, however:
>>>>>>>
>>>>>>> m0.multiply(m1)
>>>>>>>
>>>>>>> where m0 and m1 are both Array2DRowRealMatrix instances. It scaled very
>>>>>>> poorly in math commons, being much slower than nominally more expensive
>>>>>>> operations like inv, and much slower than the Breeze implementation.
>>>>>>> Does anyone have a thought as to what's going on?
>>>>>>>
>>>>>> Could you provide more information, such as a plot of performance vs.
>>>>>> size? A self-contained and minimal code to run would be nice too.
>>>>>> See the CM micro-benchmarking tool here:
>>>>>>
>>>>>> https://github.com/apache/commons-math/blob/master/src/test/java/org/apache/commons/math4/PerfTestUtils.java
>>>>>>
>>>>>> And an example of how to use it:
>>>>>>
>>>>>> https://github.com/apache/commons-math/blob/master/src/userguide/java/org/apache/commons/math4/userguide/FastMathTestPerformance.java
>>>>>>
>>>>>>> In case it's useful, one representative test involves multiplying two
>>>>>>> instances of
>>>>>>>
>>>>>>> new Array2DRowRealMatrix(matVals)
>>>>>>>
>>>>>>> where matVals is 1000x1000 entries of Math.random() and the second
>>>>>>> instance is created as part of the loop. This part of the benchmark is
>>>>>>> not specific to the expensive multiplication step, and takes very little
>>>>>>> time relative to the multiplication itself. I'm using System.nanoTime()
>>>>>>> for the timing, and take the average time over several consecutive
>>>>>>> iterations, on a 3.5 GHz Intel Core i7, Oracle JRE (build 1.8.0_05-b13).
>>>>>>>
>>>>>>>
>>>>>> You might want to confirm the behaviour using JMH (becoming the Java
>>>>>> standard for benchmarking):
>>>>>> http://openjdk.java.net/projects/code-tools/jmh/
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Chris
>