On Thu, 7 Apr 2016 19:25:43 +0100, Chris Lucas wrote:
On 7 April 2016 at 18:41, Gilles <gil...@harfang.homelinux.org> wrote:


It's certainly worth ascertaining with more data points (to make a nice
plot).


I can run mult on both for {10, 100, 500, 1000, 1500, 2000, 2500, 5000} if
that would help.

That would help if you are willing to also format the results and enhance
the CM userguide. ;-)


Naturally, better asymptotic
performance would be nice,


What do you mean?


When I wrote that, I was thinking it might make sense to use Strassen's algorithm (or something else with better asymptotic performance; this isn't my area of expertise) for large square matrices. After looking a bit at the literature, it seems unlikely to be worth it, given numerical stability concerns etc.
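
For concreteness, this is the kind of recursion I had in mind -- a minimal, untuned sketch of one Strassen step written against CM's RealMatrix, for square matrices whose size is a power of two. The class name and the crossover size of 64 are made up for illustration, not tuned and not part of CM:

----
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;

public class StrassenSketch {

  /** One Strassen recursion over CM matrices; n must be a power of two. */
  public static RealMatrix multiply(RealMatrix a, RealMatrix b) {
    final int n = a.getRowDimension();
    if (n <= 64) {            // arbitrary crossover: fall back to plain multiply
      return a.multiply(b);
    }
    final int h = n / 2;
    // Quadrants (getSubMatrix takes inclusive indices).
    RealMatrix a11 = a.getSubMatrix(0, h - 1, 0, h - 1);
    RealMatrix a12 = a.getSubMatrix(0, h - 1, h, n - 1);
    RealMatrix a21 = a.getSubMatrix(h, n - 1, 0, h - 1);
    RealMatrix a22 = a.getSubMatrix(h, n - 1, h, n - 1);
    RealMatrix b11 = b.getSubMatrix(0, h - 1, 0, h - 1);
    RealMatrix b12 = b.getSubMatrix(0, h - 1, h, n - 1);
    RealMatrix b21 = b.getSubMatrix(h, n - 1, 0, h - 1);
    RealMatrix b22 = b.getSubMatrix(h, n - 1, h, n - 1);
    // Seven half-size products instead of eight.
    RealMatrix m1 = multiply(a11.add(a22), b11.add(b22));
    RealMatrix m2 = multiply(a21.add(a22), b11);
    RealMatrix m3 = multiply(a11, b12.subtract(b22));
    RealMatrix m4 = multiply(a22, b21.subtract(b11));
    RealMatrix m5 = multiply(a11.add(a12), b22);
    RealMatrix m6 = multiply(a21.subtract(a11), b11.add(b12));
    RealMatrix m7 = multiply(a12.subtract(a22), b21.add(b22));
    // Recombine the quadrants of the result.
    RealMatrix c = new Array2DRowRealMatrix(n, n);
    c.setSubMatrix(m1.add(m4).subtract(m5).add(m7).getData(), 0, 0);
    c.setSubMatrix(m3.add(m5).getData(), 0, h);
    c.setSubMatrix(m2.add(m4).getData(), h, 0);
    c.setSubMatrix(m1.subtract(m2).add(m3).add(m6).getData(), h, h);
    return c;
  }
}
----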

Thanks for checking!
It would also help if this information were retained in the doc so that the same question does not pop up every couple of years.


but I recognize that the AMC devs likely have
better things to do.

Observation 2: BlockRealMatrix offers little improvement when multiplying newly created matrices, but some improvement when multiplying the same matrices repeatedly


Why would someone do that?


Good question! Maybe if matrices are being selected/changed at runtime and
someone can't be bothered to check/cache?

Having immutable implementation(s) could perhaps help in some cases (?).
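
As a hypothetical illustration (none of this is CM API), a wrapper that does the "check/cache" automatically; it is only sound if the wrapped matrices are treated as immutable, which is where the remark above comes in:

----
import org.apache.commons.math3.linear.RealMatrix;

/** Hypothetical wrapper, not CM API: reuses the last product when
 *  multiply() is called again with the very same operand instance.
 *  Only sound if the matrices involved are never mutated. */
public class CachingMatrix {
  private final RealMatrix delegate;
  private RealMatrix lastOperand;  // compared by identity, not equals()
  private RealMatrix lastProduct;

  public CachingMatrix(RealMatrix delegate) {
    this.delegate = delegate;
  }

  public RealMatrix multiply(RealMatrix m) {
    if (m != lastOperand) {
      lastOperand = m;
      lastProduct = delegate.multiply(m);
    }
    return lastProduct;
  }
}
----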

None of the regular CM contributors have had use-cases for enhancing
algorithms/data structures that would deal with large datasets.
[Up to contemplating dropping the sparse matrix implementation, due to
identified shortcomings (cf. JIRA/ML archive).]

Perhaps this would also require a lot of work to reinvent a wheel that is out of scope for CM (?).

Observation 3: OJAlgo doesn't scale better asymptotically from what I can tell. That's no great surprise based on
  https://github.com/optimatika/ojAlgo/wiki/Multiply
Scala Breeze w/native libs scales closer to O(N^2.4).


Details on how you arrive at those numbers would really be a useful addition to the CM documentation.


I can share code snippets. I'd guess that Breeze actually scales w/N^3 in a single thread and that the faster-seeming scaling is related to parallelization, e.g., adding threads for larger matrices.
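
For what it's worth, a figure like "scales closer to O(N^2.4)" can be estimated as the least-squares slope of log(time) against log(size). A minimal sketch of such a fit (a hypothetical helper, not part of the benchmark code):

----
/** Least-squares slope of log(seconds) vs log(size): an estimate of
 *  the exponent k in time ~ c * N^k. Inputs are whatever timings the
 *  benchmark produced (hypothetical helper, not benchmark code). */
public static double scalingExponent(double[] sizes, double[] seconds) {
  final int n = sizes.length;
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  for (int i = 0; i < n; i++) {
    final double x = Math.log(sizes[i]);
    final double y = Math.log(seconds[i]);
    sx += x;
    sy += y;
    sxx += x * x;
    sxy += x * y;
  }
  return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}
----

For instance, if the time grows by roughly 8x whenever N doubles, the fitted slope comes out near 3.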

If performance can be improved through parallelization, it would certainly
be useful to know the expected gain, and open a "Wish" request in JIRA.

There are several features in CM where parallelism is the obvious next thing to do (e.g. FFT), either internally, or by providing an interface (and thread-safe instances) to be used by an external framework.
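
As a sketch of the "external" option: a row-partitioned multiply layered on top of CM with Java 8 parallel streams. It assumes the operands are not mutated while it runs, and the per-entry getEntry() calls carry overhead, so this shows the shape of the approach rather than an optimized kernel:

----
import java.util.stream.IntStream;

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;

public class ParallelOps {

  /** Row-partitioned multiply on top of CM; each thread writes a
   *  disjoint set of rows of c, so no synchronization is needed. */
  public static RealMatrix parallelMultiply(final RealMatrix a, final RealMatrix b) {
    final int rows = a.getRowDimension();
    final int cols = b.getColumnDimension();
    final int inner = a.getColumnDimension();
    final double[][] c = new double[rows][cols];
    IntStream.range(0, rows).parallel().forEach(i -> {
      // i-k-j loop order keeps the inner loop running along a row of b.
      for (int k = 0; k < inner; k++) {
        final double aik = a.getEntry(i, k);
        for (int j = 0; j < cols; j++) {
          c[i][j] += aik * b.getEntry(k, j);
        }
      }
    });
    return new Array2DRowRealMatrix(c, false);  // wrap without copying
  }
}
----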

Unfortunately, I must state that (almost) nobody seems interested in
enhancing CM!

Observation 4: Breeze with native libraries does better than one might guess based on some previous benchmarks. People still might want to use math commons, not least because those native libs might be encumbered with undesirable licenses.


Do they use multi-threading?


Yes, they do. I suppose that's important to note given that there are some applications where running entire separate pipelines in parallel will be
more efficient than having parallel matrix ops.

There are many issues with CM's "linear" package; see e.g.
  https://issues.apache.org/jira/browse/MATH-765

A clean-up should certainly consider parallelism as a requirement for a new
design.

The task is obviously daunting given that the MATH-765 issue has been
open for more than 4 years now... :-/


Regards,
Gilles




[^1]: https://git-wip-us.apache.org/repos/asf?p=commons-math.git;a=blob;f=src/main/java/org/apache/commons/math4/linear/Array2DRowRealMatrix.java;h=3778db56ba406beb973e1355234593bc006adb59;hb=HEAD


On 7 April 2016 at 16:50, Gilles <gil...@harfang.homelinux.org> wrote:

On Thu, 7 Apr 2016 16:17:52 +0100, Chris Lucas wrote:

Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
comparing the two, along with some other libraries.


Email is not really suited to tables (see your message below).
What are the points you want to make?

As said before, "reports" on CM features (a.o. benchmarks) are welcome but should be formatted for inclusion in the userguide (for now, until we might create a separate document for data that is subject to change more or less rapidly).

If you suspect a problem with the code, then I suggest that you file a JIRA report, where files can be attached (or tables can be viewed without being deformed, thanks to the "{noformat}" tag).

Regards,
Gilles


I'm using jmh 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmups and 10 test iterations.

The suffixes denote what's being tested:
  MC: AMC 3.6.1 using Array2DRowRealMatrix
  SB: Scala Breeze 0.12 with native libraries
  OJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especially speedy pure-Java library.
  BMC: MC using a BlockRealMatrix.

I'm using the same matrix creation/multiplication/inverse code as I mentioned in my previous note. When testing BlockRealMatrix, I'm using fixed random matrices and annotating my class with @State(Scope.Benchmark). For the curious, my rationale for building matrices on the spot is that I'm already caching results pretty heavily and rarely perform the same operation on the same matrix twice -- I'm most curious about performance in the absence of caching benefits.
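
To make the setup concrete, the harness for, e.g., multBMC500 looks roughly like the sketch below. The benchmark name, versions, warmup/iteration counts, and @State(Scope.Benchmark) are as described above; the remaining details are reconstructed, not the exact code:

----
import java.util.concurrent.TimeUnit;

import org.apache.commons.math3.linear.BlockRealMatrix;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 2)
@Measurement(iterations = 10)
public class MatrixBenchmarks {

  private BlockRealMatrix m0;
  private BlockRealMatrix m1;

  @Setup
  public void setup() {
    // Fixed random matrices, built once per benchmark run.
    m0 = new BlockRealMatrix(randomData(500, 500));
    m1 = new BlockRealMatrix(randomData(500, 500));
  }

  @Benchmark
  public void multBMC500(Blackhole bh) {
    bh.consume(m0.multiply(m1));  // Blackhole prevents dead-code elimination
  }

  private static double[][] randomData(int r, int c) {
    double[][] d = new double[r][c];
    for (int i = 0; i < r; i++) {
      for (int j = 0; j < c; j++) {
        d[i][j] = Math.random();
      }
    }
    return d;
  }
}
----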

Test 1a: Creating and multiplying two random/uniform (i.e., all entries drawn from math.random()) 100x100 matrices:
[info] MatrixBenchmarks.buildMultTestMC100  thrpt  10   836.579 ±   7.120  ops/s
[info] MatrixBenchmarks.buildMultTestSB100  thrpt  10  1649.089 ± 170.649  ops/s
[info] MatrixBenchmarks.buildMultTestOJ100  thrpt  10  1485.014 ±  44.158  ops/s
[info] MatrixBenchmarks.multBMC100          thrpt  10  1051.055 ±   2.290  ops/s

Test 1b: Creating and multiplying two random/uniform 500x500 matrices:
[info] MatrixBenchmarks.buildMultTestMC500  thrpt  10   8.997 ± 0.064  ops/s
[info] MatrixBenchmarks.buildMultTestSB500  thrpt  10  80.558 ± 0.627  ops/s
[info] MatrixBenchmarks.buildMultTestOJ500  thrpt  10  34.626 ± 2.505  ops/s
[info] MatrixBenchmarks.multBMC500          thrpt  10   9.132 ± 0.059  ops/s
[info] MatrixBenchmarks.multCacheBMC500     thrpt  10  13.630 ± 0.107  ops/s  [no matrix creation]

---

Test 2a: Creating a single 500x500 matrix (to get a sense of whether the mult itself is driving differences):
[info] MatrixBenchmarks.buildMC500  thrpt  10  155.026 ± 2.456  ops/s
[info] MatrixBenchmarks.buildSB500  thrpt  10  197.257 ± 4.619  ops/s
[info] MatrixBenchmarks.buildOJ500  thrpt  10  176.036 ± 2.121  ops/s

Test 2b: Creating a single 1000x1000 matrix (to show it scales w/N):
[info] MatrixBenchmarks.buildMC1000   thrpt  10  36.112 ± 2.775  ops/s
[info] MatrixBenchmarks.buildSB1000   thrpt  10  45.557 ± 2.893  ops/s
[info] MatrixBenchmarks.buildOJ1000   thrpt  10  47.438 ± 1.370  ops/s
[info] MatrixBenchmarks.buildBMC1000  thrpt  10  37.779 ± 0.871  ops/s

---

Test 3a: Inverting a random 100x100 matrix:
[info] MatrixBenchmarks.invMC100   thrpt  10  1033.011 ±  9.928  ops/s
[info] MatrixBenchmarks.invSB100   thrpt  10  1664.833 ± 15.170  ops/s
[info] MatrixBenchmarks.invOJ100   thrpt  10  1174.044 ± 52.083  ops/s
[info] MatrixBenchmarks.invBMC100  thrpt  10  1043.858 ± 22.144  ops/s

Test 3b: Inverting a random 500x500 matrix:
[info] MatrixBenchmarks.invMC500        thrpt  10   9.089 ± 0.284  ops/s
[info] MatrixBenchmarks.invSB500        thrpt  10  43.857 ± 1.083  ops/s
[info] MatrixBenchmarks.invOJ500        thrpt  10  23.444 ± 1.484  ops/s
[info] MatrixBenchmarks.invBMC500       thrpt  10   9.156 ± 0.052  ops/s
[info] MatrixBenchmarks.invCacheBMC500  thrpt  10   9.627 ± 0.084  ops/s  [no matrix creation]

Test 3c: Inverting a random 1000x1000 matrix:
[info] MatrixBenchmarks.invMC1000   thrpt  10  0.704 ± 0.040  ops/s
[info] MatrixBenchmarks.invSB1000   thrpt  10  8.611 ± 0.557  ops/s
[info] MatrixBenchmarks.invOJ1000   thrpt  10  3.539 ± 0.229  ops/s
[info] MatrixBenchmarks.invBMC1000  thrpt  10  0.691 ± 0.095  ops/s

Also, isn't matrix multiplication at least O(N^2.37), rather than
O(N^2)?

On 6 April 2016 at 12:55, Chris Lucas <cluc...@inf.ed.ac.uk> wrote:

Thanks for the quick reply!


I've pasted a small self-contained example at the bottom. It creates the matrices in advance, but nothing meaningful changes if they're created on a per-operation basis.

Results for 50 multiplications of [size]x[size] matrices:
   Size: 10, total time: 0.012 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.062 seconds, time per: 0.001 seconds
   Size: 300, total time: 3.050 seconds, time per: 0.061 seconds
   Size: 500, total time: 15.186 seconds, time per: 0.304 seconds
   Size: 600, total time: 17.532 seconds, time per: 0.351 seconds

For comparison:

Results for 50 additions of [size]x[size] matrices (which should be faster, but the extent of the difference is nonetheless striking to me):
   Size: 10, total time: 0.011 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.012 seconds, time per: 0.000 seconds
   Size: 300, total time: 0.020 seconds, time per: 0.000 seconds
   Size: 500, total time: 0.032 seconds, time per: 0.001 seconds
   Size: 600, total time: 0.050 seconds, time per: 0.001 seconds

Results for 50 inversions of a [size]x[size] matrix, which I'd expect to be slower than multiplication for larger matrices:
   Size: 10, total time: 0.014 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.074 seconds, time per: 0.001 seconds
   Size: 300, total time: 1.005 seconds, time per: 0.020 seconds
   Size: 500, total time: 5.024 seconds, time per: 0.100 seconds
   Size: 600, total time: 9.297 seconds, time per: 0.186 seconds

I hope this is useful, and if I'm doing something wrong that's leading to this performance gap, I'd love to know.

----
import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;


public class AMCMatrices {

  public static void main(String[] args) {
    miniTest(0);
  }

  /** Runs the benchmark; tType selects the operation: 0 = multiply, 1 = add, 2 = invert. */
  public static void miniTest(int tType) {
    int samples = 50;

    int[] sizes = { 10, 100, 300, 500, 600 };

    for (int sI = 0; sI < sizes.length; sI++) {
      int mSize = sizes[sI];

      RealMatrix m0 = buildM(mSize, mSize);
      RealMatrix m1 = buildM(mSize, mSize);

      long start = System.nanoTime();
      for (int n = 0; n < samples; n++) {
        switch (tType) {
        case 0:
          m0.multiply(m1);
          break;
        case 1:
          m0.add(m1);
          break;
        case 2:
          new LUDecomposition(m0).getSolver().getInverse();
          break;
        }

      }
      long end = System.nanoTime();

      double dt = ((double) (end - start)) / 1E9;
      System.out.println(String.format(
          "Size: %d, total time: %3.3f seconds, time per: %3.3f seconds",
          mSize, dt, dt / samples));
    }
  }

  /** Builds an r x c matrix with entries drawn from Math.random(). */
  public static Array2DRowRealMatrix buildM(int r, int c) {
    double[][] matVals = new double[r][c];
    for (int i = 0; i < r; i++) {
      for (int j = 0; j < c; j++) {
        matVals[i][j] = Math.random();
      }
    }
    return new Array2DRowRealMatrix(matVals);
  }
}

----

On 5 April 2016 at 19:36, Gilles <gil...@harfang.homelinux.org> wrote:

Hi.


On Tue, 5 Apr 2016 15:43:04 +0100, Chris Lucas wrote:

I recently ran a benchmark comparing the performance of math commons 3.6.1's linear algebra library to that of Scala Breeze (https://github.com/scalanlp/breeze).

I looked at det, inverse, Cholesky factorization, addition, and multiplication, including matrices with 10, 100, 500, and 1000 elements, with symmetric, non-symmetric, and non-square cases where applicable.


It would be interesting to add this to the CM documentation:
  https://issues.apache.org/jira/browse/MATH-1194

In general, I was pleasantly surprised: math commons performed about as well as Breeze, despite the latter relying on native libraries. There was one exception, however:

    m0.multiply(m1)

where m0 and m1 are both Array2DRowRealMatrix instances. It scaled very poorly in math commons, being much slower than nominally more expensive operations like inv, and much slower than the Breeze implementation. Does anyone have a thought as to what's going on?


Could you provide more information, such as a plot of performance vs size?
A self-contained and minimal code sample to run would be nice too.
See the CM micro-benchmarking tool here:
  https://github.com/apache/commons-math/blob/master/src/test/java/org/apache/commons/math4/PerfTestUtils.java
And an example of how to use it:
  https://github.com/apache/commons-math/blob/master/src/userguide/java/org/apache/commons/math4/userguide/FastMathTestPerformance.java

In case it's useful, one representative test involves multiplying two instances of

    new Array2DRowRealMatrix(matVals)

where matVals is 1000x1000 entries of math.random and the second instance is created as part of the loop. This part of the benchmark is not specific to the expensive multiplication step, and takes very little time relative to the multiplication itself. I'm using System.nanoTime for the timing, and take the average time over several consecutive iterations, on a 3.5 GHz Intel Core i7, Oracle JRE (build 1.8.0_05-b13).


You might want to confirm the behaviour using JMH, which is becoming the Java standard for benchmarking:
  http://openjdk.java.net/projects/code-tools/jmh/


Best regards,
Gilles


Thanks,


Chris



