Re: [math] matrix multiplication surprisingly slow?

2016-04-08 Thread Gilles

On Thu, 7 Apr 2016 19:25:43 +0100, Chris Lucas wrote:
On 7 April 2016 at 18:41, Gilles wrote:




It's certainly worth ascertaining with more data points (to make a nice plot).



I can run mult on both for {10, 100, 500, 1000, 1500, 2000, 2500, 5000} if
that would help.


That would help if you are willing to also format the results and enhance
the CM userguide. ;-)



Naturally, better asymptotic performance would be nice,



What do you mean?



When I wrote that I was thinking it might make sense to use Strassen's
algorithm (or something else with better asymptotic performance; this isn't
my area of expertise) for large square matrices. After looking a bit at the
literature it seems unlikely to be worth it, given numerical stability
concerns etc.


Thanks for checking!
It would also help if this information is retained in the doc so that the
same questioning does not pop up every couple of years.



but I recognize that the AMC devs likely have
better things to do.

Observation 2: When multiplying new matrices, BlockMatrix offers some
improvement when multiplying the same matrices repeatedly



Why would someone do that?



Good question! Maybe if matrices are being selected/changed at runtime and
someone can't be bothered to check/cache?


Having immutable implementation(s) could perhaps help in some cases (?).


None of the regular CM contributors have had use-cases for enhancing
algorithms/data structures that would deal with large datasets.
[Up to contemplating dropping the sparse matrix implementation, due to
identified shortcomings (cf. JIRA/ML archive).]

Perhaps, this would also require a lot of work to reinvent a wheel that is
out of scope for CM (?).

Observation 3: OJAlgo doesn't scale better asymptotically from what I can
tell. That's no great surprise based on
https://github.com/optimatika/ojAlgo/wiki/Multiply.
Scala Breeze w/native libs scales closer to O(N^2.4).



Details on how you arrive at those numbers would really be a useful
addition to the CM documentation.



I can share code snippets. I'd guess that breeze actually scales w/N^3 in a
single thread and that the faster-seeming scaling is related to
parallelization, e.g., adding threads for larger matrices.


If performance can be improved through parallelization, it would certainly
be useful to know the expected gain, and open a "Wish" request in JIRA.

There are several features in CM where parallelism is the obvious next
thing to do (e.g. FFT), either internally, or by providing an interface
(and thread-safe instances) to be used by an external framework.
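
For illustration only, here is a minimal sketch of what such "external"
parallelism could look like today, without touching CM itself (class and
method names are hypothetical, not CM API): the rows of the result are
spread over a parallel stream, which is safe as long as the operands are
not modified concurrently.

import java.util.stream.IntStream;
import org.apache.commons.math3.linear.RealMatrix;

// Hypothetical helper (not part of CM): computes C = A * B, parallelizing
// over rows of the result. A and B are only read, never written.
public final class ParallelMultiply {
    public static double[][] multiply(final RealMatrix a, final RealMatrix b) {
        final int nRows = a.getRowDimension();
        final int nCols = b.getColumnDimension();
        final int nSum  = a.getColumnDimension();
        final double[][] out = new double[nRows][nCols];
        IntStream.range(0, nRows).parallel().forEach(row -> {
            final double[] aRow = a.getRow(row);  // one row copy per task
            for (int col = 0; col < nCols; col++) {
                double sum = 0;
                for (int i = 0; i < nSum; i++) {
                    sum += aRow[i] * b.getEntry(i, col);
                }
                out[row][col] = sum;  // each task writes its own row only
            }
        });
        return out;
    }
}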

Unfortunately, I must state that (almost) nobody seems interested in
enhancing CM!

Observation 4: Breeze with native libraries does better than one might
guess based on some previous benchmarks. People still might want to use
math commons, not least because those native libs might be encumbered with
undesirable licenses.



Do they use multi-threading?



Yes, they do. I suppose that's important to note given that there are some
applications where running entire separate pipelines in parallel will be
more efficient than having parallel matrix ops.


There are many issues with CM's "linear" package; see e.g.
  https://issues.apache.org/jira/browse/MATH-765

A clean-up should certainly consider parallelism as a requirement for a new
design.

The task is obviously daunting given that the MATH-765 issue has been
open for more than 4 years now... :-/


Regards,
Gilles






[^1]:



https://git-wip-us.apache.org/repos/asf?p=commons-math.git;a=blob;f=src/main/java/org/apache/commons/math4/linear/Array2DRowRealMatrix.java;h=3778db56ba406beb973e1355234593bc006adb59;hb=HEAD


On 7 April 2016 at 16:50, Gilles wrote:


On Thu, 7 Apr 2016 16:17:52 +0100, Chris Lucas wrote:


Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
comparing the two, along with some other libraries.


The email is not really suited for tables (see your message below).

What are the points which you want to make?

As said before, "reports" on CM features (a.o. benchmarks) are welcome but
should be formatted for inclusion in the userguide (for now, until we might
create a separate document for data that is subject to change more or less
rapidly).

If you suspect a problem with the code, then I suggest that you file a JIRA
report, where files can be attached (or tables can be viewed without being
deformed thanks to the "{noformat}" tag).

Regards,
Gilles


I'm using jmh 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmups and 10 test
iterations.

The suffixes denote what's being tested:
  MC: AMC 3.6.1 using Array2DRowRealMatrix
  SB: Scala Breeze 0.12 with native libraries
  OJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especially
speedy pure-java library.
  BMC: MC using a BlockRealMatrix.

I'm using the same matrix 

Re: [math] matrix multiplication surprisingly slow?

2016-04-07 Thread Chris Lucas
On 7 April 2016 at 18:41, Gilles  wrote:

>
> It's certainly worth ascertaining with more data points (to make a nice
> plot).
>

I can run mult on both for {10, 100, 500, 1000, 1500, 2000, 2500, 5000} if
that would help.


>
> Naturally, better asymptotic
>> performance would be nice,
>>
>
> What do you mean?
>

When I wrote that I was thinking it might make sense to use Strassen's
algorithm (or something else with better asymptotic performance; this isn't
my area of expertise) for large square matrices. After looking a bit at the
literature it seems unlikely to be worth it, given numerical stability
concerns etc.
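
For concreteness, the back-of-the-envelope comparison I had in mind
(standard textbook recurrences; nothing here was benchmarked):

  naive blocked multiply: T(N) = 8 T(N/2) + O(N^2)  =>  T(N) = O(N^3)
  Strassen:               T(N) = 7 T(N/2) + O(N^2)  =>  T(N) = O(N^(log2 7)) ~= O(N^2.807)

The asymptotic win only pays off for fairly large N, and the extra
additions weaken the error bounds, hence the conclusion above.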


>
> but I recognize that the AMC devs likely have
>> better things to do.
>>
>> Observation 2: When multiplying new matrices, BlockMatrix offers some
>> improvement when multiplying the same matrices repeatedly
>>
>
> Why would someone do that?
>

Good question! Maybe if matrices are being selected/changed at runtime and
someone can't be bothered to check/cache?


>
> Observation 3: OJAlgo doesn't scale better asymptotically from what I can
>> tell. That's no great surprise based on
>> https://github.com/optimatika/ojAlgo/wiki/Multiply.
>>  Scala Breeze
>> w/native
>> libs scales closer to O(N^2.4).
>>
>
> Details on how you arrive at those numbers would really be a useful
> addition
> to the CM documentation.
>

I can share code snippets. I'd guess that breeze actually scales w/N^3 in a
single thread and that the faster-seeming scaling is related to
parallelization, e.g., adding threads for larger matrices.
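
For what it's worth, the rough way I estimate the exponent is a log-log
slope between two sizes. Using the multiplication-only BlockRealMatrix
numbers from my earlier message (multBMC100 = 1051.055 ops/s, multBMC500 =
9.132 ops/s):

  p ~= ln(t(500) / t(100)) / ln(500 / 100)
     = ln((1/9.132) / (1/1051.055)) / ln 5
     = ln(115.1) / 1.609
    ~= 2.95

i.e. close to the N^3 expected for a single-threaded naive multiply; the
buildMult* numbers come out a bit lower because they also include the
O(N^2) cost of creating the matrices.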


>
> Observation 4: Breeze with native libraries does better than one might
>> guess based on some previous benchmarks. People still might want to use
>> math commons, not least because those native libs might be encumbered with
>> undesirable licenses.
>>
>
> Do they use multi-threading?
>

Yes, they do. I suppose that's important to note given that there are some
applications where running entire separate pipelines in parallel will be
more efficient than having parallel matrix ops.


>
> Regards,
> Gilles
>
>
>
>> [^1]:
>>
>>
>> https://git-wip-us.apache.org/repos/asf?p=commons-math.git;a=blob;f=src/main/java/org/apache/commons/math4/linear/Array2DRowRealMatrix.java;h=3778db56ba406beb973e1355234593bc006adb59;hb=HEAD
>>
>>
>> On 7 April 2016 at 16:50, Gilles  wrote:
>>
>> On Thu, 7 Apr 2016 16:17:52 +0100, Chris Lucas wrote:
>>>
>>> Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
 comparing the two, along with some other libraries.


>>> The email is not really suited for tables (see your message below).
>>> What are the points which you want to make?
>>>
>>> As said before, "reports" on CM features (a.o. benchmarks) are welcome
>>> but
>>> should be formatted for inclusion in the userguide (for now, until we
>>> might
>>> create a separate document for data that is subject to change more or
>>> less
>>> rapidly).
>>>
>>> If you suspect a problem with the code, then I suggest that you file a
>>> JIRA
>>> report, where files can be attached (or tables can be viewed without
>>> being
>>> deformed thanks to the "{noformat}" tag).
>>>
>>> Regards,
>>> Gilles
>>>
>>>
>>> I'm using jmh 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmups and 10
>>> test
>>>
 iterations.

 The suffixes denote what's being tested:
   MC: AMC 3.6.1 using Array2DRowRealMatrix
   SB: Scala Breeze 0.12 with native libraries
   OJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especially
 speedy pure-java library.
   BMC: MC using a BlockRealMatrix.

 I'm using the same matrix creation/multiplication/inverse code as I
 mentioned in my previous note. When testing BlockReadMatrices, I'm using
 fixed random matrices and annotating my class with
 @State(Scope.Benchmark).
 For the curious, my rationale for building matrices on the spot is that
 I'm
 already caching results pretty heavily and rarely perform the same
 operation on the same matrix twice -- I'm most curious about performance
 in
 the absence of caching benefits.

 Test 1a: Creating and multiplying two random/uniform (i.e., all entries
 drawn from math.random()) 100x100 matrices:
 [info] MatrixBenchmarks.buildMultTestMC100  thrpt   10   836.579 ±   7.120  ops/s
 [info] MatrixBenchmarks.buildMultTestSB100  thrpt   10  1649.089 ± 170.649  ops/s
 [info] MatrixBenchmarks.buildMultTestOJ100  thrpt   10  1485.014 ±  44.158  ops/s
 [info] MatrixBenchmarks.multBMC100          thrpt   10  1051.055 ±   2.290  ops/s

 Test 1b: Creating and multiplying two random/uniform 500x500 matrices:
 [info] MatrixBenchmarks.buildMultTestMC500  thrpt   10    8.997 ±   0.064  ops/s
 [info] MatrixBenchmarks.buildMultTestSB500  thrpt   10   80.558 ±   0.627  ops/s

Re: [math] matrix multiplication surprisingly slow?

2016-04-07 Thread Gilles

On Thu, 7 Apr 2016 18:09:10 +0100, Chris Lucas wrote:
Sorry, you're right that I should have summarized the key implications
(at least as I see them).

The short version:
   (1) If you're multiplying small matrices with AMC, BlockRealMatrix is
your friend. For large matrices, stick with Array2DRowRealMatrix


This is surprising, as it is the exact opposite of the intended purpose of
"BlockRealMatrix".
It's certainly worth ascertaining with more data points (to make a nice
plot).



(or something else?).


Patch welcome. :-)

   (2) Scaling isn't great -- O(N^3), where N is the number of rows/cols in
a square matrix -- but neither is OJAlgo's. The inter-library differences
can mostly be attributed to constant factors.

The longer version:

Observation 1: For Array2DRowRealMatrix, matrix multiplication appears to
scale with ~ O(N^3) (i.e., the naive algorithm). In retrospect, that was a
bit silly to spend time benchmarking because I've just looked at the
source[^1] and it *is* the naive algorithm.


Not totally, there is a copy of the "current row" which otherwise makes
the computation scale really badly (and perhaps the initial reason for the
"BlockRealMatrix" implementation).


Naturally, better asymptotic
performance would be nice,


What do you mean?


but I recognize that the AMC devs likely have
better things to do.

Observation 2: When multiplying new matrices, BlockMatrix offers some
improvement when multiplying the same matrices repeatedly


Why would someone do that?


and for smaller
matrices. However on one person's setup (mine) it doesn't offer better
performance on large random matrices. Follow-up tests with N \in
{1000,1500,2000,2500} indicate it scales worse than the naive algorithm
(didn't look at growth, but factor of ~5 worse on a 2500x2500 matrix).


In effect, that had been my conclusion some years ago. [But I hadn't checked
"small" matrices, as I was so convinced that the overhead of "BlockRealMatrix"
was unsuitable for the smaller sizes!]

IIUC, CM (both "Array2DRowRealMatrix" and "BlockRealMatrix") only makes data
closer in RAM with the "hope" that the CPU cache can help (at least that's
what the doc for "BlockRealMatrix" indicates IIRC).

Observation 3: OJAlgo doesn't scale better asymptotically from what I can
tell. That's no great surprise based on
https://github.com/optimatika/ojAlgo/wiki/Multiply.
Scala Breeze w/native libs scales closer to O(N^2.4).


Details on how you arrive at those numbers would really be a useful addition
to the CM documentation.

Observation 4: Breeze with native libraries does better than one might
guess based on some previous benchmarks. People still might want to use
math commons, not least because those native libs might be encumbered with
undesirable licenses.


Do they use multi-threading?

Regards,
Gilles



[^1]:

https://git-wip-us.apache.org/repos/asf?p=commons-math.git;a=blob;f=src/main/java/org/apache/commons/math4/linear/Array2DRowRealMatrix.java;h=3778db56ba406beb973e1355234593bc006adb59;hb=HEAD


On 7 April 2016 at 16:50, Gilles wrote:



On Thu, 7 Apr 2016 16:17:52 +0100, Chris Lucas wrote:

Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
comparing the two, along with some other libraries.



The email is not really suited for tables (see your message below).
What are the points which you want to make?

As said before, "reports" on CM features (a.o. benchmarks) are welcome but
should be formatted for inclusion in the userguide (for now, until we might
create a separate document for data that is subject to change more or less
rapidly).

If you suspect a problem with the code, then I suggest that you file a JIRA
report, where files can be attached (or tables can be viewed without being
deformed thanks to the "{noformat}" tag).

Regards,
Gilles


I'm using jmh 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmups and 10 test
iterations.

The suffixes denote what's being tested:
  MC: AMC 3.6.1 using Array2DRowRealMatrix
  SB: Scala Breeze 0.12 with native libraries
  OJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especially
speedy pure-java library.
  BMC: MC using a BlockRealMatrix.

I'm using the same matrix creation/multiplication/inverse code as I
mentioned in my previous note. When testing BlockRealMatrices, I'm using
fixed random matrices and annotating my class with
@State(Scope.Benchmark).
For the curious, my rationale for building matrices on the spot is that I'm
already caching results pretty heavily and rarely perform the same
operation on the same matrix twice -- I'm most curious about performance in
the absence of caching benefits.

Test 1a: Creating and multiplying two random/uniform (i.e., all entries
drawn from math.random()) 100x100 matrices:
[info] MatrixBenchmarks.buildMultTestMC100  thrpt   10   836.579 ±   7.120  ops/s
[info] 

Re: [math] matrix multiplication surprisingly slow?

2016-04-07 Thread Chris Lucas
Sorry, you're right that I should have summarized the key implications (at
least as I see them).

The short version:
   (1) If you're multiplying small matrices with AMC, BlockRealMatrix is
your friend. For large matrices, stick with Array2DRowRealMatrix (or
something else?).
   (2) Scaling isn't great -- O(N^3), where N is the number of rows/cols in
a square matrix -- but neither is OJAlgo's. The inter-library differences
can mostly be attributed to constant factors.

The longer version:

Observation 1: For Array2DRowRealMatrix, matrix multiplication appears to
scale with ~ O(N^3) (i.e., the naive algorithm). In retrospect, that was a
bit silly to spend time benchmarking because I've just looked at the
source[^1] and it *is* the naive algorithm. Naturally, better asymptotic
performance would be nice, but I recognize that the AMC devs likely have
better things to do.

Observation 2: When multiplying new matrices, BlockMatrix offers some
improvement when multiplying the same matrices repeatedly and for smaller
matrices. However on one person's setup (mine) it doesn't offer better
performance on large random matrices. Follow-up tests with N \in
{1000,1500,2000,2500} indicate it scales worse than the naive algorithm
(didn't look at growth, but factor of ~5 worse on a 2500x2500 matrix).

Observation 3: OJAlgo doesn't scale better asymptotically from what I can
tell. That's no great surprise based on
https://github.com/optimatika/ojAlgo/wiki/Multiply.
 Scala Breeze w/native
libs scales closer to O(N^2.4).

Observation 4: Breeze with native libraries does better than one might
guess based on some previous benchmarks. People still might want to use
math commons, not least because those native libs might be encumbered with
undesirable licenses.

[^1]:
https://git-wip-us.apache.org/repos/asf?p=commons-math.git;a=blob;f=src/main/java/org/apache/commons/math4/linear/Array2DRowRealMatrix.java;h=3778db56ba406beb973e1355234593bc006adb59;hb=HEAD


On 7 April 2016 at 16:50, Gilles  wrote:

> On Thu, 7 Apr 2016 16:17:52 +0100, Chris Lucas wrote:
>
>> Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
>> comparing the two, along with some other libraries.
>>
>
> The email is not really suited for tables (see your message below).
> What are the points which you want to make?
>
> As said before, "reports" on CM features (a.o. benchmarks) are welcome but
> should be formatted for inclusion in the userguide (for now, until we might
> create a separate document for data that is subject to change more or less
> rapidly).
>
> If you suspect a problem with the code, then I suggest that you file a JIRA
> report, where files can be attached (or tables can be viewed without being
> deformed thanks to the "{noformat}" tag).
>
> Regards,
> Gilles
>
>
> I'm using jmh 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmups and 10 test
>> iterations.
>>
>> The suffixes denote what's being tested:
>>   MC: AMC 3.6.1 using Array2DRowRealMatrix
>>   SB: Scala Breeze 0.12 with native libraries
>>   OJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especially
>> speedy pure-java library.
>>   BMC: MC using a BlockRealMatrix.
>>
>> I'm using the same matrix creation/multiplication/inverse code as I
>> mentioned in my previous note. When testing BlockRealMatrices, I'm using
>> fixed random matrices and annotating my class with
>> @State(Scope.Benchmark).
>> For the curious, my rationale for building matrices on the spot is that
>> I'm
>> already caching results pretty heavily and rarely perform the same
>> operation on the same matrix twice -- I'm most curious about performance
>> in
>> the absence of caching benefits.
>>
>> Test 1a: Creating and multiplying two random/uniform (i.e., all entries
>> drawn from math.random()) 100x100 matrices:
>> [info] MatrixBenchmarks.buildMultTestMC100  thrpt   10 836.579
>> ±   7.120  ops/s
>> [info] MatrixBenchmarks.buildMultTestSB100  thrpt   10  1649.089
>> ± 170.649  ops/s
>> [info] MatrixBenchmarks.buildMultTestOJ100  thrpt   10  1485.014 ± 44.158
>> ops/s
>> [info] MatrixBenchmarks.multBMC100    thrpt   10  1051.055 ±  2.290  ops/s
>>
>> Test 1b: Creating and multiplying two random/uniform 500x500 matrices:
>> [info] MatrixBenchmarks.buildMultTestMC500  thrpt   10   8.997
>> ±   0.064  ops/s
>> [info] MatrixBenchmarks.buildMultTestSB500  thrpt   10  80.558
>> ±   0.627  ops/s
>> [info] MatrixBenchmarks.buildMultTestOJ500  thrpt   10  34.626
>> ±   2.505  ops/s
>> [info] MatrixBenchmarks.multBMC500    thrpt   10   9.132 ±  0.059  ops/s
>> [info] MatrixBenchmarks.multCacheBMC500  thrpt   10  13.630 ± 0.107  ops/s
>> [no matrix creation]
>>
>> ---
>>
>> Test 2a: Creating a single 500x500 matrix (to get a sense of whether the
>> mult itself is driving differences:
>> [info] MatrixBenchmarks.buildMC500  thrpt   10 155.026
>> ±  

Re: [math] matrix multiplication surprisingly slow?

2016-04-07 Thread Gilles

On Thu, 7 Apr 2016 16:17:52 +0100, Chris Lucas wrote:

Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
comparing the two, along with some other libraries.


The email is not really suited for tables (see your message below).
What are the points which you want to make?

As said before, "reports" on CM features (a.o. benchmarks) are welcome but
should be formatted for inclusion in the userguide (for now, until we might
create a separate document for data that is subject to change more or less
rapidly).

If you suspect a problem with the code, then I suggest that you file a JIRA
report, where files can be attached (or tables can be viewed without being
deformed thanks to the "{noformat}" tag).

Regards,
Gilles

I'm using jmh 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmups and 10 test
iterations.

The suffixes denote what's being tested:
  MC: AMC 3.6.1 using Array2DRowRealMatrix
  SB: Scala Breeze 0.12 with native libraries
  OJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especially
speedy pure-java library.
  BMC: MC using a BlockRealMatrix.

I'm using the same matrix creation/multiplication/inverse code as I
mentioned in my previous note. When testing BlockRealMatrices, I'm using
fixed random matrices and annotating my class with @State(Scope.Benchmark).
For the curious, my rationale for building matrices on the spot is that I'm
already caching results pretty heavily and rarely perform the same
operation on the same matrix twice -- I'm most curious about performance in
the absence of caching benefits.

Test 1a: Creating and multiplying two random/uniform (i.e., all entries
drawn from math.random()) 100x100 matrices:
[info] MatrixBenchmarks.buildMultTestMC100  thrpt   10   836.579 ±   7.120  ops/s
[info] MatrixBenchmarks.buildMultTestSB100  thrpt   10  1649.089 ± 170.649  ops/s
[info] MatrixBenchmarks.buildMultTestOJ100  thrpt   10  1485.014 ±  44.158  ops/s
[info] MatrixBenchmarks.multBMC100          thrpt   10  1051.055 ±   2.290  ops/s


Test 1b: Creating and multiplying two random/uniform 500x500 matrices:
[info] MatrixBenchmarks.buildMultTestMC500  thrpt   10     8.997 ±   0.064  ops/s
[info] MatrixBenchmarks.buildMultTestSB500  thrpt   10    80.558 ±   0.627  ops/s
[info] MatrixBenchmarks.buildMultTestOJ500  thrpt   10    34.626 ±   2.505  ops/s
[info] MatrixBenchmarks.multBMC500          thrpt   10     9.132 ±   0.059  ops/s
[info] MatrixBenchmarks.multCacheBMC500     thrpt   10    13.630 ±   0.107  ops/s  [no matrix creation]

---

Test 2a: Creating a single 500x500 matrix (to get a sense of whether the
mult itself is driving differences):
[info] MatrixBenchmarks.buildMC500  thrpt   10  155.026 ±   2.456  ops/s
[info] MatrixBenchmarks.buildSB500  thrpt   10  197.257 ±   4.619  ops/s
[info] MatrixBenchmarks.buildOJ500  thrpt   10  176.036 ±   2.121  ops/s

Test 2b: Creating a single 1000x1000 matrix (to show it scales w/N):
[info] MatrixBenchmarks.buildMC1000   thrpt   10  36.112 ±  2.775  ops/s
[info] MatrixBenchmarks.buildSB1000   thrpt   10  45.557 ±  2.893  ops/s
[info] MatrixBenchmarks.buildOJ1000   thrpt   10  47.438 ±  1.370  ops/s
[info] MatrixBenchmarks.buildBMC1000  thrpt   10  37.779 ±  0.871  ops/s


---

Test 3a: Inverting a random 100x100 matrix:
[info] MatrixBenchmarks.invMC100   thrpt   10  1033.011 ±  9.928  ops/s
[info] MatrixBenchmarks.invSB100   thrpt   10  1664.833 ± 15.170  ops/s
[info] MatrixBenchmarks.invOJ100   thrpt   10  1174.044 ± 52.083  ops/s
[info] MatrixBenchmarks.invBMC100  thrpt   10  1043.858 ± 22.144  ops/s


Test 3b: Inverting a random 500x500 matrix:
[info] MatrixBenchmarks.invMC500        thrpt   10   9.089 ±  0.284  ops/s
[info] MatrixBenchmarks.invSB500        thrpt   10  43.857 ±  1.083  ops/s
[info] MatrixBenchmarks.invOJ500        thrpt   10  23.444 ±  1.484  ops/s
[info] MatrixBenchmarks.invBMC500       thrpt   10   9.156 ±  0.052  ops/s
[info] MatrixBenchmarks.invCacheBMC500  thrpt   10   9.627 ±  0.084  ops/s  [no matrix creation]

Test 3c: Inverting a random 1000x1000 matrix:
[info] MatrixBenchmarks.invMC1000   thrpt   10  0.704 ±  0.040  ops/s
[info] MatrixBenchmarks.invSB1000   thrpt   10  8.611 ±  0.557  ops/s
[info] MatrixBenchmarks.invOJ1000   thrpt   10  3.539 ±  0.229  ops/s
[info] MatrixBenchmarks.invBMC1000  thrpt   10  0.691 ±  0.095  ops/s


Also, isn't matrix multiplication at least O(N^2.37), rather than O(N^2)?


On 6 April 2016 at 12:55, Chris Lucas  wrote:


Thanks for the quick reply!

I've pasted a small self-contained example at the bottom. It creates the
matrices in advance, but nothing meaningful changes if they're created on a
per-operation basis.

Results for 50 multiplications of [size]x[size] matrices:
   Size: 10, total time: 0.012 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.062 seconds, time per: 

Re: [math] matrix multiplication surprisingly slow?

2016-04-07 Thread Chris Lucas
Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
comparing the two, along with some other libraries.

I'm using jmh 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmups and 10 test
iterations.

The suffixes denote what's being tested:
  MC: AMC 3.6.1 using Array2DRowRealMatrix
  SB: Scala Breeze 0.12 with native libraries
  OJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especially
speedy pure-java library.
  BMC: MC using a BlockRealMatrix.

I'm using the same matrix creation/multiplication/inverse code as I
mentioned in my previous note. When testing BlockRealMatrices, I'm using
fixed random matrices and annotating my class with @State(Scope.Benchmark).
For the curious, my rationale for building matrices on the spot is that I'm
already caching results pretty heavily and rarely perform the same
operation on the same matrix twice -- I'm most curious about performance in
the absence of caching benefits.
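
In case it helps anyone reproduce this, the benchmarks are structured
roughly like the skeleton below (a trimmed sketch, not the exact code; the
Breeze and OJAlgo variants follow the same pattern):

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.BlockRealMatrix;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 2)
@Measurement(iterations = 10)
@State(Scope.Benchmark)
public class MatrixBenchmarks {

    // Fixed random matrices, built once and reused (the mult* tests).
    BlockRealMatrix blockA = randomBlock(500);
    BlockRealMatrix blockB = randomBlock(500);

    @Benchmark
    public Object multBMC500() {
        // multiplication only: operands are reused across invocations
        return blockA.multiply(blockB);
    }

    @Benchmark
    public Object buildMultTestMC500() {
        // creation + multiplication: fresh matrices on every invocation
        return randomA2D(500).multiply(randomA2D(500));
    }

    static Array2DRowRealMatrix randomA2D(int n) {
        return new Array2DRowRealMatrix(randomData(n));
    }

    static BlockRealMatrix randomBlock(int n) {
        return new BlockRealMatrix(randomData(n));
    }

    static double[][] randomData(int n) {
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                d[i][j] = Math.random();
            }
        }
        return d;
    }
}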

Test 1a: Creating and multiplying two random/uniform (i.e., all entries
drawn from math.random()) 100x100 matrices:
[info] MatrixBenchmarks.buildMultTestMC100  thrpt   10   836.579 ±   7.120  ops/s
[info] MatrixBenchmarks.buildMultTestSB100  thrpt   10  1649.089 ± 170.649  ops/s
[info] MatrixBenchmarks.buildMultTestOJ100  thrpt   10  1485.014 ±  44.158  ops/s
[info] MatrixBenchmarks.multBMC100          thrpt   10  1051.055 ±   2.290  ops/s

Test 1b: Creating and multiplying two random/uniform 500x500 matrices:
[info] MatrixBenchmarks.buildMultTestMC500  thrpt   10     8.997 ±   0.064  ops/s
[info] MatrixBenchmarks.buildMultTestSB500  thrpt   10    80.558 ±   0.627  ops/s
[info] MatrixBenchmarks.buildMultTestOJ500  thrpt   10    34.626 ±   2.505  ops/s
[info] MatrixBenchmarks.multBMC500          thrpt   10     9.132 ±   0.059  ops/s
[info] MatrixBenchmarks.multCacheBMC500     thrpt   10    13.630 ±   0.107  ops/s  [no matrix creation]

---

Test 2a: Creating a single 500x500 matrix (to get a sense of whether the
mult itself is driving differences):
[info] MatrixBenchmarks.buildMC500  thrpt   10  155.026 ±   2.456  ops/s
[info] MatrixBenchmarks.buildSB500  thrpt   10  197.257 ±   4.619  ops/s
[info] MatrixBenchmarks.buildOJ500  thrpt   10  176.036 ±   2.121  ops/s

Test 2b: Creating a single 1000x1000 matrix (to show it scales w/N):
[info] MatrixBenchmarks.buildMC1000   thrpt   10  36.112 ±  2.775  ops/s
[info] MatrixBenchmarks.buildSB1000   thrpt   10  45.557 ±  2.893  ops/s
[info] MatrixBenchmarks.buildOJ1000   thrpt   10  47.438 ±  1.370  ops/s
[info] MatrixBenchmarks.buildBMC1000  thrpt   10  37.779 ±  0.871  ops/s

---

Test 3a: Inverting a random 100x100 matrix:
[info] MatrixBenchmarks.invMC100   thrpt   10  1033.011 ±  9.928  ops/s
[info] MatrixBenchmarks.invSB100   thrpt   10  1664.833 ± 15.170  ops/s
[info] MatrixBenchmarks.invOJ100   thrpt   10  1174.044 ± 52.083  ops/s
[info] MatrixBenchmarks.invBMC100 thrpt   10  1043.858 ± 22.144  ops/s

Test 3b: Inverting a random 500x500 matrix:
[info] MatrixBenchmarks.invMC500        thrpt   10   9.089 ±  0.284  ops/s
[info] MatrixBenchmarks.invSB500        thrpt   10  43.857 ±  1.083  ops/s
[info] MatrixBenchmarks.invOJ500        thrpt   10  23.444 ±  1.484  ops/s
[info] MatrixBenchmarks.invBMC500       thrpt   10   9.156 ±  0.052  ops/s
[info] MatrixBenchmarks.invCacheBMC500  thrpt   10   9.627 ±  0.084  ops/s  [no matrix creation]

Test 3c: Inverting a random 1000x1000 matrix:
[info] MatrixBenchmarks.invMC1000   thrpt   10  0.704 ±  0.040  ops/s
[info] MatrixBenchmarks.invSB1000   thrpt   10  8.611 ±  0.557  ops/s
[info] MatrixBenchmarks.invOJ1000   thrpt   10  3.539 ±  0.229  ops/s
[info] MatrixBenchmarks.invBMC1000  thrpt   10  0.691 ±  0.095  ops/s

Also, isn't matrix multiplication at least O(N^2.37), rather than O(N^2)?

On 6 April 2016 at 12:55, Chris Lucas  wrote:

> Thanks for the quick reply!
>
> I've pasted a small self-contained example at the bottom. It creates the
> matrices in advance, but nothing meaningful changes if they're created on a
> per-operation basis.
>
> Results for 50 multiplications of [size]x[size] matrices:
>Size: 10, total time: 0.012 seconds, time per: 0.000 seconds
>Size: 100, total time: 0.062 seconds, time per: 0.001 seconds
>Size: 300, total time: 3.050 seconds, time per: 0.061 seconds
>Size: 500, total time: 15.186 seconds, time per: 0.304 seconds
>Size: 600, total time: 17.532 seconds, time per: 0.351 seconds
>
> For comparison:
>
> Results for 50 additions of [size]x[size] matrices (which should be
> faster, but the extent of the difference is nonetheless striking to me):
>Size: 10, total time: 0.011 seconds, time per: 0.000 seconds
>Size: 100, total time: 0.012 seconds, time per: 0.000 seconds
>Size: 300, total time: 0.020 seconds, time per: 0.000 seconds
>Size: 500, total time: 0.032 seconds, time per: 0.001 seconds
>Size: 600, total 

Re: [math] matrix multiplication surprisingly slow?

2016-04-07 Thread Gilles

On Wed, 6 Apr 2016 12:55:06 +0100, Chris Lucas wrote:

Thanks for the quick reply!

I've pasted a small self-contained example at the bottom. It creates the
matrices in advance, but nothing meaningful changes if they're created on a
per-operation basis.

Results for 50 multiplications of [size]x[size] matrices:
   Size: 10, total time: 0.012 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.062 seconds, time per: 0.001 seconds
   Size: 300, total time: 3.050 seconds, time per: 0.061 seconds
   Size: 500, total time: 15.186 seconds, time per: 0.304 seconds
   Size: 600, total time: 17.532 seconds, time per: 0.351 seconds

For comparison:

Results for 50 additions of [size]x[size] matrices (which should be faster,
but the extent of the difference is nonetheless striking to me):
   Size: 10, total time: 0.011 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.012 seconds, time per: 0.000 seconds
   Size: 300, total time: 0.020 seconds, time per: 0.000 seconds
   Size: 500, total time: 0.032 seconds, time per: 0.001 seconds
   Size: 600, total time: 0.050 seconds, time per: 0.001 seconds

Results for 50 inversions of a [size]x[size] matrix, which I'd expect to be
slower than multiplication for larger matrices:
   Size: 10, total time: 0.014 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.074 seconds, time per: 0.001 seconds
   Size: 300, total time: 1.005 seconds, time per: 0.020 seconds
   Size: 500, total time: 5.024 seconds, time per: 0.100 seconds
   Size: 600, total time: 9.297 seconds, time per: 0.186 seconds

I hope this is useful, and if I'm doing something wrong that's leading to
this performance gap, I'd love to know.


Here is the result of a JMH run (only benchmarking matrix multiplication):

---CUT---
[...]

Benchmark                          Mode   Cnt        Score      Error  Units
MyBenchmarkMatrix.size100x100_a2d  thrpt  200      738.787 ±    8.501  ops/s
MyBenchmarkMatrix.size100x100_b    thrpt  200     1527.522 ±   16.158  ops/s
MyBenchmarkMatrix.size10x10_a2d    thrpt  200   697662.468 ± 4807.916  ops/s
MyBenchmarkMatrix.size10x10_b      thrpt  200  1003800.610 ± 6612.299  ops/s
MyBenchmarkMatrix.size600x600_a2d  thrpt  200        0.669 ±    0.007  ops/s
MyBenchmarkMatrix.size600x600_b    thrpt  200        7.794 ±    0.071  ops/s

---CUT---

where the suffix indicates the layout type:
  a2d -> Array2DRowRealMatrix
  b   -> BlockRealMatrix
[So, even for small matrices, "BlockRealMatrix" is a clear win.]
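
For reference, switching between the two layouts is a one-line change when
constructing from a double[][]. A small sketch ("data" stands for whatever
array is already at hand):

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.BlockRealMatrix;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;

public class LayoutExample {
    static void layouts(double[][] data) {
        RealMatrix a2d   = new Array2DRowRealMatrix(data); // plain row-major double[rows][cols]
        RealMatrix block = new BlockRealMatrix(data);      // stored as square blocks, cache-friendlier

        // MatrixUtils picks the layout from the size: Array2DRowRealMatrix for
        // small matrices, BlockRealMatrix once the element count gets large.
        RealMatrix auto = MatrixUtils.createRealMatrix(data);

        RealMatrix product = block.multiply(a2d);          // layouts can be mixed
    }
}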

As for the multiplication getting slower for larger sizes, I'd assume that
it's related to the O(n^2) nature of the algorithm (where "n" is the number
of elements of the matrix).
For the two layouts provided in CM and for the other library, you could
perhaps confirm the exact dependency by fitting the parameters "a", "b", "c"
of the function
  t(n) = a + b n + c n^2
using e.g.
http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/fitting/PolynomialCurveFitter.html
(and more data points).
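
A minimal sketch of such a fit with CM itself (the (n, t) pairs below are
placeholders; plug in the measured timings):

import org.apache.commons.math3.fitting.PolynomialCurveFitter;
import org.apache.commons.math3.fitting.WeightedObservedPoints;

public class TimingFit {
    public static void main(String[] args) {
        // (n, t) pairs: n = number of matrix elements, t = seconds per multiply.
        // Placeholder values; replace with actual measurements.
        double[][] samples = {
            {  100 *  100, 0.001 },
            {  500 *  500, 0.1   },
            { 1000 * 1000, 1.0   },
        };

        WeightedObservedPoints obs = new WeightedObservedPoints();
        for (double[] s : samples) {
            obs.add(s[0], s[1]);  // x = n, y = t(n)
        }

        // Fit t(n) = a + b n + c n^2, i.e. a degree-2 polynomial in n.
        PolynomialCurveFitter fitter = PolynomialCurveFitter.create(2);
        double[] coeff = fitter.fit(obs.toList());  // {a, b, c}, lowest degree first

        System.out.printf("a = %g, b = %g, c = %g%n", coeff[0], coeff[1], coeff[2]);
    }
}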


Regards,
Gilles




import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;


public class AMCMatrices {

  public static void main(String[] args) {
miniTest(0);
  }

  public static void miniTest(int tType) {
int samples = 50;

int sizes[] = { 10, 100, 300, 500, 600 };

for (int sI = 0; sI < sizes.length; sI++) {
  int mSize = sizes[sI];

  org.apache.commons.math3.linear.RealMatrix m0 = buildM(mSize, mSize);
  RealMatrix m1 = buildM(mSize, mSize);

  long start = System.nanoTime();
  for (int n = 0; n < samples; n++) {
switch (tType) {
case 0:
  m0.multiply(m1);
  break;
case 1:
  m0.add(m1);
  break;
case 2:
  new LUDecomposition(m0).getSolver().getInverse();
  break;
}

  }
  long end = System.nanoTime();

  double dt = ((double) (end - start)) / 1E9;
  System.out.println(String.format(
      "Size: %d, total time: %3.3f seconds, time per: %3.3f seconds",
      mSize, dt, dt / samples));
}
  }

  public static Array2DRowRealMatrix buildM(int r, int c) {
double[][] matVals = new double[r][c];
for (int i = 0; i < r; i++) {
  for (int j = 0; j < c; j++) {
matVals[i][j] = Math.random();
  }
}
return new Array2DRowRealMatrix(matVals);
  }
}



On 5 April 2016 at 19:36, Gilles wrote:



Hi.

On Tue, 5 Apr 2016 15:43:04 +0100, Chris Lucas wrote:

I recently ran a benchmark comparing the performance of math commons 3.6.1's
linear algebra library to that of scala Breeze (
https://github.com/scalanlp/breeze).

I looked at det, inverse, Cholesky factorization, addition, and
multiplication, including matrices with 10, 100, 500, and 1000 elements,
with symmetric, non-symmetric, and non-square cases where applicable.

Re: [math] matrix multiplication surprisingly slow?

2016-04-06 Thread Chris Lucas
Thanks for the quick reply!

I've pasted a small self-contained example at the bottom. It creates the
matrices in advance, but nothing meaningful changes if they're created on a
per-operation basis.

Results for 50 multiplications of [size]x[size] matrices:
   Size: 10, total time: 0.012 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.062 seconds, time per: 0.001 seconds
   Size: 300, total time: 3.050 seconds, time per: 0.061 seconds
   Size: 500, total time: 15.186 seconds, time per: 0.304 seconds
   Size: 600, total time: 17.532 seconds, time per: 0.351 seconds

For comparison:

Results for 50 additions of [size]x[size] matrices (which should be faster,
but the extent of the difference is nonetheless striking to me):
   Size: 10, total time: 0.011 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.012 seconds, time per: 0.000 seconds
   Size: 300, total time: 0.020 seconds, time per: 0.000 seconds
   Size: 500, total time: 0.032 seconds, time per: 0.001 seconds
   Size: 600, total time: 0.050 seconds, time per: 0.001 seconds

Results for 50 inversions of a [size]x[size] matrix, which I'd expect to be
slower than multiplication for larger matrices:
   Size: 10, total time: 0.014 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.074 seconds, time per: 0.001 seconds
   Size: 300, total time: 1.005 seconds, time per: 0.020 seconds
   Size: 500, total time: 5.024 seconds, time per: 0.100 seconds
   Size: 600, total time: 9.297 seconds, time per: 0.186 seconds

I hope this is useful, and if I'm doing something wrong that's leading to
this performance gap, I'd love to know.


import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;


public class AMCMatrices {

  public static void main(String[] args) {
miniTest(0);
  }

  public static void miniTest(int tType) {
int samples = 50;

int sizes[] = { 10, 100, 300, 500, 600 };

for (int sI = 0; sI < sizes.length; sI++) {
  int mSize = sizes[sI];

  org.apache.commons.math3.linear.RealMatrix m0 = buildM(mSize, mSize);
  RealMatrix m1 = buildM(mSize, mSize);

  long start = System.nanoTime();
  for (int n = 0; n < samples; n++) {
switch (tType) {
case 0:
  m0.multiply(m1);
  break;
case 1:
  m0.add(m1);
  break;
case 2:
  new LUDecomposition(m0).getSolver().getInverse();
  break;
}

  }
  long end = System.nanoTime();

  double dt = ((double) (end - start)) / 1E9;
  System.out.println(String.format(
  "Size: %d, total time: %3.3f seconds, time per: %3.3f seconds",
  mSize, dt, dt / samples));
}
  }

  public static Array2DRowRealMatrix buildM(int r, int c) {
double[][] matVals = new double[r][c];
for (int i = 0; i < r; i++) {
  for (int j = 0; j < c; j++) {
matVals[i][j] = Math.random();
  }
}
return new Array2DRowRealMatrix(matVals);
  }
}



On 5 April 2016 at 19:36, Gilles  wrote:

> Hi.
>
> On Tue, 5 Apr 2016 15:43:04 +0100, Chris Lucas wrote:
>
>> I recently ran a benchmark comparing the performance of math commons 3.6.1's
>> linear algebra library to that of scala Breeze (
>> https://github.com/scalanlp/breeze).
>>
>> I looked at det, inverse, Cholesky factorization, addition, and
>> multiplication, including matrices with 10, 100, 500, and 1000 elements,
>> with symmetric, non-symmetric, and non-square cases where applicable.
>>
>
> It would be interesting to add this to the CM documentation:
>   https://issues.apache.org/jira/browse/MATH-1194
>
> In general, I was pleasantly surprised: math commons performed about as
>> well as Breeze, despite the latter relying on native libraries. There was
>> one exception, however:
>>
>> m0.multiply(m1)
>>
>> where m0 and m1 are both Array2DRowRealMatrix instances. It scaled very
>> poorly in math commons, being much slower than nominally more expensive
>> operations like inv and the Breeze implementation. Does anyone have a
>> thought as to what's going on?
>>
>
> Could you provide more information such as a plot of performance vs size?
> A self-contained and minimal code to run would be nice too.
> See the CM micro-benchmarking tool here:
>
> https://github.com/apache/commons-math/blob/master/src/test/java/org/apache/commons/math4/PerfTestUtils.java
> And an example of how to use it:
>
> https://github.com/apache/commons-math/blob/master/src/userguide/java/org/apache/commons/math4/userguide/FastMathTestPerformance.java
>
> In case it's useful, one representative test
>> involves multiplying two instances of
>>
>> new Array2DRowRealMatrix(matVals)
>>
>> where matVals is 1000x1000 entries of math.random and the second instance
>> is created as part of the loop. This part of the benchmark is not specific
>> to the expensive multiplication step, 

Re: [math] matrix multiplication surprisingly slow?

2016-04-05 Thread Gilles

Hi.

On Tue, 5 Apr 2016 15:43:04 +0100, Chris Lucas wrote:
I recently ran a benchmark comparing the performance of math commons 3.6.1's
linear algebra library to that of scala Breeze (
https://github.com/scalanlp/breeze).

I looked at det, inverse, Cholesky factorization, addition, and
multiplication, including matrices with 10, 100, 500, and 1000 elements,
with symmetric, non-symmetric, and non-square cases where applicable.


It would be interesting to add this to the CM documentation:
  https://issues.apache.org/jira/browse/MATH-1194

In general, I was pleasantly surprised: math commons performed about as
well as Breeze, despite the latter relying on native libraries. There was
one exception, however:

m0.multiply(m1)

where m0 and m1 are both Array2DRowRealMatrix instances. It scaled very
poorly in math commons, being much slower than nominally more expensive
operations like inv and the Breeze implementation. Does anyone have a
thought as to what's going on?


Could you provide more information such as a plot of performance vs size?

A self-contained and minimal code to run would be nice too.
See the CM micro-benchmarking tool here:
  
https://github.com/apache/commons-math/blob/master/src/test/java/org/apache/commons/math4/PerfTestUtils.java

And an example of how to use it:
  
https://github.com/apache/commons-math/blob/master/src/userguide/java/org/apache/commons/math4/userguide/FastMathTestPerformance.java



In case it's useful, one representative test
involves multiplying two instances of

new Array2DRowRealMatrix(matVals)

where matVals is 1000x1000 entries of math.random and the second instance
is created as part of the loop. This part of the benchmark is not specific
to the expensive multiplication step, and takes very little time relative
to the multiplication itself. I'm using System.nanoTime for the timing, and
take the average time over several consecutive iterations, on a 3.5 GHz
Intel Core i7, Oracle JRE (build 1.8.0_05-b13).


You might want to confirm the behaviour using JMH (becoming the Java standard
for benchmarking):
  http://openjdk.java.net/projects/code-tools/jmh/


Best regards,
Gilles



Thanks,

Chris



-
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org



[math] matrix multiplication surprisingly slow?

2016-04-05 Thread Chris Lucas
I recently ran a benchmark comparing the performance of math commons 3.6.1's
linear algebra library to that of scala Breeze (
https://github.com/scalanlp/breeze).

I looked at det, inverse, Cholesky factorization, addition, and
multiplication, including matrices with 10, 100, 500, and 1000 elements,
with symmetric, non-symmetric, and non-square cases where applicable.

In general, I was pleasantly surprised: math commons performed about as
well as Breeze, despite the latter relying on native libraries. There was
one exception, however:

m0.multiply(m1)

where m0 and m1 are both Array2DRowRealMatrix instances. It scaled very
poorly in math commons, being much slower than nominally more expensive
operations like inv and the Breeze implementation. Does anyone have a
thought as to what's going on? In case it's useful, one representative test
involves multiplying two instances of

new Array2DRowRealMatrix(matVals)

where matVals is 1000x1000 entries of math.random and the second instance
is created as part of the loop. This part of the benchmark is not specific
to the expensive multiplication step, and takes very little time relative
to the multiplication itself. I'm using System.nanoTime for the timing, and
take the average time over several consecutive iterations, on a 3.5 GHz
Intel Core i7, Oracle JRE (build 1.8.0_05-b13).

Thanks,

Chris