On 7 April 2016 at 18:41, Gilles <[email protected]> wrote:
>
> It's certainly worth ascertaining with more data points (to make a nice
> plot).
>
I can run mult on both for sizes {10, 100, 500, 1000, 1500, 2000, 2500,
5000} if that would help.
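
Concretely, something like this sketch is what I have in mind: a plain
System.nanoTime sweep (rather than JMH) that prints (size, seconds) pairs
ready for a log-log plot. buildM is the same helper as in my earlier
message, repeated here so the snippet is self-contained:

----
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;

public class MultSweep {
    public static void main(String[] args) {
        int[] sizes = { 10, 100, 500, 1000, 1500, 2000, 2500, 5000 };
        for (int size : sizes) {
            RealMatrix m0 = buildM(size, size);
            RealMatrix m1 = buildM(size, size);
            long start = System.nanoTime();
            m0.multiply(m1);
            double dt = (System.nanoTime() - start) / 1e9;
            // Tab-separated (size, seconds) pairs, ready for plotting.
            System.out.println(size + "\t" + dt);
        }
    }

    // Build an r x c matrix with entries drawn uniformly from [0, 1).
    public static RealMatrix buildM(int r, int c) {
        double[][] vals = new double[r][c];
        for (int i = 0; i < r; i++)
            for (int j = 0; j < c; j++)
                vals[i][j] = Math.random();
        return new Array2DRowRealMatrix(vals);
    }
}
----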
>
>> Naturally, better asymptotic
>> performance would be nice,
>>
>
> What do you mean?
>
When I wrote that I was thinking it might make sense to use Strassen's
algorithm (or something else with better asymptotic performance; this isn't
my area of expertise) for large square matrices; a sketch of the idea is
below. After looking a bit at the literature, it seems unlikely to be worth
it, given numerical stability concerns etc.
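
For concreteness, here's roughly what I had in mind -- a minimal,
unoptimized sketch (assuming square matrices with power-of-two size; the
cutoff of 64 and the helper names are arbitrary choices of mine):

----
// Strassen's algorithm on raw double[][]: seven half-size multiplications
// instead of eight gives O(N^log2(7)) ~ O(N^2.81).
public class Strassen {

    public static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        if (n <= 64) { // below the cutoff, the plain triple loop wins
            double[][] c = new double[n][n];
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++)
                    for (int j = 0; j < n; j++)
                        c[i][j] += a[i][k] * b[k][j];
            return c;
        }
        int h = n / 2;
        double[][] a11 = block(a, 0, 0, h), a12 = block(a, 0, h, h);
        double[][] a21 = block(a, h, 0, h), a22 = block(a, h, h, h);
        double[][] b11 = block(b, 0, 0, h), b12 = block(b, 0, h, h);
        double[][] b21 = block(b, h, 0, h), b22 = block(b, h, h, h);
        double[][] m1 = multiply(add(a11, a22, 1), add(b11, b22, 1));
        double[][] m2 = multiply(add(a21, a22, 1), b11);
        double[][] m3 = multiply(a11, add(b12, b22, -1));
        double[][] m4 = multiply(a22, add(b21, b11, -1));
        double[][] m5 = multiply(add(a11, a12, 1), b22);
        double[][] m6 = multiply(add(a21, a11, -1), add(b11, b12, 1));
        double[][] m7 = multiply(add(a12, a22, -1), add(b21, b22, 1));
        double[][] c = new double[n][n];
        paste(c, add(add(m1, m4, 1), add(m7, m5, -1), 1), 0, 0); // c11
        paste(c, add(m3, m5, 1), 0, h);                          // c12
        paste(c, add(m2, m4, 1), h, 0);                          // c21
        paste(c, add(add(m1, m2, -1), add(m3, m6, 1), 1), h, h); // c22
        return c;
    }

    // Copy the h-by-h block whose top-left corner is (r, c).
    private static double[][] block(double[][] m, int r, int c, int h) {
        double[][] out = new double[h][h];
        for (int i = 0; i < h; i++)
            System.arraycopy(m[r + i], c, out[i], 0, h);
        return out;
    }

    // Entrywise x + sign * y.
    private static double[][] add(double[][] x, double[][] y, int sign) {
        int n = x.length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                out[i][j] = x[i][j] + sign * y[i][j];
        return out;
    }

    // Write an h-by-h block into c with top-left corner at (r, col).
    private static void paste(double[][] c, double[][] blk, int r, int col) {
        int h = blk.length;
        for (int i = 0; i < h; i++)
            System.arraycopy(blk[i], 0, c[r + i], col, h);
    }
}
----

The extra additions at each level are also where the numerical stability
concerns come in.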
>
>> but I recognize that the AMC devs likely have better things to do.
>>
>> Observation 2: BlockRealMatrix offers little improvement when multiplying
>> newly created matrices, but some improvement when multiplying the same
>> matrices repeatedly.
>>
>
> Why would someone do that?
>
Good question! Maybe if matrices are being selected/changed at runtime and
someone can't be bothered to check/cache?
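
If someone did want that, a memoized conversion is cheap to add. A sketch
(the class and the identity-keyed map are my own choices, and it only pays
off if callers really do reuse the same unmodified instances):

----
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;
import org.apache.commons.math3.linear.BlockRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;

// Memoize the RealMatrix -> BlockRealMatrix conversion so that repeatedly
// multiplying the same instances pays the conversion cost only once.
public class BlockCache {
    private final Map<RealMatrix, BlockRealMatrix> cache =
            Collections.synchronizedMap(new IdentityHashMap<>());

    public BlockRealMatrix toBlock(RealMatrix m) {
        return (m instanceof BlockRealMatrix)
                ? (BlockRealMatrix) m
                : cache.computeIfAbsent(m, k -> new BlockRealMatrix(k.getData()));
    }

    public RealMatrix multiply(RealMatrix m0, RealMatrix m1) {
        return toBlock(m0).multiply(toBlock(m1));
    }
}
----

Identity keys keep lookups cheap; either way the cache assumes matrices are
treated as immutable once built.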
>
>> Observation 3: OJAlgo doesn't scale better asymptotically from what I can
>> tell. That's no great surprise based on
>> https://github.com/optimatika/ojAlgo/wiki/Multiply. Scala Breeze w/native
>> libs scales closer to O(N^2.4).
>>
>
> Details on how you arrive at those numbers would really be a useful
> addition to the CM documentation.
>
I can share code snippets. I'd guess that Breeze actually scales w/N^3 in a
single thread and that the faster-seeming scaling is related to
parallelization, e.g., adding threads for larger matrices. The exponent
estimate itself is nothing fancy: fit a line to log(time) vs. log(N) and
read off the slope.
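
A sketch of that fit, assuming (size, seconds) pairs like the sweep above
produces; the sample numbers in main are made up purely to illustrate a
roughly cubic trend:

----
// If time ~ c * N^k, then log(time) = log(c) + k * log(N), so an ordinary
// least-squares line through the log-log points has slope k.
public class ScalingExponent {
    public static double exponent(double[] sizes, double[] seconds) {
        int n = sizes.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double x = Math.log(sizes[i]);
            double y = Math.log(seconds[i]);
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }

    public static void main(String[] args) {
        // Fabricated, roughly cubic timings: doubling N multiplies time by 8.
        double[] sizes = { 100, 200, 400, 800 };
        double[] secs = { 0.001, 0.008, 0.064, 0.512 };
        System.out.printf("estimated exponent: %.2f%n", exponent(sizes, secs));
    }
}
----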
>
>> Observation 4: Breeze with native libraries does better than one might
>> guess based on some previous benchmarks. People still might want to use
>> math commons, not least because those native libs might be encumbered
>> with undesirable licenses.
>>
>
> Do they use multi-threading?
>
Yes, they do. That's worth noting, since in some applications running
entire separate pipelines in parallel will be more efficient than
parallelizing individual matrix ops.
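
For instance, with many independent medium-sized problems, one pipeline per
core with single-threaded matrix ops inside often wins. A sketch, where
runPipeline is just a stand-in for real per-dataset work:

----
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Run whole independent pipelines in parallel, one task per dataset,
// keeping the linear algebra inside each task single-threaded.
public class PipelineParallelism {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Callable<Double>> tasks = IntStream.range(0, 32)
                .mapToObj(i -> (Callable<Double>) () -> runPipeline(i))
                .collect(Collectors.toList());
        for (Future<Double> f : pool.invokeAll(tasks)) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }

    // Placeholder for a real single-threaded pipeline (build, multiply,
    // invert, ... on one dataset).
    static double runPipeline(int seed) {
        return seed * 2.0;
    }
}
----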
>
> Regards,
> Gilles
>
>
>
>> [^1]:
>>
>>
>> https://git-wip-us.apache.org/repos/asf?p=commons-math.git;a=blob;f=src/main/java/org/apache/commons/math4/linear/Array2DRowRealMatrix.java;h=3778db56ba406beb973e1355234593bc006adb59;hb=HEAD
>>
>>
>> On 7 April 2016 at 16:50, Gilles <[email protected]> wrote:
>>
>> On Thu, 7 Apr 2016 16:17:52 +0100, Chris Lucas wrote:
>>>
>>>> Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
>>>> comparing the two, along with some other libraries.
>>>>
>>>>
>>> The email is not really suited for tables (see your message below).
>>> What are the points which you want to make?
>>>
>>> As said before, "reports" on CM features (a.o. benchmarks) are welcome but
>>> should be formatted for inclusion in the userguide (for now, until we might
>>> create a separate document for data that is subject to change more or less
>>> rapidly).
>>>
>>> If you suspect a problem with the code, then I suggest that you file a JIRA
>>> report, where files can be attached (or tables can be viewed without being
>>> deformed thanks to the "{noformat}" tag).
>>>
>>> Regards,
>>> Gilles
>>>
>>>
>>>> I'm using jmh 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmups and 10
>>>> test iterations.
>>>>
>>>> The suffixes denote what's being tested:
>>>> MC: AMC 3.6.1 using Array2DRowRealMatrix
>>>> SB: Scala Breeze 0.12 with native libraries
>>>> OJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especially
>>>> speedy pure-java library.
>>>> BMC: MC using a BlockRealMatrix.
>>>>
>>>> I'm using the same matrix creation/multiplication/inverse code as I
>>>> mentioned in my previous note. When testing BlockRealMatrices, I'm using
>>>> fixed random matrices and annotating my class with @State(Scope.Benchmark).
>>>> For the curious, my rationale for building matrices on the spot is that I'm
>>>> already caching results pretty heavily and rarely perform the same
>>>> operation on the same matrix twice -- I'm most curious about performance
>>>> in the absence of caching benefits.
>>>>
>>>> Test 1a: Creating and multiplying two random/uniform (i.e., all entries
>>>> drawn from Math.random()) 100x100 matrices:
>>>> [info] MatrixBenchmarks.buildMultTestMC100  thrpt  10   836.579 ±   7.120 ops/s
>>>> [info] MatrixBenchmarks.buildMultTestSB100  thrpt  10  1649.089 ± 170.649 ops/s
>>>> [info] MatrixBenchmarks.buildMultTestOJ100  thrpt  10  1485.014 ±  44.158 ops/s
>>>> [info] MatrixBenchmarks.multBMC100          thrpt  10  1051.055 ±   2.290 ops/s
>>>>
>>>> Test 1b: Creating and multiplying two random/uniform 500x500 matrices:
>>>> [info] MatrixBenchmarks.buildMultTestMC500  thrpt  10   8.997 ± 0.064 ops/s
>>>> [info] MatrixBenchmarks.buildMultTestSB500  thrpt  10  80.558 ± 0.627 ops/s
>>>> [info] MatrixBenchmarks.buildMultTestOJ500  thrpt  10  34.626 ± 2.505 ops/s
>>>> [info] MatrixBenchmarks.multBMC500          thrpt  10   9.132 ± 0.059 ops/s
>>>> [info] MatrixBenchmarks.multCacheBMC500     thrpt  10  13.630 ± 0.107 ops/s  [no matrix creation]
>>>>
>>>> ---
>>>>
>>>> Test 2a: Creating a single 500x500 matrix (to get a sense of whether the
>>>> mult itself is driving differences):
>>>> [info] MatrixBenchmarks.buildMC500  thrpt  10  155.026 ± 2.456 ops/s
>>>> [info] MatrixBenchmarks.buildSB500  thrpt  10  197.257 ± 4.619 ops/s
>>>> [info] MatrixBenchmarks.buildOJ500  thrpt  10  176.036 ± 2.121 ops/s
>>>>
>>>> Test 2b: Creating a single 1000x1000 matrix (to show it scales w/N):
>>>> [info] MatrixBenchmarks.buildMC1000   thrpt  10  36.112 ± 2.775 ops/s
>>>> [info] MatrixBenchmarks.buildSB1000   thrpt  10  45.557 ± 2.893 ops/s
>>>> [info] MatrixBenchmarks.buildOJ1000   thrpt  10  47.438 ± 1.370 ops/s
>>>> [info] MatrixBenchmarks.buildBMC1000  thrpt  10  37.779 ± 0.871 ops/s
>>>>
>>>> ---
>>>>
>>>> Test 3a: Inverting a random 100x100 matrix:
>>>> [info] MatrixBenchmarks.invMC100   thrpt  10  1033.011 ±  9.928 ops/s
>>>> [info] MatrixBenchmarks.invSB100   thrpt  10  1664.833 ± 15.170 ops/s
>>>> [info] MatrixBenchmarks.invOJ100   thrpt  10  1174.044 ± 52.083 ops/s
>>>> [info] MatrixBenchmarks.invBMC100  thrpt  10  1043.858 ± 22.144 ops/s
>>>>
>>>> Test 3b: Inverting a random 500x500 matrix:
>>>> [info] MatrixBenchmarks.invMC500       thrpt  10   9.089 ± 0.284 ops/s
>>>> [info] MatrixBenchmarks.invSB500       thrpt  10  43.857 ± 1.083 ops/s
>>>> [info] MatrixBenchmarks.invOJ500       thrpt  10  23.444 ± 1.484 ops/s
>>>> [info] MatrixBenchmarks.invBMC500      thrpt  10   9.156 ± 0.052 ops/s
>>>> [info] MatrixBenchmarks.invCacheBMC500 thrpt  10   9.627 ± 0.084 ops/s  [no matrix creation]
>>>>
>>>> Test 3c: Inverting a random 1000x1000 matrix:
>>>> [info] MatrixBenchmarks.invMC1000   thrpt  10  0.704 ± 0.040 ops/s
>>>> [info] MatrixBenchmarks.invSB1000   thrpt  10  8.611 ± 0.557 ops/s
>>>> [info] MatrixBenchmarks.invOJ1000   thrpt  10  3.539 ± 0.229 ops/s
>>>> [info] MatrixBenchmarks.invBMC1000  thrpt  10  0.691 ± 0.095 ops/s
>>>>
>>>> Also, isn't matrix multiplication at least O(N^2.37), rather than
>>>> O(N^2)?
>>>>
>>>> On 6 April 2016 at 12:55, Chris Lucas <[email protected]> wrote:
>>>>
>>>>> Thanks for the quick reply!
>>>>>
>>>>> I've pasted a small self-contained example at the bottom. It creates the
>>>>> matrices in advance, but nothing meaningful changes if they're created
>>>>> on a per-operation basis.
>>>>>
>>>>> Results for 50 multiplications of [size]x[size] matrices:
>>>>> Size: 10, total time: 0.012 seconds, time per: 0.000 seconds
>>>>> Size: 100, total time: 0.062 seconds, time per: 0.001 seconds
>>>>> Size: 300, total time: 3.050 seconds, time per: 0.061 seconds
>>>>> Size: 500, total time: 15.186 seconds, time per: 0.304 seconds
>>>>> Size: 600, total time: 17.532 seconds, time per: 0.351 seconds
>>>>>
>>>>> For comparison:
>>>>>
>>>>> Results for 50 additions of [size]x[size] matrices (which should be
>>>>> faster, but the extent of the difference is nonetheless striking to me):
>>>>> Size: 10, total time: 0.011 seconds, time per: 0.000 seconds
>>>>> Size: 100, total time: 0.012 seconds, time per: 0.000 seconds
>>>>> Size: 300, total time: 0.020 seconds, time per: 0.000 seconds
>>>>> Size: 500, total time: 0.032 seconds, time per: 0.001 seconds
>>>>> Size: 600, total time: 0.050 seconds, time per: 0.001 seconds
>>>>>
>>>>> Results for 50 inversions of a [size]x[size] matrix, which I'd expect
>>>>> to be slower than multiplication for larger matrices:
>>>>> Size: 10, total time: 0.014 seconds, time per: 0.000 seconds
>>>>> Size: 100, total time: 0.074 seconds, time per: 0.001 seconds
>>>>> Size: 300, total time: 1.005 seconds, time per: 0.020 seconds
>>>>> Size: 500, total time: 5.024 seconds, time per: 0.100 seconds
>>>>> Size: 600, total time: 9.297 seconds, time per: 0.186 seconds
>>>>>
>>>>> I hope this is useful, and if I'm doing something wrong that's leading
>>>>> to this performance gap, I'd love to know.
>>>>>
>>>>> ----
>>>>> import org.apache.commons.math3.linear.Array2DRowRealMatrix;
>>>>> import org.apache.commons.math3.linear.LUDecomposition;
>>>>> import org.apache.commons.math3.linear.RealMatrix;
>>>>>
>>>>> public class AMCMatrices {
>>>>>
>>>>>     public static void main(String[] args) {
>>>>>         miniTest(0); // 0 = multiply, 1 = add, 2 = invert
>>>>>     }
>>>>>
>>>>>     public static void miniTest(int tType) {
>>>>>         int samples = 50;
>>>>>         int[] sizes = { 10, 100, 300, 500, 600 };
>>>>>
>>>>>         for (int sI = 0; sI < sizes.length; sI++) {
>>>>>             int mSize = sizes[sI];
>>>>>
>>>>>             RealMatrix m0 = buildM(mSize, mSize);
>>>>>             RealMatrix m1 = buildM(mSize, mSize);
>>>>>
>>>>>             long start = System.nanoTime();
>>>>>             for (int n = 0; n < samples; n++) {
>>>>>                 switch (tType) {
>>>>>                 case 0:
>>>>>                     m0.multiply(m1);
>>>>>                     break;
>>>>>                 case 1:
>>>>>                     m0.add(m1);
>>>>>                     break;
>>>>>                 case 2:
>>>>>                     new LUDecomposition(m0).getSolver().getInverse();
>>>>>                     break;
>>>>>                 }
>>>>>             }
>>>>>             long end = System.nanoTime();
>>>>>
>>>>>             double dt = ((double) (end - start)) / 1E9;
>>>>>             System.out.println(String.format(
>>>>>                 "Size: %d, total time: %3.3f seconds, time per: %3.3f seconds",
>>>>>                 mSize, dt, dt / samples));
>>>>>         }
>>>>>     }
>>>>>
>>>>>     // Build an r x c matrix with entries drawn uniformly from [0, 1).
>>>>>     public static Array2DRowRealMatrix buildM(int r, int c) {
>>>>>         double[][] matVals = new double[r][c];
>>>>>         for (int i = 0; i < r; i++) {
>>>>>             for (int j = 0; j < c; j++) {
>>>>>                 matVals[i][j] = Math.random();
>>>>>             }
>>>>>         }
>>>>>         return new Array2DRowRealMatrix(matVals);
>>>>>     }
>>>>> }
>>>>>
>>>>> ----
>>>>>
>>>>> On 5 April 2016 at 19:36, Gilles <[email protected]> wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> On Tue, 5 Apr 2016 15:43:04 +0100, Chris Lucas wrote:
>>>>>>
>>>>>>> I recently ran a benchmark comparing the performance of math commons
>>>>>>> 3.6.1's linear algebra library to that of scala Breeze
>>>>>>> (https://github.com/scalanlp/breeze).
>>>>>>>
>>>>>>> I looked at det, inverse, Cholesky factorization, addition, and
>>>>>>> multiplication, including matrices of size 10, 100, 500, and 1000,
>>>>>>> with symmetric, non-symmetric, and non-square cases where applicable.
>>>>>>>
>>>>>>>
>>>>>> It would be interesting to add this to the CM documentation:
>>>>>> https://issues.apache.org/jira/browse/MATH-1194
>>>>>>
>>>>>>> In general, I was pleasantly surprised: math commons performed about as
>>>>>>> well as Breeze, despite the latter relying on native libraries. There was
>>>>>>> one exception, however:
>>>>>>>
>>>>>>> m0.multiply(m1)
>>>>>>>
>>>>>>> where m0 and m1 are both Array2DRowRealMatrix instances. It scaled very
>>>>>>> poorly in math commons, being much slower than nominally more expensive
>>>>>>> operations like inv, and much slower than the Breeze implementation.
>>>>>>> Does anyone have a thought as to what's going on?
>>>>>>>
>>>>>> Could you provide more information, such as a plot of performance vs.
>>>>>> size? A self-contained and minimal code to run would be nice too.
>>>>>> See the CM micro-benchmarking tool here:
>>>>>>
>>>>>> https://github.com/apache/commons-math/blob/master/src/test/java/org/apache/commons/math4/PerfTestUtils.java
>>>>>>
>>>>>> And an example of how to use it:
>>>>>>
>>>>>> https://github.com/apache/commons-math/blob/master/src/userguide/java/org/apache/commons/math4/userguide/FastMathTestPerformance.java
>>>>>>
>>>>>>> In case it's useful, one representative test involves multiplying two
>>>>>>> instances of
>>>>>>>
>>>>>>> new Array2DRowRealMatrix(matVals)
>>>>>>>
>>>>>>> where matVals is 1000x1000 entries of Math.random() and the second
>>>>>>> instance is created as part of the loop. This part of the benchmark is
>>>>>>> not specific to the expensive multiplication step, and takes very little
>>>>>>> time relative to the multiplication itself. I'm using System.nanoTime()
>>>>>>> for the timing, and take the average time over several consecutive
>>>>>>> iterations, on a 3.5 GHz Intel Core i7, Oracle JRE (build 1.8.0_05-b13).
>>>>>>>
>>>>>>>
>>>>>> You might want to confirm the behaviour using JMH (becoming the Java
>>>>>> standard for benchmarking):
>>>>>> http://openjdk.java.net/projects/code-tools/jmh/
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Chris
>