On Thu, 7 Apr 2016 19:25:43 +0100, Chris Lucas wrote:
On 7 April 2016 at 18:41, Gilles <gil...@harfang.homelinux.org> wrote:It's certainly worth ascertaining with more data points (to make a niceplot).I can run mult on both for {10, 100, 500, 1000, 1500, 2000, 2500, 5000} ifthat would help.
That would help if you are willing to also format the results and enhance
the CM userguide. ;-)
Naturally, better asymptoticperformance would be nice,What do you mean?When I wrote that I was thinking it might make sense to use Strassen's algorithm (or something else with better asymptotic performance; this isn't my area of expertise) for large square matrices. After looking a bit at the literature it seems unlikely to be worth it, given numerical stabilityconcerns etc.
Thanks for checking!It would also help if this information is retained in the doc so that the
same questioning does not pop every couple of years.
but I recognize that the AMC devs likely havebetter things to do.Observation 2: When multiplying new matrices, BlockMatrix offers someimprovement when multiplying the same matrices repeatedlyWhy would someone do that?Good question! Maybe if matrices are being selected/changed at runtime andsomeone can't be bothered to check/cache?
Having immutable implementation(s) could perhaps help in some cases (?).
None of the regular CM contributors have had use-cases for enhancing algorithms/data structures that would deal with large datasets. [Up to contemplating dropping the sparse matrix implementation, due to identified shortcomings (cf. JIRA/ML archive).] Perhaps, this wold also require a lot work to reinvent a wheel that is out of scope for CM (?).
Observation 3: OJAlgo doesn't scale better asymptotically from what I cantell. That's no great surprise based on https://github.com/optimatika/ojAlgo/wiki/Multiply. <https://github.com/optimatika/ojAlgo/wiki/Multiply> Scala Breeze w/native libs scales closer to O(N^2.4).Details on how you arrive to those numbers would really be a useful addition to the CM documentation.I can share code snippets. I'd guess that breeze actually scales w/N^3 in asingle thread and that the faster-seeming scaling is related to parallelization, e.g., adding threads for larger matrices.
If performance can be improved through parallelization, it would certainly
be useful to know the expected gain, and open a "Wish" request in JIRA. There are several features in CM where parallelism is the obvious next thing to do (e.g. FFT), either internally, or by providing interface (and thread-safe instances) to be used by external framework. Unfortunately, I must state that (almost) nobody seems interested in enhancing CM!
Observation 4: Breeze with native libraries does better than one mightguess based on some previous benchmarks. People still might want to use math commons, not least because those native libs might be encumbered withundesirable licenses.Do they use multi-threading?Yes, they do. I suppose that's important to note given that there are some applications where running entire separate pipelines in parallel will bemore efficient than having parallel matrix ops.
There are many issues with CM's "linear" package; see e.g. https://issues.apache.org/jira/browse/MATH-765A clean-up should certainly consider parallelism as a requirement for a new
design. The task is obviously daunting given that the MATH-765 issue has been open for more than 4 years now... :-/ Regards, Gilles
[^1]: https://git-wip-us.apache.org/repos/asf?p=commons-math.git;a=blob;f=src/main/java/org/apache/commons/math4/linear/Array2DRowRealMatrix.java;h=3778db56ba406beb973e1355234593bc006adb59;hb=HEADOn 7 April 2016 at 16:50, Gilles <gil...@harfang.homelinux.org> wrote:On Thu, 7 Apr 2016 16:17:52 +0100, Chris Lucas wrote:Thanks for suggesting the BlockRealMatrix. I've run a few benchmarksThe email is not really suited for tables (see your message below).comparing the two, along with some other libraries.What are the points which you want to make?As said before, "reports" on CM features (a.o. benchmarks) are welcomebutshould be formatted for inclusion in the userguide (for now, until wemightcreate a separate document for data that is subject to change more orless rapidly).If you suspect a problem with the code, then I suggest that you file aJIRAreport, where files can be attached (or tables can be viewed withoutbeing deformed thanks to the "{noformat}" tag). Regards, GillesI'm using jmh 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmups and 10testiterations. The suffixes denote what's being tested: MC: AMC 3.6.1 using Array2DRowRealMatrix SB: Scala Breeze 0.12 with native librariesOJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especiallyspeedy pure-java library. BMC: MC using a BlockRealMatrix.I'm using the same matrix creation/multiplication/inverse code as I mentioned in my previous note. When testing BlockReadMatrices, I'm usingfixed random matrices and annotating my class with @State(Scope.Benchmark).For the curious, my rationale for building matrices on the spot is thatI'malready caching results pretty heavily and rarely perform the same operation on the same matrix twice -- I'm most curious about performancein the absence of caching benefits.Test 1a: Creating and multiplying two random/uniform (i.e., all entriesdrawn from math.random()) 100x100 matrices:[info] MatrixBenchmarks.buildMultTestMC100 thrpt 10 836.579± 7.120 ops/s[info] MatrixBenchmarks.buildMultTestSB100 thrpt 10 1649.089± 170.649 ops/s[info] MatrixBenchmarks.buildMultTestOJ100 thrpt 10 1485.014 ±44.158 ops/s[info] MatrixBenchmarks.multBMC100 thrpt 10 1051.055 ± 2.290ops/sTest 1b: Creating and multiplying two random/uniform 500x500 matrices: [info] MatrixBenchmarks.buildMultTestMC500 thrpt 10 8.997± 0.064 ops/s[info] MatrixBenchmarks.buildMultTestSB500 thrpt 10 80.558± 0.627 ops/s[info] MatrixBenchmarks.buildMultTestOJ500 thrpt 10 34.626± 2.505 ops/s[info] MatrixBenchmarks.multBMC500 thrpt 10 9.132 ± 0.059ops/s[info] MatrixBenchmarks.multCacheBMC500 thrpt 10 13.630 ± 0.107ops/s [no matrix creation] ---Test 2a: Creating a single 500x500 matrix (to get a sense of whether themult itself is driving differences:[info] MatrixBenchmarks.buildMC500 thrpt 10 155.026± 2.456 ops/s[info] MatrixBenchmarks.buildSB500 thrpt 10 197.257± 4.619 ops/s[info] MatrixBenchmarks.buildOJ500 thrpt 10 176.036± 2.121 ops/sTest 2b: Creating a single 1000x1000 matrix (to show it scales w/N): [info] MatrixBenchmarks.buildMC1000 thrpt 10 36.112 ± 2.775ops/s[info] MatrixBenchmarks.buildSB1000 thrpt 10 45.557 ± 2.893ops/s[info] MatrixBenchmarks.buildOJ1000 thrpt 10 47.438 ± 1.370ops/s[info] MatrixBenchmarks.buildBMC1000 thrpt 10 37.779 ± 0.871ops/s --- Test 3a: Inverting a random 100x100 matrix:[info] MatrixBenchmarks.invMC100 thrpt 10 1033.011 ± 9.928 ops/s [info] MatrixBenchmarks.invSB100 thrpt 10 1664.833 ± 15.170 ops/s [info] MatrixBenchmarks.invOJ100 thrpt 10 1174.044 ± 52.083 ops/s [info] MatrixBenchmarks.invBMC100 thrpt 10 1043.858 ± 22.144ops/s Test 3b: Inverting a random 500x500 matrix:[info] MatrixBenchmarks.invMC500 thrpt 10 9.089 ±0.284 ops/s[info] MatrixBenchmarks.invSB500 thrpt 10 43.857 ±1.083 ops/s[info] MatrixBenchmarks.invOJ500 thrpt 10 23.444 ±1.484 ops/s[info] MatrixBenchmarks.invBMC500 thrpt 10 9.156 ± 0.052ops/s[info] MatrixBenchmarks.invCacheBMC500 thrpt 10 9.627 ± 0.084ops/s [no matrix creation] Test 3c:[info] MatrixBenchmarks.invMC1000 thrpt 10 0.704 ± 0.040 ops/s [info] MatrixBenchmarks.invSB1000 thrpt 10 8.611 ±0.557 ops/s[info] MatrixBenchmarks.invOJ1000 thrpt 10 3.539 ± 0.229 ops/s [info] MatrixBenchmarks.invBMC1000 thrpt 10 0.691 ± 0.095ops/s Also, isn't matrix multiplication at least O(N^2.37), rather than O(N^2)?On 6 April 2016 at 12:55, Chris Lucas <cluc...@inf.ed.ac.uk> wrote:Thanks for the quick reply!I've pasted a small self-contained example at the bottom. It createsthematrices in advance, but nothing meaningful changes if they're createdon a per-operation basis. Results for 50 multiplications of [size]x[size] matrices: Size: 10, total time: 0.012 seconds, time per: 0.000 seconds Size: 100, total time: 0.062 seconds, time per: 0.001 seconds Size: 300, total time: 3.050 seconds, time per: 0.061 secondsSize: 500, total time: 15.186 seconds, time per: 0.304 seconds Size: 600, total time: 17.532 seconds, time per: 0.351 secondsFor comparison:Results for 50 additions of [size]x[size] matrices (which should be faster, be the extent of the difference is nonetheless striking to me):Size: 10, total time: 0.011 seconds, time per: 0.000 seconds Size: 100, total time: 0.012 seconds, time per: 0.000 seconds Size: 300, total time: 0.020 seconds, time per: 0.000 seconds Size: 500, total time: 0.032 seconds, time per: 0.001 seconds Size: 600, total time: 0.050 seconds, time per: 0.001 secondsResults for 50 inversions of a [size]x[size] matrix, which I'd expectto be slower than multiplication for larger matrices: Size: 10, total time: 0.014 seconds, time per: 0.000 seconds Size: 100, total time: 0.074 seconds, time per: 0.001 seconds Size: 300, total time: 1.005 seconds, time per: 0.020 seconds Size: 500, total time: 5.024 seconds, time per: 0.100 seconds Size: 600, total time: 9.297 seconds, time per: 0.186 secondsI hope this is useful, and if I'm doing something wrong that's leadingto this performance gap, I'd love to know. ---- import org.apache.commons.math3.linear.LUDecomposition; import org.apache.commons.math3.linear.Array2DRowRealMatrix; import org.apache.commons.math3.linear.RealMatrix; public class AMCMatrices { public static void main(String[] args) { miniTest(0); } public static void miniTest(int tType) { int samples = 50; int sizes[] = { 10, 100, 300, 500, 600 }; for (int sI = 0; sI < sizes.length; sI++) { int mSize = sizes[sI];org.apache.commons.math3.linear.RealMatrix m0 = buildM(mSize,mSize); RealMatrix m1 = buildM(mSize, mSize); long start = System.nanoTime(); for (int n = 0; n < samples; n++) { switch (tType) { case 0: m0.multiply(m1); break; case 1: m0.add(m1); break; case 2: new LUDecomposition(m0).getSolver().getInverse(); break; } } long end = System.nanoTime(); double dt = ((double) (end - start)) / 1E9; System.out.println(String.format( "Size: %d, total time: %3.3f seconds, time per: %3.3f seconds", mSize, dt, dt / samples)); } } public static Array2DRowRealMatrix buildM(int r, int c) { double[][] matVals = new double[r][c]; for (int i = 0; i < r; i++) { for (int j = 0; j < c; j++) { matVals[i][j] = Math.random(); } } return new Array2DRowRealMatrix(matVals); } } ----On 5 April 2016 at 19:36, Gilles <gil...@harfang.homelinux.org> wrote:Hi.On Tue, 5 Apr 2016 15:43:04 +0100, Chris Lucas wrote:I recently ran a benchmark comparing the performance math commons3.6.1's linear algebra library to the that of scala Breeze ( https://github.com/scalanlp/breeze).I looked at det, inverse, Cholesky factorization, addition, andmultiplication, including matrices with 10, 100, 500, and 1000 elements,with symmetric, non-symmetric, and non-square cases where applicable.It would be interesting to add this to the CM documentation:https://issues.apache.org/jira/browse/MATH-1194In general, I was pleasantly surprised: math commons performed aboutaswell as Breeze, despite the latter relying on native libraries. Therewas one exception, however: m0.multiply(m1)where m0 and m1 are both Array2DRowRealMatrix instances. It scaledvery poorly in math commons, being much slower than nominally more expensiveoperations like inv and the Breeze implementation. Does anyone have athought as to what's going on?Could your provide more information such as a plot of performance vssize? A self-contained and minimal code to run would be nice too. See the CM micro-benchmarking tool here: https://github.com/apache/commons-math/blob/master/src/test/java/org/apache/commons/math4/PerfTestUtils.java And an example of how to use it: https://github.com/apache/commons-math/blob/master/src/userguide/java/org/apache/commons/math4/userguide/FastMathTestPerformance.java In case it's useful, one representative test involves multiplying two instances ofnew Array2DRowRealMatrix(matVals)where matVals is 1000x1000 entries of math.random and the secondinstanceis created as part of the loop. This part of the benchmark is notspecificto the expensive multiplication step, and takes very little timerelativeto the multiplication itself. I'm using System.nanotime for thetiming, andtake the average time over several consecutive iterations, on a 3.5GHz Intel Core i7, Oracle JRE (build 1.8.0_05-b13).You might want to confirm the behaviour using JMH (becoming the Javastandard for benchmarking): http://openjdk.java.net/projects/code-tools/jmh/ Best regards, Gilles Thanks,Chris
--------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@commons.apache.org For additional commands, e-mail: user-h...@commons.apache.org