OK, then let's sort this out one by one:

1) Benchmarks: There are a couple of things to be aware of for these native/Java benchmarks. First, please specify k as the number of logical cores on your machine and use a sufficiently large heap with Xms=Xmx and Xmn=0.1*Xmx. Second, exclude from the measurements the initial warmup runs (JIT compilation) as well as outliers where GC happened; a sketch of such a harness follows below.
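
To make this concrete, here is a minimal sketch of such a harness (the warmup/run counts and the multiply() placeholder are just illustrative assumptions); with an 8 GB heap this would be launched with e.g. -Xms8g -Xmx8g -Xmn800m:

  // Hypothetical harness: discard warmup runs (JIT compilation) and report the
  // median of the remaining measurements to dampen GC outliers.
  def benchmark(warmup: Int, runs: Int)(multiply: () => Unit): Double = {
    for (_ <- 0 until warmup) multiply()            // not measured
    val times = (0 until runs).map { _ =>
      val t0 = System.nanoTime()
      multiply()
      (System.nanoTime() - t0) / 1e6                // milliseconds
    }.sorted
    times(times.length / 2)                         // median
  }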

2) Breeze Comparison: Please also get the Breeze numbers without native BLAS libraries as another baseline on a comparable runtime platform (see the snippet below for how the pure-JVM fallback can be forced).
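
If Breeze picks up BLAS via netlib-java (as in the linked test), the pure-JVM baseline can, to my understanding, be forced with the standard netlib-java fallback properties, set before the first BLAS/LAPACK call:

  // Force netlib-java to its pure-Java F2J backend instead of a native BLAS;
  // alternatively pass these as -D flags on the command line.
  System.setProperty("com.github.fommil.netlib.BLAS",   "com.github.fommil.netlib.F2jBLAS")
  System.setProperty("com.github.fommil.netlib.LAPACK", "com.github.fommil.netlib.F2jLAPACK")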

3) Bigger Picture: Just to clarify the overall question here - of course native BLAS libraries are expected to be faster for square (or similar) dense matrix multiply, as current JDKs usually generate only scalar but not packed SIMD instructions for these operations. How much faster depends on the architecture: on older architectures with 128-bit and 256-bit vector units it was not too problematic, but the trend toward wider vector units continues, and hence it is worth thinking about if nothing happens on the JDK front. The reasons why we decided on platform independence in the past were as follows:

(a) Square dense matrix multiply is not a common operation (other than in DL). Much more common are memory-bandwidth-bound matrix-vector multiplications, where copying the data out to a native library actually leads to a 3x slowdown.

(b) In end-to-end algorithms, especially in large-scale scenarios, we often see other factors dominating performance.

(c) Keeping the build and deployment simple, without a dependency on native libraries, was the logical conclusion given (a) and (b).

(d) There are also workarounds: a user can always define an external function (as we did in the past for certain LAPACK functions) and call whatever library she wants there - see the sketch below.
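
To illustrate (d), the body of such an external function could call into whatever BLAS is installed, for example through netlib-java. A rough sketch (not SystemML's actual external function API, and assuming column-major n x n double[] inputs):

  import com.github.fommil.netlib.BLAS

  // C = A * B via dgemm on column-major buffers; whether OpenBLAS, MKL, or the
  // pure-Java F2J implementation backs this call is decided by netlib-java.
  def nativeMatMult(a: Array[Double], b: Array[Double], n: Int): Array[Double] = {
    val c = new Array[Double](n * n)
    BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
    c
  }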


Regards,
Matthias

On 12/1/2016 12:27 AM, fschue...@posteo.de wrote:
This is the printout from 50 iterations with timings decommented:

MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 465.897145
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 389.913848
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 426.539142
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 391.878792
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 349.830464
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 284.751495
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 337.790165
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 363.655144
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 334.348717
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 745.822571
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 1257.83537
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 313.253455
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 268.226473
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 252.079117
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.162898
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 257.962804
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 279.462628
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.553724
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.316559
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.755306
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.528604
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.022494
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.964251
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 246.011221
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 309.174575
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.311429
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.97415
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 256.096419
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.975642
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.577342
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 287.840992
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.495411
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 253.541925
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.485217
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.114958
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.231448
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.012622
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 267.912608
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 264.265422
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 276.937746
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 261.649393
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.334056
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 258.506884
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 243.960491
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.801208
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 271.235477
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 275.290229
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.290325
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 265.851277
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.902494

On 12/1/2016 12:08 AM, Matthias Boehm wrote:
Could you please make sure you're comparing the right thing? Even on
old Sandy Bridge CPUs, our matrix mult for 1kx1k usually takes 40-50ms.
We also did the same experiments with larger matrices, and SystemML was
about 2x faster compared to Breeze. Please decomment the timings in
LibMatrixMult.matrixMult and double-check the timing, as well as that
we're actually comparing dense matrix multiply.
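
For reference, an isolated double check along these lines could look roughly like the sketch below (it assumes the multi-threaded matrixMult(m1, m2, ret, k) variant and uses quickSetValue only to force fully dense inputs):

  import org.apache.sysml.runtime.matrix.data.{LibMatrixMult, MatrixBlock}

  val n = 1000
  val rnd = new scala.util.Random(42)
  val a = new MatrixBlock(n, n, false)   // false = dense
  val b = new MatrixBlock(n, n, false)
  for (i <- 0 until n; j <- 0 until n) {
    a.quickSetValue(i, j, rnd.nextDouble())
    b.quickSetValue(i, j, rnd.nextDouble())
  }
  val c = new MatrixBlock(n, n, false)
  val t0 = System.nanoTime()
  LibMatrixMult.matrixMult(a, b, c, 8)   // k = number of logical cores
  println(s"dense 1kx1k MM in ${(System.nanoTime() - t0) / 1e6} ms")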

Regards,
Matthias

On 11/30/2016 11:54 PM, fschue...@posteo.de wrote:
Hi all,

I have run a very quick comparison between SystemML's LibMatrixMult and
Breeze matrix multiplication using native BLAS (OpenBLAS through
netlib-java). In this very small comparison I see a performance
difference for dense-dense matrices of size 1000 x 1000 (our default
blocksize), with Breeze being about 5-6 times faster. The code I used
can be found here:
https://github.com/fschueler/incubator-systemml/blob/model_types/src/test/scala/org/apache/sysml/api/linalg/layout/local/SystemMLLocalBackendTest.scala

Running this code with 50 iterations each gives me for example average
times of:
Breeze:         49.74 ms
SystemML:   363.44 ms
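
For context, the Breeze side of the measurement boils down to something like the following sketch (simplified; the full test with both libraries is in the linked file, and whether this hits native BLAS or the pure-JVM fallback depends on what netlib-java finds at runtime):

  import breeze.linalg._

  val A = DenseMatrix.rand[Double](1000, 1000)
  val B = DenseMatrix.rand[Double](1000, 1000)
  val t0 = System.nanoTime()
  val C = A * B                          // dense-dense 1000 x 1000 multiply
  println(s"Breeze 1kx1k MM in ${(System.nanoTime() - t0) / 1e6} ms")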

I don't want to claim this is true for every operation, but these
results support the hypothesis that native BLAS operations can lead to
a significant speedup for certain operations, which is worth testing
with more advanced benchmarks.

Btw: I am definitely not saying we should use Breeze here. I am more
looking at native BLAS and LAPACK implementations in general (as
provided by OpenBLAS, MKL, etc.).

Let me know what you think!
Felix

