This is the printout from 50 iterations with the timings uncommented:

MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 465.897145
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 389.913848
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 426.539142
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 391.878792
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 349.830464
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 284.751495
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 337.790165
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 363.655144
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 334.348717
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 745.822571
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 1257.83537
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 313.253455
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 268.226473
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 252.079117
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.162898
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 257.962804
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 279.462628
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.553724
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.316559
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.755306
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.528604
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.022494
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.964251
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 246.011221
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 309.174575
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.311429
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.97415
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 256.096419
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.975642
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.577342
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 287.840992
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.495411
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 253.541925
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.485217
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.114958
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.231448
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.012622
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 267.912608
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 264.265422
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 276.937746
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 261.649393
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.334056
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 258.506884
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 243.960491
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.801208
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 271.235477
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 275.290229
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.290325
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 265.851277
MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.902494
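
For reference, a minimal standalone sketch in Scala of how one could reproduce such a timing against LibMatrixMult directly. This is not the code that produced the numbers above (that comes from the linked test); it assumes SystemML's MatrixBlock(rows, cols, sparse) constructor, MatrixBlock.quickSetValue, and the multi-threaded LibMatrixMult.matrixMult(m1, m2, ret, k) overload:

  import org.apache.sysml.runtime.matrix.data.{LibMatrixMult, MatrixBlock}
  import scala.util.Random

  val n = 1000
  val rnd = new Random(42)

  // build a dense 1000 x 1000 MatrixBlock filled with random values
  def randomBlock(): MatrixBlock = {
    val mb = new MatrixBlock(n, n, false)   // false = dense
    for (i <- 0 until n; j <- 0 until n)
      mb.quickSetValue(i, j, rnd.nextDouble())
    mb
  }

  val a   = randomBlock()
  val b   = randomBlock()
  val ret = new MatrixBlock(n, n, false)

  // time the multi-threaded dense matrix multiply (k = 8 threads, as in the log above)
  val t0 = System.nanoTime()
  LibMatrixMult.matrixMult(a, b, ret, 8)
  println(s"MM took ${(System.nanoTime() - t0) / 1e6} ms")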

On 01.12.2016 00:08, Matthias Boehm wrote:
Could you please make sure you're comparing the right thing? Even on
old Sandy Bridge CPUs our matrix mult for 1k x 1k usually takes 40-50 ms.
We also ran the same experiments with larger matrices, and SystemML was
about 2x faster than Breeze. Please uncomment the timings in
LibMatrixMult.matrixMult and double-check the timing, as well as that
we're actually comparing a dense matrix multiply.

Regards,
Matthias

On 11/30/2016 11:54 PM, fschue...@posteo.de wrote:
Hi all,

I have run a very quick comparison between SystemML's LibMatrixMult and
Breeze matrix multiplication using native BLAS (OpenBLAS through
netlib-java). In this small experiment I see a performance difference
for dense-dense matrices of size 1000 x 1000 (our default blocksize),
with Breeze being about 5-6 times faster. The code I used can be found
here:
https://github.com/fschueler/incubator-systemml/blob/model_types/src/test/scala/org/apache/sysml/api/linalg/layout/local/SystemMLLocalBackendTest.scala
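
In essence, the Breeze side of that test is just a timed dense-dense multiply, along these lines (a simplified sketch rather than the exact test code; it assumes breeze-natives is on the classpath so that netlib-java can pick up OpenBLAS):

  import breeze.linalg.DenseMatrix

  // two dense 1000 x 1000 matrices with random entries
  val a = DenseMatrix.rand(1000, 1000)
  val b = DenseMatrix.rand(1000, 1000)

  // dense-dense multiply; Breeze dispatches this to BLAS dgemm via netlib-java
  val t0 = System.nanoTime()
  val c = a * b
  println(s"Breeze MM took ${(System.nanoTime() - t0) / 1e6} ms, c(0,0) = ${c(0, 0)}")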


Running this code with 50 iterations each gives, for example, average
times of:
Breeze:    49.74 ms
SystemML: 363.44 ms

I don't want to claim this holds for every operation, but these results
support the hypothesis that native BLAS operations can lead to a
significant speedup for certain operations, which is worth testing with
more thorough benchmarks.

Btw: I am definitely not saying we should use Breeze here; I am looking
more at native BLAS and LAPACK implementations in general (as provided
by OpenBLAS, MKL, etc.).
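
If we go in that direction, we should also verify at runtime which BLAS netlib-java actually loaded; something like the following sketch (assuming the netlib-java 1.1.x API) prints the backend class:

  import com.github.fommil.netlib.BLAS

  // prints e.g. com.github.fommil.netlib.NativeSystemBLAS when a system BLAS
  // such as OpenBLAS or MKL is loaded, or ...F2jBLAS for the pure-Java fallback
  println(BLAS.getInstance().getClass.getName)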

Let me know what you think!
Felix
