Thanks Matthias for your prompt response. Rana
On Thu, Jun 18, 2020 at 9:00 PM Rana Alotaibi <ralot...@eng.ucsd.edu> wrote: > Hi SystemML developers, > > I have been experimenting with SystemML lately. I found some interesting > numbers when evaluating these two expressions: *t(M%*%N)* and *t(N)%*% > t(M)* (The size of matrix M is 20Kx100 and N is 100x20K). The pipeline > *t(N)%*% > t(M) *drastically outperforms* t(M%*%N)*. > > - *t(M%*%N)* > - The multiplication took ~2.53 secs > - The transpose took ~ 1.41 (The size of the intermediate *M%*%N *is > 20Kx20K) > - *t(N)%*% t(M) * > - The multiplication took ~0.002 secs (given that the size of t(N) is > 20Kx100 and t(M) is 100x20K ) > - The total transpose time for both matrices is 0.023 secs > > So, I'm interested to know why the matrix multiplication in this pipeline > *t(N)%*% > t(M)* is faster than *t(M%*%N)*? and in general why *t(N)%*% t(M)* is > faster than *t(M%*%N)*? > > *Experiments Configurations**: * > > - I'm using systemml-1.2.0 (Binary tgz) and working on a single node. > - CPU E5-2640 v4 @ 2.40GHz, 40Cores > - RAM : 123GB > > *t(N)%*%t(M) DML Script* > | time = externalFunction(Integer i) return (Double B) > | implemented in > (classname="org.apache.sysml.udf.lib.TimeWrapper", exectype="mem"); > | times = matrix(0.0, rows = 5, cols = 1) > | M = rand(rows = 20000, cols = 100, pdf = 'uniform') > | N = rand(rows = 100, cols = 20000, pdf = 'uniform') > | for (ix in 1:5) { > | start = time(1) > | print("StartTime: "+start) > | res = t(N)%*%t(M) > | print(as.scalar(res[1,1])) > | stop = time(1) > | print("EndTime:- "+stop) > | pipTime= (stop - start) / 1000 > | print ("Time:"+ pipTime) > | times[ix,1] = pipTime > | } > | times = t(times) > > *t(M%*%N) DML Script* > | time = externalFunction(Integer i) return (Double B) > | implemented in > (classname="org.apache.sysml.udf.lib.TimeWrapper", exectype="mem"); > | times = matrix(0.0, rows = 5, cols = 1) > | M = rand(rows = 20000, cols = 100, pdf = 'uniform') > | N = rand(rows = 100, cols = 20000, pdf = 'uniform') > | for (ix in 1:5) { > | start = time(1) > | print("StartTime: "+start) > | res = t(M%*%N) > | print(as.scalar(res[1,1])) > | end = time(1) > | print("EndTime: "+end) > | pipTime = (end - start)/1000 > | print ("Time:"+ pipTime) > | times[ix,1] = pipTime > | } > | times = t(times) > > *t(N)%*%t(M) RUN TIME EXPLAIN * > # Memory Budget local/remote = 63716MB/140MB/140MB > # Degree of Parallelism (vcores) local/remote = 40/1/1 > PROGRAM ( size CP/MR = 58/0 ) > --FUNCTIONS > ----FUNCTION CALL GRAPH > ------MAIN PROGRAM > --------.defaultNS::time > ----EXTERNAL FUNCTION .defaultNS::time > --MAIN PROGRAM > ----GENERIC (lines 4-6) [recompile=false] > ------CP createvar _mVar0 scratch_space//_p32870_192.168.250.59//_t0/temp0 > true MATRIX binaryblock 5 1 1000 1000 0 copy > ------CP rand 5.SCALAR.INT.true 1.SCALAR.INT.true 1000 1000 0.0 0.0 1.0 -1 > uniform 1.0 40 _mVar0.MATRIX.DOUBLE > ------CP createvar _mVar1 scratch_space//_p32870_192.168.250.59//_t0/temp1 > true MATRIX binaryblock 20000 100 1000 1000 2000000 copy > ------CP rand 20000.SCALAR.INT.true 100.SCALAR.INT.true 1000 1000 0.0 1.0 > 1.0 -1 uniform 1.0 40 _mVar1.MATRIX.DOUBLE > ------CP createvar _mVar2 scratch_space//_p32870_192.168.250.59//_t0/temp2 > true MATRIX binaryblock 100 20000 1000 1000 2000000 copy > ------CP rand 100.SCALAR.INT.true 20000.SCALAR.INT.true 1000 1000 0.0 1.0 > 1.0 -1 uniform 1.0 40 _mVar2.MATRIX.DOUBLE > ------CP cpvar _mVar0 times > ------CP cpvar _mVar1 M > ------CP cpvar _mVar2 N > ------CP rmvar _mVar0 _mVar1 _mVar2 > ----FOR (lines 0-0) [in-place=[times]] > ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT > ------CP assignvar 5.SCALAR.INT.true __pred.SCALAR.INT > ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT > ------GENERIC (lines 8-8) [recompile=false] > --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true start > ------GENERIC (lines 9-12) [recompile=false] > --------CP + StartTime: .SCALAR.STRING.true start.SCALAR.DOUBLE.false > _Var3.SCALAR.STRING > --------CP createvar _mVar4 > scratch_space//_p32870_192.168.250.59//_t0/temp3 true MATRIX binarycell > 20000 100 -1 -1 2000000 copy > --------CP r' N.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 40 > --------CP createvar _mVar5 > scratch_space//_p32870_192.168.250.59//_t0/temp4 true MATRIX binarycell 1 > 100 -1 -1 -1 copy > --------CP rightIndex _mVar4.MATRIX.DOUBLE 1.SCALAR.INT.true > 1.SCALAR.INT.true 1.SCALAR.INT.true 100.SCALAR.INT.true _mVar5.MATRIX.DOUBLE > --------CP rmvar _mVar4 > --------CP createvar _mVar6 > scratch_space//_p32870_192.168.250.59//_t0/temp5 true MATRIX binarycell 100 > 20000 -1 -1 2000000 copy > --------CP r' M.MATRIX.DOUBLE _mVar6.MATRIX.DOUBLE 40 > --------CP createvar _mVar7 > scratch_space//_p32870_192.168.250.59//_t0/temp6 true MATRIX binarycell 100 > 1 -1 -1 -1 copy > --------CP rightIndex _mVar6.MATRIX.DOUBLE 1.SCALAR.INT.true > 100.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar7.MATRIX.DOUBLE > --------CP rmvar _mVar6 > --------CP createvar _mVar8 > scratch_space//_p32870_192.168.250.59//_t0/temp7 true MATRIX binarycell 1 1 > -1 -1 -1 copy > --------CP ba+* _mVar5.MATRIX.DOUBLE _mVar7.MATRIX.DOUBLE > _mVar8.MATRIX.DOUBLE 40 > --------CP rmvar _mVar5 _mVar7 > --------CP createvar _mVar9 > scratch_space//_p32870_192.168.250.59//_t0/temp8 true MATRIX binarycell 1 1 > -1 -1 -1 copy > --------CP rightIndex _mVar8.MATRIX.DOUBLE 1.SCALAR.INT.true > 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar9.MATRIX.DOUBLE > --------CP rmvar _mVar8 > --------CP castdts _mVar9.MATRIX.DOUBLE.false _Var10.SCALAR.STRING > --------CP rmvar _mVar9 > --------CP print _Var3.SCALAR.STRING.false _Var11.SCALAR.STRING > --------CP rmvar _Var3 > --------CP print _Var10.SCALAR.STRING.false _Var12.SCALAR.STRING > --------CP rmvar _Var10 > --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true stop > --------CP rmvar _Var11 _Var12 > ------GENERIC (lines 13-16) [recompile=false] > --------CP + EndTime:- .SCALAR.STRING.true stop.SCALAR.DOUBLE.false > _Var13.SCALAR.STRING > --------CP - stop.SCALAR.DOUBLE.false start.SCALAR.DOUBLE.false > _Var14.SCALAR.DOUBLE > --------CP / _Var14.SCALAR.DOUBLE.false 1000.SCALAR.INT.true > _Var15.SCALAR.DOUBLE > --------CP rmvar _Var14 > --------CP + Time:.SCALAR.STRING.true _Var15.SCALAR.DOUBLE.false > _Var16.SCALAR.STRING > --------CP createvar _mVar17 > scratch_space//_p32870_192.168.250.59//_t0/temp9 true MATRIX binarycell 5 1 > -1 -1 -1 copy > --------CP leftIndex times.MATRIX.DOUBLE _Var15.SCALAR.DOUBLE.false > ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true > _mVar17.MATRIX.DOUBLE > --------CP rmvar _Var15 > --------CP print _Var13.SCALAR.STRING.false _Var18.SCALAR.STRING > --------CP rmvar _Var13 > --------CP print _Var16.SCALAR.STRING.false _Var19.SCALAR.STRING > --------CP rmvar _Var16 times _Var18 _Var19 > --------CP cpvar _mVar17 times > --------CP rmvar _mVar17 start stop > ----GENERIC (lines 18-19) [recompile=false] > ------CP createvar _mVar20 > scratch_space//_p32870_192.168.250.59//_t0/temp10 true MATRIX binarycell 1 > 5 -1 -1 -1 copy > ------CP r' times.MATRIX.DOUBLE _mVar20.MATRIX.DOUBLE 40 > ------CP rmvar times > ------CP cpvar _mVar20 times > ------CP rmvar _mVar20 times > > SystemML Statistics: > Total elapsed time: 1.178 sec. > Total compilation time: 0.591 sec. > Total execution time: 0.587 sec. > Number of compiled MR Jobs: 0. > Number of executed MR Jobs: 0. > Cache hits (Mem, WB, FS, HDFS): 47/0/0/0. > Cache writes (WB, FS, HDFS): 12/0/0. > Cache times (ACQr/m, RLS, EXP): 0.001/0.001/0.002/0.000 sec. > HOP DAGs recompiled (PRED, SB): 0/0. > HOP DAGs recompile time: 0.000 sec. > Total JIT compile time: 29.332 sec. > Total JVM GC count: 8. > Total JVM GC time: 0.38 sec. > Heavy hitter instructions: > # Instruction Time(s) Count > 1 rand 0.307 3 > 2 r' 0.256 11 > 3 ba+* 0.010 5 > 4 time 0.004 10 > 5 rmvar 0.003 67 > 6 rightIndex 0.002 15 > 7 print 0.001 20 > 8 createvar 0.001 39 > 9 leftIndex 0.000 5 > 10 + 0.000 15 > > *t(M%*%N) RUN TIME EXPLAIN * > # Memory Budget local/remote = 63716MB/140MB/140MB > # Degree of Parallelism (vcores) local/remote = 40/1/1 > PROGRAM ( size CP/MR = 50/0 ) > --FUNCTIONS > ----FUNCTION CALL GRAPH > ------MAIN PROGRAM > --------.defaultNS::time > ----EXTERNAL FUNCTION .defaultNS::time > --MAIN PROGRAM > ----GENERIC (lines 4-6) [recompile=false] > ------CP createvar _mVar0 scratch_space//_p32616_192.168.250.59//_t0/temp0 > true MATRIX binaryblock 5 1 1000 1000 0 copy > ------CP rand 5.SCALAR.INT.true 1.SCALAR.INT.true 1000 1000 0.0 0.0 1.0 -1 > uniform 1.0 40 _mVar0.MATRIX.DOUBLE > ------CP createvar _mVar1 scratch_space//_p32616_192.168.250.59//_t0/temp1 > true MATRIX binaryblock 20000 100 1000 1000 2000000 copy > ------CP rand 20000.SCALAR.INT.true 100.SCALAR.INT.true 1000 1000 0.0 1.0 > 1.0 -1 uniform 1.0 40 _mVar1.MATRIX.DOUBLE > ------CP createvar _mVar2 scratch_space//_p32616_192.168.250.59//_t0/temp2 > true MATRIX binaryblock 100 20000 1000 1000 2000000 copy > ------CP rand 100.SCALAR.INT.true 20000.SCALAR.INT.true 1000 1000 0.0 1.0 > 1.0 -1 uniform 1.0 40 _mVar2.MATRIX.DOUBLE > ------CP cpvar _mVar0 times > ------CP cpvar _mVar1 M > ------CP cpvar _mVar2 N > ------CP rmvar _mVar0 _mVar1 _mVar2 > ----FOR (lines 0-0) [in-place=[times]] > ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT > ------CP assignvar 5.SCALAR.INT.true __pred.SCALAR.INT > ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT > ------GENERIC (lines 8-8) [recompile=false] > --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true start > ------GENERIC (lines 9-12) [recompile=false] > --------CP + StartTime: .SCALAR.STRING.true start.SCALAR.DOUBLE.false > _Var3.SCALAR.STRING > --------CP createvar _mVar4 > scratch_space//_p32616_192.168.250.59//_t0/temp3 true MATRIX binarycell > 20000 20000 -1 -1 -1 copy > --------CP ba+* M.MATRIX.DOUBLE N.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 40 > --------CP createvar _mVar5 > scratch_space//_p32616_192.168.250.59//_t0/temp4 true MATRIX binarycell > 20000 20000 -1 -1 -1 copy > --------CP r' _mVar4.MATRIX.DOUBLE _mVar5.MATRIX.DOUBLE 40 > --------CP rmvar _mVar4 > --------CP createvar _mVar6 > scratch_space//_p32616_192.168.250.59//_t0/temp5 true MATRIX binarycell 1 1 > -1 -1 -1 copy > --------CP rightIndex _mVar5.MATRIX.DOUBLE 1.SCALAR.INT.true > 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar6.MATRIX.DOUBLE > --------CP rmvar _mVar5 > --------CP castdts _mVar6.MATRIX.DOUBLE.false _Var7.SCALAR.STRING > --------CP rmvar _mVar6 > --------CP print _Var3.SCALAR.STRING.false _Var8.SCALAR.STRING > --------CP rmvar _Var3 > --------CP print _Var7.SCALAR.STRING.false _Var9.SCALAR.STRING > --------CP rmvar _Var7 > --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true end > --------CP rmvar _Var8 _Var9 > ------GENERIC (lines 13-16) [recompile=false] > --------CP + EndTime: .SCALAR.STRING.true end.SCALAR.DOUBLE.false > _Var10.SCALAR.STRING > --------CP - end.SCALAR.DOUBLE.false start.SCALAR.DOUBLE.false > _Var11.SCALAR.DOUBLE > --------CP / _Var11.SCALAR.DOUBLE.false 1000.SCALAR.INT.true > _Var12.SCALAR.DOUBLE > --------CP rmvar _Var11 > --------CP + Time:.SCALAR.STRING.true _Var12.SCALAR.DOUBLE.false > _Var13.SCALAR.STRING > --------CP createvar _mVar14 > scratch_space//_p32616_192.168.250.59//_t0/temp6 true MATRIX binarycell 5 1 > -1 -1 -1 copy > --------CP leftIndex times.MATRIX.DOUBLE _Var12.SCALAR.DOUBLE.false > ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true > _mVar14.MATRIX.DOUBLE > --------CP rmvar _Var12 > --------CP print _Var10.SCALAR.STRING.false _Var15.SCALAR.STRING > --------CP rmvar _Var10 > --------CP print _Var13.SCALAR.STRING.false _Var16.SCALAR.STRING > --------CP rmvar _Var13 times _Var15 _Var16 > --------CP cpvar _mVar14 times > --------CP rmvar _mVar14 start end > ----GENERIC (lines 18-19) [recompile=false] > ------CP createvar _mVar17 > scratch_space//_p32616_192.168.250.59//_t0/temp7 true MATRIX binarycell 1 5 > -1 -1 -1 copy > ------CP r' times.MATRIX.DOUBLE _mVar17.MATRIX.DOUBLE 40 > ------CP rmvar times > ------CP cpvar _mVar17 times > ------CP rmvar _mVar17 times > > SystemML Statistics: > Total elapsed time: 22.074 sec. > Total compilation time: 0.567 sec. > Total execution time: 21.507 sec. > Number of compiled MR Jobs: 0. > Number of executed MR Jobs: 0. > Cache hits (Mem, WB, FS, HDFS): 32/0/0/0. > Cache writes (WB, FS, HDFS): 12/0/0. > Cache times (ACQr/m, RLS, EXP): 0.000/0.001/0.003/0.000 sec. > HOP DAGs recompiled (PRED, SB): 0/0. > HOP DAGs recompile time: 0.000 sec. > Total JIT compile time: 33.155 sec. > Total JVM GC count: 7. > Total JVM GC time: 0.279 sec. > Heavy hitter instructions: > # Instruction Time(s) Count > 1 ba+* 12.661 5 > 2 r' 8.496 6 > 3 rand 0.334 3 > 4 time 0.006 10 > 5 rmvar 0.003 57 > 6 rightIndex 0.002 5 > 7 print 0.001 20 > 8 createvar 0.001 24 > 9 + 0.000 15 > 10 leftIndex 0.000 5 > > > > > Thanks, > Rana >