Thanks Matthias for your prompt response.

Rana

On Thu, Jun 18, 2020 at 9:00 PM Rana Alotaibi <ralot...@eng.ucsd.edu> wrote:

> Hi SystemML developers,
>
> I have been experimenting with SystemML lately. I found some interesting
> numbers when evaluating these two expressions: *t(M%*%N)* and *t(N)%*%
> t(M)* (The size of matrix M is 20Kx100 and N is 100x20K). The pipeline 
> *t(N)%*%
> t(M) *drastically outperforms* t(M%*%N)*.
>
>    - *t(M%*%N)*
>    - The multiplication took ~2.53 secs
>       - The transpose took ~ 1.41 (The size of the intermediate *M%*%N *is
>       20Kx20K)
>    - *t(N)%*% t(M) *
>    - The multiplication took ~0.002 secs (given that the size of t(N) is
>       20Kx100 and t(M) is 100x20K )
>       - The total transpose time for both matrices is 0.023 secs
>
> So, I'm interested to know why the matrix multiplication in this pipeline 
> *t(N)%*%
> t(M)* is faster than *t(M%*%N)*? and in general why *t(N)%*% t(M)* is
> faster than *t(M%*%N)*?
>
> *Experiments Configurations**: *
>
>    - I'm using  systemml-1.2.0 (Binary tgz) and working on a single node.
>    - CPU E5-2640 v4 @ 2.40GHz, 40Cores
>    - RAM : 123GB
>
>    *t(N)%*%t(M) DML Script*
>     | time = externalFunction(Integer i) return (Double B)
>     |       implemented in
> (classname="org.apache.sysml.udf.lib.TimeWrapper", exectype="mem");
>     |  times = matrix(0.0, rows = 5, cols = 1)
>     |  M = rand(rows = 20000, cols = 100, pdf = 'uniform')
>     |  N = rand(rows = 100, cols = 20000, pdf = 'uniform')
>     | for (ix in 1:5) {
>     |   start = time(1)
>     |   print("StartTime: "+start)
>     |   res = t(N)%*%t(M)
>     |   print(as.scalar(res[1,1]))
>     |   stop = time(1)
>     |   print("EndTime:- "+stop)
>     |   pipTime= (stop - start) / 1000
>     |   print ("Time:"+ pipTime)
>     |   times[ix,1] = pipTime
>     | }
>     | times = t(times)
>
>     *t(M%*%N) DML Script*
>     | time = externalFunction(Integer i) return (Double B)
>     |       implemented in
> (classname="org.apache.sysml.udf.lib.TimeWrapper", exectype="mem");
>     |  times = matrix(0.0, rows = 5, cols = 1)
>     |  M = rand(rows = 20000, cols = 100, pdf = 'uniform')
>     |  N = rand(rows = 100, cols = 20000, pdf = 'uniform')
>     | for (ix in 1:5) {
>     |   start = time(1)
>     |   print("StartTime: "+start)
>     |   res = t(M%*%N)
>     |   print(as.scalar(res[1,1]))
>     |   end = time(1)
>     |   print("EndTime: "+end)
>     |   pipTime = (end - start)/1000
>     |   print ("Time:"+ pipTime)
>     |   times[ix,1] = pipTime
>     | }
>     | times = t(times)
>
> *t(N)%*%t(M) RUN TIME EXPLAIN *
> # Memory Budget local/remote = 63716MB/140MB/140MB
> # Degree of Parallelism (vcores) local/remote = 40/1/1
> PROGRAM ( size CP/MR = 58/0 )
> --FUNCTIONS
> ----FUNCTION CALL GRAPH
> ------MAIN PROGRAM
> --------.defaultNS::time
> ----EXTERNAL FUNCTION .defaultNS::time
> --MAIN PROGRAM
> ----GENERIC (lines 4-6) [recompile=false]
> ------CP createvar _mVar0 scratch_space//_p32870_192.168.250.59//_t0/temp0
> true MATRIX binaryblock 5 1 1000 1000 0 copy
> ------CP rand 5.SCALAR.INT.true 1.SCALAR.INT.true 1000 1000 0.0 0.0 1.0 -1
> uniform 1.0 40 _mVar0.MATRIX.DOUBLE
> ------CP createvar _mVar1 scratch_space//_p32870_192.168.250.59//_t0/temp1
> true MATRIX binaryblock 20000 100 1000 1000 2000000 copy
> ------CP rand 20000.SCALAR.INT.true 100.SCALAR.INT.true 1000 1000 0.0 1.0
> 1.0 -1 uniform 1.0 40 _mVar1.MATRIX.DOUBLE
> ------CP createvar _mVar2 scratch_space//_p32870_192.168.250.59//_t0/temp2
> true MATRIX binaryblock 100 20000 1000 1000 2000000 copy
> ------CP rand 100.SCALAR.INT.true 20000.SCALAR.INT.true 1000 1000 0.0 1.0
> 1.0 -1 uniform 1.0 40 _mVar2.MATRIX.DOUBLE
> ------CP cpvar _mVar0 times
> ------CP cpvar _mVar1 M
> ------CP cpvar _mVar2 N
> ------CP rmvar _mVar0 _mVar1 _mVar2
> ----FOR (lines 0-0) [in-place=[times]]
> ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
> ------CP assignvar 5.SCALAR.INT.true __pred.SCALAR.INT
> ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
> ------GENERIC (lines 8-8) [recompile=false]
> --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true start
> ------GENERIC (lines 9-12) [recompile=false]
> --------CP + StartTime: .SCALAR.STRING.true start.SCALAR.DOUBLE.false
> _Var3.SCALAR.STRING
> --------CP createvar _mVar4
> scratch_space//_p32870_192.168.250.59//_t0/temp3 true MATRIX binarycell
> 20000 100 -1 -1 2000000 copy
> --------CP r' N.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 40
> --------CP createvar _mVar5
> scratch_space//_p32870_192.168.250.59//_t0/temp4 true MATRIX binarycell 1
> 100 -1 -1 -1 copy
> --------CP rightIndex _mVar4.MATRIX.DOUBLE 1.SCALAR.INT.true
> 1.SCALAR.INT.true 1.SCALAR.INT.true 100.SCALAR.INT.true _mVar5.MATRIX.DOUBLE
> --------CP rmvar _mVar4
> --------CP createvar _mVar6
> scratch_space//_p32870_192.168.250.59//_t0/temp5 true MATRIX binarycell 100
> 20000 -1 -1 2000000 copy
> --------CP r' M.MATRIX.DOUBLE _mVar6.MATRIX.DOUBLE 40
> --------CP createvar _mVar7
> scratch_space//_p32870_192.168.250.59//_t0/temp6 true MATRIX binarycell 100
> 1 -1 -1 -1 copy
> --------CP rightIndex _mVar6.MATRIX.DOUBLE 1.SCALAR.INT.true
> 100.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar7.MATRIX.DOUBLE
> --------CP rmvar _mVar6
> --------CP createvar _mVar8
> scratch_space//_p32870_192.168.250.59//_t0/temp7 true MATRIX binarycell 1 1
> -1 -1 -1 copy
> --------CP ba+* _mVar5.MATRIX.DOUBLE _mVar7.MATRIX.DOUBLE
> _mVar8.MATRIX.DOUBLE 40
> --------CP rmvar _mVar5 _mVar7
> --------CP createvar _mVar9
> scratch_space//_p32870_192.168.250.59//_t0/temp8 true MATRIX binarycell 1 1
> -1 -1 -1 copy
> --------CP rightIndex _mVar8.MATRIX.DOUBLE 1.SCALAR.INT.true
> 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar9.MATRIX.DOUBLE
> --------CP rmvar _mVar8
> --------CP castdts _mVar9.MATRIX.DOUBLE.false _Var10.SCALAR.STRING
> --------CP rmvar _mVar9
> --------CP print _Var3.SCALAR.STRING.false _Var11.SCALAR.STRING
> --------CP rmvar _Var3
> --------CP print _Var10.SCALAR.STRING.false _Var12.SCALAR.STRING
> --------CP rmvar _Var10
> --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true stop
> --------CP rmvar _Var11 _Var12
> ------GENERIC (lines 13-16) [recompile=false]
> --------CP + EndTime:- .SCALAR.STRING.true stop.SCALAR.DOUBLE.false
> _Var13.SCALAR.STRING
> --------CP - stop.SCALAR.DOUBLE.false start.SCALAR.DOUBLE.false
> _Var14.SCALAR.DOUBLE
> --------CP / _Var14.SCALAR.DOUBLE.false 1000.SCALAR.INT.true
> _Var15.SCALAR.DOUBLE
> --------CP rmvar _Var14
> --------CP + Time:.SCALAR.STRING.true _Var15.SCALAR.DOUBLE.false
> _Var16.SCALAR.STRING
> --------CP createvar _mVar17
> scratch_space//_p32870_192.168.250.59//_t0/temp9 true MATRIX binarycell 5 1
> -1 -1 -1 copy
> --------CP leftIndex times.MATRIX.DOUBLE _Var15.SCALAR.DOUBLE.false
> ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true
> _mVar17.MATRIX.DOUBLE
> --------CP rmvar _Var15
> --------CP print _Var13.SCALAR.STRING.false _Var18.SCALAR.STRING
> --------CP rmvar _Var13
> --------CP print _Var16.SCALAR.STRING.false _Var19.SCALAR.STRING
> --------CP rmvar _Var16 times _Var18 _Var19
> --------CP cpvar _mVar17 times
> --------CP rmvar _mVar17 start stop
> ----GENERIC (lines 18-19) [recompile=false]
> ------CP createvar _mVar20
> scratch_space//_p32870_192.168.250.59//_t0/temp10 true MATRIX binarycell 1
> 5 -1 -1 -1 copy
> ------CP r' times.MATRIX.DOUBLE _mVar20.MATRIX.DOUBLE 40
> ------CP rmvar times
> ------CP cpvar _mVar20 times
> ------CP rmvar _mVar20 times
>
> SystemML Statistics:
> Total elapsed time: 1.178 sec.
> Total compilation time: 0.591 sec.
> Total execution time: 0.587 sec.
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 47/0/0/0.
> Cache writes (WB, FS, HDFS): 12/0/0.
> Cache times (ACQr/m, RLS, EXP): 0.001/0.001/0.002/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/0.
> HOP DAGs recompile time: 0.000 sec.
> Total JIT compile time: 29.332 sec.
> Total JVM GC count: 8.
> Total JVM GC time: 0.38 sec.
> Heavy hitter instructions:
>   #  Instruction  Time(s)  Count
>   1  rand           0.307      3
>   2  r'             0.256     11
>   3  ba+*           0.010      5
>   4  time           0.004     10
>   5  rmvar          0.003     67
>   6  rightIndex     0.002     15
>   7  print          0.001     20
>   8  createvar      0.001     39
>   9  leftIndex      0.000      5
>  10  +              0.000     15
>
> *t(M%*%N) RUN TIME EXPLAIN *
> # Memory Budget local/remote = 63716MB/140MB/140MB
> # Degree of Parallelism (vcores) local/remote = 40/1/1
> PROGRAM ( size CP/MR = 50/0 )
> --FUNCTIONS
> ----FUNCTION CALL GRAPH
> ------MAIN PROGRAM
> --------.defaultNS::time
> ----EXTERNAL FUNCTION .defaultNS::time
> --MAIN PROGRAM
> ----GENERIC (lines 4-6) [recompile=false]
> ------CP createvar _mVar0 scratch_space//_p32616_192.168.250.59//_t0/temp0
> true MATRIX binaryblock 5 1 1000 1000 0 copy
> ------CP rand 5.SCALAR.INT.true 1.SCALAR.INT.true 1000 1000 0.0 0.0 1.0 -1
> uniform 1.0 40 _mVar0.MATRIX.DOUBLE
> ------CP createvar _mVar1 scratch_space//_p32616_192.168.250.59//_t0/temp1
> true MATRIX binaryblock 20000 100 1000 1000 2000000 copy
> ------CP rand 20000.SCALAR.INT.true 100.SCALAR.INT.true 1000 1000 0.0 1.0
> 1.0 -1 uniform 1.0 40 _mVar1.MATRIX.DOUBLE
> ------CP createvar _mVar2 scratch_space//_p32616_192.168.250.59//_t0/temp2
> true MATRIX binaryblock 100 20000 1000 1000 2000000 copy
> ------CP rand 100.SCALAR.INT.true 20000.SCALAR.INT.true 1000 1000 0.0 1.0
> 1.0 -1 uniform 1.0 40 _mVar2.MATRIX.DOUBLE
> ------CP cpvar _mVar0 times
> ------CP cpvar _mVar1 M
> ------CP cpvar _mVar2 N
> ------CP rmvar _mVar0 _mVar1 _mVar2
> ----FOR (lines 0-0) [in-place=[times]]
> ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
> ------CP assignvar 5.SCALAR.INT.true __pred.SCALAR.INT
> ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
> ------GENERIC (lines 8-8) [recompile=false]
> --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true start
> ------GENERIC (lines 9-12) [recompile=false]
> --------CP + StartTime: .SCALAR.STRING.true start.SCALAR.DOUBLE.false
> _Var3.SCALAR.STRING
> --------CP createvar _mVar4
> scratch_space//_p32616_192.168.250.59//_t0/temp3 true MATRIX binarycell
> 20000 20000 -1 -1 -1 copy
> --------CP ba+* M.MATRIX.DOUBLE N.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 40
> --------CP createvar _mVar5
> scratch_space//_p32616_192.168.250.59//_t0/temp4 true MATRIX binarycell
> 20000 20000 -1 -1 -1 copy
> --------CP r' _mVar4.MATRIX.DOUBLE _mVar5.MATRIX.DOUBLE 40
> --------CP rmvar _mVar4
> --------CP createvar _mVar6
> scratch_space//_p32616_192.168.250.59//_t0/temp5 true MATRIX binarycell 1 1
> -1 -1 -1 copy
> --------CP rightIndex _mVar5.MATRIX.DOUBLE 1.SCALAR.INT.true
> 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar6.MATRIX.DOUBLE
> --------CP rmvar _mVar5
> --------CP castdts _mVar6.MATRIX.DOUBLE.false _Var7.SCALAR.STRING
> --------CP rmvar _mVar6
> --------CP print _Var3.SCALAR.STRING.false _Var8.SCALAR.STRING
> --------CP rmvar _Var3
> --------CP print _Var7.SCALAR.STRING.false _Var9.SCALAR.STRING
> --------CP rmvar _Var7
> --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true end
> --------CP rmvar _Var8 _Var9
> ------GENERIC (lines 13-16) [recompile=false]
> --------CP + EndTime: .SCALAR.STRING.true end.SCALAR.DOUBLE.false
> _Var10.SCALAR.STRING
> --------CP - end.SCALAR.DOUBLE.false start.SCALAR.DOUBLE.false
> _Var11.SCALAR.DOUBLE
> --------CP / _Var11.SCALAR.DOUBLE.false 1000.SCALAR.INT.true
> _Var12.SCALAR.DOUBLE
> --------CP rmvar _Var11
> --------CP + Time:.SCALAR.STRING.true _Var12.SCALAR.DOUBLE.false
> _Var13.SCALAR.STRING
> --------CP createvar _mVar14
> scratch_space//_p32616_192.168.250.59//_t0/temp6 true MATRIX binarycell 5 1
> -1 -1 -1 copy
> --------CP leftIndex times.MATRIX.DOUBLE _Var12.SCALAR.DOUBLE.false
> ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true
> _mVar14.MATRIX.DOUBLE
> --------CP rmvar _Var12
> --------CP print _Var10.SCALAR.STRING.false _Var15.SCALAR.STRING
> --------CP rmvar _Var10
> --------CP print _Var13.SCALAR.STRING.false _Var16.SCALAR.STRING
> --------CP rmvar _Var13 times _Var15 _Var16
> --------CP cpvar _mVar14 times
> --------CP rmvar _mVar14 start end
> ----GENERIC (lines 18-19) [recompile=false]
> ------CP createvar _mVar17
> scratch_space//_p32616_192.168.250.59//_t0/temp7 true MATRIX binarycell 1 5
> -1 -1 -1 copy
> ------CP r' times.MATRIX.DOUBLE _mVar17.MATRIX.DOUBLE 40
> ------CP rmvar times
> ------CP cpvar _mVar17 times
> ------CP rmvar _mVar17 times
>
> SystemML Statistics:
> Total elapsed time: 22.074 sec.
> Total compilation time: 0.567 sec.
> Total execution time: 21.507 sec.
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 32/0/0/0.
> Cache writes (WB, FS, HDFS): 12/0/0.
> Cache times (ACQr/m, RLS, EXP): 0.000/0.001/0.003/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/0.
> HOP DAGs recompile time: 0.000 sec.
> Total JIT compile time: 33.155 sec.
> Total JVM GC count: 7.
> Total JVM GC time: 0.279 sec.
> Heavy hitter instructions:
>   #  Instruction  Time(s)  Count
>   1  ba+*          12.661      5
>   2  r'             8.496      6
>   3  rand           0.334      3
>   4  time           0.006     10
>   5  rmvar          0.003     57
>   6  rightIndex     0.002      5
>   7  print          0.001     20
>   8  createvar      0.001     24
>   9  +              0.000     15
>  10  leftIndex      0.000      5
>
>
>
>
> Thanks,
> Rana
>

Reply via email to