[jira] [Comment Edited] (SYSTEMML-1561) Improve constant folding during compilation

Mike Dusenberry (JIRA) Tue, 09 May 2017 14:04:20 -0700

    [ 
https://issues.apache.org/jira/browse/SYSTEMML-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003546#comment-16003546
 ]


Mike Dusenberry edited comment on SYSTEMML-1561 at 5/9/17 9:04 PM:
-------------------------------------------------------------------

As I noted on SYSTEMML-1566, I ran experiments again using (1) the commit 
before the IPA scalar replacement update, (2) the commit with the IPA scalar 
replacement update, and (3) the proposed commit with the updated constant 
folding (which relies on the IPA update for usefulness), and measured the 
following results:

commit 2c5c3b14e1906cda70ae1581b19a5e908b3ab329 (pre IPA update)
{code}
17/05/05 14:39:49 INFO ScriptExecutorUtils: SystemML Statistics:
Total elapsed time:             712.183 sec.
Total compilation time:         1.996 sec.
Total execution time:           710.187 sec.
Number of compiled Spark inst:  134.
Number of executed Spark inst:  2513.
Cache hits (Mem, WB, FS, HDFS): 153624/0/0/862.
Cache writes (WB, FS, HDFS):    79043/0/2170.
Cache times (ACQr/m, RLS, EXP): 32.052/0.038/5.508/55.790 sec.
HOP DAGs recompiled (PRED, SB): 0/5979.
HOP DAGs recompile time:        3.670 sec.
Functions recompiled:           10.
Functions recompile time:       0.082 sec.
Spark ctx create time (lazy):   0.959 sec.
Spark trans counts (par,bc,col):347/1649/862.
Spark trans times (par,bc,col): 0.671/25.076/31.988 secs.
Total JIT compile time:         118.9 sec.
Total JVM GC count:             267.
Total JVM GC time:              7.523 sec.
Heavy hitter instructions (name, time, count):
-- 1)   train   671.994 sec     1
-- 2)   conv2d_bias_add         198.398 sec     3298
-- 3)   maxpooling_backward     174.666 sec     1720
-- 4)   predict         140.782 sec     9
-- 5)   sp_mapmm        94.035 sec      1649
-- 6)   conv2d_backward_filter  63.328 sec      1720
-- 7)   sp_sel+         39.259 sec      860
-- 8)   ba+*    18.615 sec      5089
-- 9)   +*      16.627 sec      10320
-- 10)  conv2d_backward_data    14.297 sec      860
{code}

commit abc9686fbaaa11c12cfa02c49c7675165acdf176 (w/ IPA update)
{code}
17/05/05 15:05:16 INFO ScriptExecutorUtils: SystemML Statistics:
Total elapsed time:             673.900 sec.
Total compilation time:         1.938 sec.
Total execution time:           671.962 sec.
Number of compiled Spark inst:  128.
Number of executed Spark inst:  2513.
Cache hits (Mem, WB, FS, HDFS): 153645/0/0/862.
Cache writes (WB, FS, HDFS):    79043/0/2149.
Cache times (ACQr/m, RLS, EXP): 31.568/0.038/4.639/54.790 sec.
HOP DAGs recompiled (PRED, SB): 0/5978.
HOP DAGs recompile time:        3.705 sec.
Functions recompiled:           10.
Functions recompile time:       0.068 sec.
Spark ctx create time (lazy):   0.948 sec.
Spark trans counts (par,bc,col):368/1649/862.
Spark trans times (par,bc,col): 0.689/26.035/31.503 secs.
Total JIT compile time:         111.921 sec.
Total JVM GC count:             265.
Total JVM GC time:              7.118 sec.
Heavy hitter instructions (name, time, count):
-- 1)   train   634.306 sec     1
-- 2)   conv2d_bias_add         190.557 sec     3298
-- 3)   maxpooling_backward     141.588 sec     1720
-- 4)   predict         135.222 sec     9
-- 5)   sp_mapmm        94.025 sec      1649
-- 6)   conv2d_backward_filter  66.058 sec      1720
-- 7)   sp_sel+         39.204 sec      860
-- 8)   +*      18.272 sec      10320
-- 9)   ba+*    15.804 sec      5089
-- 10)  conv2d_backward_data    13.627 sec      860
{code}

w/ updated constant folding
{code}
17/05/05 15:15:19 INFO ScriptExecutorUtils: SystemML Statistics:
Total elapsed time:             405.615 sec.
Total compilation time:         2.070 sec.
Total execution time:           403.545 sec.
Number of compiled Spark inst:  139.
Number of executed Spark inst:  793.
Cache hits (Mem, WB, FS, HDFS): 156654/0/0/2.
Cache writes (WB, FS, HDFS):    79043/0/8.
Cache times (ACQr/m, RLS, EXP): 3.467/0.043/3.566/1.175 sec.
HOP DAGs recompiled (PRED, SB): 0/5978.
HOP DAGs recompile time:        3.178 sec.
Functions recompiled:           10.
Functions recompile time:       0.072 sec.
Spark ctx create time (lazy):   1.024 sec.
Spark trans counts (par,bc,col):789/789/2.
Spark trans times (par,bc,col): 0.982/0.299/3.418 secs.
Total JIT compile time:         145.368 sec.
Total JVM GC count:             438.
Total JVM GC time:              8.992 sec.
Heavy hitter instructions (name, time, count):
-- 1)   train   370.373 sec     1
-- 2)   conv2d_bias_add         178.914 sec     3298
-- 3)   predict         116.145 sec     9
-- 4)   conv2d_backward_filter  55.582 sec      1720
-- 5)   +*      18.948 sec      10320
-- 6)   sel+    18.238 sec      3369
-- 7)   ba+*    16.171 sec      5949
-- 8)   conv2d_backward_data    15.038 sec      860
-- 9)   sp_mapmm        13.980 sec      789
-- 10)  relu_maxpooling         12.415 sec      3298
{code}

With the IPA scalar replacement + constant folding updates, total elapsed time 
drops from ~712s to ~406s, a gain of ~300s and a ~1.75x speedup in this scenario.
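
For reference, the savings and speedup figures can be re-derived from the "Total elapsed time" lines of the three statistics dumps above. A small Python sketch; the dictionary keys are labels chosen here for illustration, not names used by SystemML:

```python
# Total elapsed times (seconds), copied from the three statistics dumps above.
elapsed = {
    "pre_ipa": 712.183,           # commit 2c5c3b14 (pre IPA update)
    "ipa": 673.900,               # commit abc9686f (w/ IPA update)
    "ipa_plus_folding": 405.615,  # w/ updated constant folding
}


def speedup(before, after):
    """Ratio of elapsed times; values above 1.0 mean `after` is faster."""
    return before / after


baseline = elapsed["pre_ipa"]
for label, secs in elapsed.items():
    saved = baseline - secs
    ratio = speedup(baseline, secs)
    print(f"{label:>18}: {secs:8.3f}s  saved {saved:7.3f}s  speedup {ratio:.2f}x")
```

Relative to the pre-IPA baseline, the IPA update alone saves roughly 38s; most of the gain appears only once constant folding can make use of the replaced scalars.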


> Improve constant folding during compilation
> -------------------------------------------
>
>                 Key: SYSTEMML-1561
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1561
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>             Fix For: SystemML 1.0
>
>         Attachments: scenario1_plan.txt, scenario1.py, scenario2_plan.txt, 
> scenario2.py
>
>
> In our `nn` library, our convolution and pooling layers have to pass around 
> the spatial dimensions (height and width) of the images that are stretched 
> out into rows of the input/output matrices.  These output dimensions are 
> computed within the forward functions of the above layers as small scalar 
> equations.  From a mathematical standpoint, these sizes can be determined at 
> compile time, and it is nice to have these size equations in DML (vs. hiding 
> them inside the engine within built-in functions).  However, we do not 
> currently evaluate these expressions during compilation, and thus we are left 
> with unknown sizes even during recompilation.  This naturally leads to max 
> memory estimates and thus often leads to unnecessary distributed runtime ops 
> rather than simple CP ones.
> I have two related scenarios for which this is a problem.  They both involve 
> the {{Houtc1}} & {{Woutc1}} values that are returned from a 
> `conv2d::forward(...)` function.  These represent the spatial dimensions of 
> the volume within each of the rows of the output {{outc1}} of the function, and 
> the third dimension is {{F1}}.  Thus, {{outc1}} has a number of columns equal 
> to {{F1*Houtc1*Woutc1}}.
> In the first scenario ({{scenario1.py}}), a random matrix {{doutc1}} is 
> created that should have the same dimensions as {{outc1}}.  For the columns, 
> if I use {{cols=ncol(outc1)}} in this rand statement, the size will be 
> propagated and CP ops will be compiled and run.  If I instead use 
> {{cols=F1*Houtc1*Woutc1}}, the size will forever be unknown, even during 
> recompilation, and thus Spark ops will be compiled and run.  I have included 
> the recompile hops plan ({{scenario1_plan.txt}}).
> In the second scenario ({{scenario2.py}}), a {{max_pool2d::forward(...)}} 
> function is inserted after the {{conv2d::forward(...)}} function that 
> requires the {{Houtc1}} and {{Woutc1}} variables to be supplied as arguments. 
>  Since those latter variables are not evaluated at compile time, the 
> max pooling sizes remain unknown, even during recompilation, and thus Spark 
> ops will be compiled and run.  I have included the recompile hops plan 
> ({{scenario2_plan.txt}}).
> We should improve our constant folding rewrites so that these 
> scenarios are handled, as they are necessary for performant deep learning 
> applications.  Note too that this issue will be present in other non-deep 
> learning scenarios as well.
> Mailing list thread: 
> https://www.mail-archive.com/dev@systemml.incubator.apache.org/msg01657.html
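
To illustrate the idea in the description above, here is a minimal sketch of scalar replacement followed by constant folding on the column-size expression {{F1*Houtc1*Woutc1}}. It is not SystemML's rewrite: Python's `ast` module stands in for the HOP DAG, the class and function names are hypothetical, and the concrete sizes (32, 28, 28) are made up for the example since the issue gives no values.

```python
import ast
import operator

# Binary operators this sketch knows how to fold.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}


class FoldConstants(ast.NodeTransformer):
    """Replace known scalar names with literals, then fold literal-only subtrees."""

    def __init__(self, known):
        self.known = known  # name -> scalar value known at compile time

    def visit_Name(self, node):
        # Scalar replacement: substitute a literal when the value is known.
        if isinstance(node.ctx, ast.Load) and node.id in self.known:
            return ast.copy_location(ast.Constant(value=self.known[node.id]), node)
        return node

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        fn = _OPS.get(type(node.op))
        if (fn is not None
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)):
            folded = fn(node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value=folded), node)
        return node


def fold(expr, known):
    """Return the expression as source text after replacement and folding."""
    tree = ast.parse(expr, mode="eval")
    tree = FoldConstants(known).visit(tree)
    return ast.unparse(ast.fix_missing_locations(tree))


# Hypothetical sizes; the issue does not give concrete values.
sizes = {"F1": 32, "Houtc1": 28, "Woutc1": 28}
print(fold("F1 * Houtc1 * Woutc1", sizes))       # all inputs known: folds to a literal
print(fold("F1 * Houtc1 * Woutc1", {"F1": 32}))  # Houtc1/Woutc1 unknown: stays symbolic
```

When every scalar is known, the whole expression collapses to a single literal, which is what allows a size to be propagated. When {{Houtc1}} and {{Woutc1}} are not replaced, nothing can be folded and the column count stays unknown, mirroring the two scenarios described in the issue.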



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
