[ 
https://issues.apache.org/jira/browse/SYSTEMML-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Boehm updated SYSTEMML-1336:
-------------------------------------
    Description: 
This task aims to address suboptimal parfor optimizer choices for partitionable 
scenarios with large driver memory. Currently, we only apply partitioning if 
the right indexing operation does not fit in the memory of the driver or remote 
tasks. The execution type selection is then unaware of potential partitioning 
and does not revert this decision. This is problematic because the large input 
likely exceeds the memory budget of remote tasks, ultimately causing the 
optimizer to fall back to a local parfor with a very small degree of 
parallelism k.
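
The decision flow described above can be sketched roughly as follows. This is 
a simplified toy model, not the actual optimizer code: the Scenario fields, 
both function names, the 2GB remote budget, the 0.08GB partition size, the 
core count of 16, and the rule k = floor(driver budget / per-worker memory) 
are all assumptions made for illustration. Only the 8GB input and the 14GB 
driver budget come from the scenario in this issue.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    input_gb: float          # full input touched by the right indexing operation
    partition_gb: float      # one partition of that input (hypothetical value)
    driver_budget_gb: float  # memory budget of the driver
    remote_budget_gb: float  # memory budget of one remote task (hypothetical value)
    max_local_k: int         # upper bound on local parallelism (hypothetical value)


def select_unaware(s):
    """Exec type selection that ignores potential partitioning (assumed model).

    Returns (exec_type, use_partitioning, local_k).
    """
    # Assumption: partitioning is only applied if the input exceeds the driver budget.
    partition = s.input_gb > s.driver_budget_gb
    per_worker_gb = s.partition_gb if partition else s.input_gb
    if per_worker_gb <= s.remote_budget_gb:
        return ("REMOTE", partition, None)
    # Fallback: local parfor, with k limited by how many workers fit in the driver.
    k = max(1, min(s.max_local_k, int(s.driver_budget_gb // per_worker_gb)))
    return ("LOCAL", partition, k)


def select_aware(s):
    """Same selection, but reconsidering partitioning before settling on local."""
    exec_type, partition, k = select_unaware(s)
    if (exec_type == "LOCAL" and k < s.max_local_k
            and s.partition_gb <= s.remote_budget_gb):
        # Partitioning makes the per-task input small enough for remote tasks.
        return ("REMOTE", True, None)
    return (exec_type, partition, k)


# 8GB input and 14GB driver budget are from this issue; the rest is made up.
perftest = Scenario(input_gb=8.0, partition_gb=0.08, driver_budget_gb=14.0,
                    remote_budget_gb=2.0, max_local_k=16)
unaware = select_unaware(perftest)
aware = select_aware(perftest)
```

Under these assumed numbers, the unaware selection ends in a local parfor 
with k=1, while the partitioning-aware variant picks remote execution with 
partitioning.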

On our perftest 8GB Univariate stats scenario (with 20GB driver, i.e., 14GB 
memory budget), this led to a local parfor with k=1 and thus unnecessarily 
high execution time.
{code}
Total elapsed time:             781.233 sec.
Total compilation time:         2.059 sec.
Total execution time:           779.175 sec.
Number of compiled Spark inst:  0.
Number of executed Spark inst:  0.
Cache hits (Mem, WB, FS, HDFS): 27904/0/0/2.
Cache writes (WB, FS, HDFS):    3134/0/1.
Cache times (ACQr/m, RLS, EXP): 9.200/0.022/0.301/0.300 sec.
HOP DAGs recompiled (PRED, SB): 0/100.
HOP DAGs recompile time:        0.238 sec.
Spark ctx create time (lazy):   0.000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
ParFor loops optimized:         1.
ParFor optimize time:           1.985 sec.
ParFor initialize time:         0.007 sec.
ParFor result merge time:       0.003 sec.
ParFor total update in-place:   0/0/13900
Total JIT compile time:         13.542 sec.
Total JVM GC count:             29.
Total JVM GC time:              3.49 sec.
Heavy hitter instructions (name, time, count):
-- 1)   cm      479.000 sec     2700
-- 2)   qsort   228.928 sec     900
-- 3)   qpick   20.598 sec      1800
-- 4)   rangeReIndex    16.051 sec      2999
-- 5)   uamean  12.867 sec      900
-- 6)   uacmax  9.870 sec       1
-- 7)   ctable  3.158 sec       100
-- 8)   uamin   2.589 sec       1000
-- 9)   uamax   2.560 sec       1101
-- 10)  write   0.300 sec       1
{code}
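
As a rough sanity check on the numbers above, the following snippet recomputes 
the budget-to-heap ratio and the share of execution time taken by each heavy 
hitter. All values are copied from this issue; the variable names are made up 
for illustration. Reading a heavy hitter sum close to the total execution time 
as a sign of serial execution (k=1) is an assumption, since it presumes the 
per-instruction timings do not overlap.

```python
# Values copied from the scenario description and statistics output above.
driver_heap_gb = 20.0
memory_budget_gb = 14.0
budget_ratio = memory_budget_gb / driver_heap_gb

execution_time_sec = 779.175
heavy_hitters_sec = {
    "cm": 479.000,
    "qsort": 228.928,
    "qpick": 20.598,
    "rangeReIndex": 16.051,
    "uamean": 12.867,
    "uacmax": 9.870,
    "ctable": 3.158,
    "uamin": 2.589,
    "uamax": 2.560,
    "write": 0.300,
}

# Sum of the top-10 instruction times and each instruction's share of the total.
top10_total_sec = sum(heavy_hitters_sec.values())
shares = {name: t / execution_time_sec for name, t in heavy_hitters_sec.items()}

for name, share in shares.items():
    print(f"{name:14s} {share:6.1%}")
print(f"top-10 total   {top10_total_sec / execution_time_sec:6.1%}")
```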

  was:
This tasks aims to address suboptimal parfor optimizer choices for 
partitionable scenarios with large driver memory. Currently, we only apply 
partitioning, if the right indexing operation does not fit in memory of the 
driver or remote tasks. The execution type selection is then unaware of 
potential partitioning, and does not revert this decision - this is 
problematic, because the large input likely exceeds the memory budget of remote 
tasks, ultimately cause the optimizer to fall back to a local parfor with very 
small degree of parallelism k.

On our perftest 8GB Univariate stats scenario (with 20GB driver, i.e., 14GB 
memory budget), this lead to a local parfor with k=1 and unnecessarily slow 
execution time.
{code}
Total elapsed time:             781.233 sec.
Total compilation time:         2.059 sec.
Total execution time:           779.175 sec.
Number of compiled Spark inst:  0.
Number of executed Spark inst:  0.
Cache hits (Mem, WB, FS, HDFS): 27904/0/0/2.
Cache writes (WB, FS, HDFS):    3134/0/1.
Cache times (ACQr/m, RLS, EXP): 9.200/0.022/0.301/0.300 sec.
HOP DAGs recompiled (PRED, SB): 0/100.
HOP DAGs recompile time:        0.238 sec.
Spark ctx create time (lazy):   0.000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
ParFor loops optimized:         1.
ParFor optimize time:           1.985 sec.
ParFor initialize time:         0.007 sec.
ParFor result merge time:       0.003 sec.
ParFor total update in-place:   0/0/13900
Total JIT compile time:         13.542 sec.
Total JVM GC count:             29.
Total JVM GC time:              3.49 sec.
Heavy hitter instructions (name, time, count):
-- 1)   cm      479.000 sec     2700
-- 2)   qsort   228.928 sec     900
-- 3)   qpick   20.598 sec      1800
-- 4)   rangeReIndex    16.051 sec      2999
-- 5)   uamean  12.867 sec      900
-- 6)   uacmax  9.870 sec       1
-- 7)   ctable  3.158 sec       100
-- 8)   uamin   2.589 sec       1000
-- 9)   uamax   2.560 sec       1101
-- 10)  write   0.300 sec       1
{code}


> Improve parfor exec type selection (w/ potential data partitioning)
> -------------------------------------------------------------------
>
>                 Key: SYSTEMML-1336
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1336
>             Project: SystemML
>          Issue Type: Sub-task
>          Components: Compiler
>            Reporter: Matthias Boehm
>             Fix For: SystemML 1.0
>
>
> This task aims to address suboptimal parfor optimizer choices for 
> partitionable scenarios with large driver memory. Currently, we only apply 
> partitioning if the right indexing operation does not fit in the memory of 
> the driver or remote tasks. The execution type selection is then unaware of 
> potential partitioning and does not revert this decision. This is 
> problematic because the large input likely exceeds the memory budget of 
> remote tasks, ultimately causing the optimizer to fall back to a local parfor 
> with a very small degree of parallelism k.
>
> On our perftest 8GB Univariate stats scenario (with 20GB driver, i.e., 14GB 
> memory budget), this led to a local parfor with k=1 and thus unnecessarily 
> high execution time.
> {code}
> Total elapsed time:             781.233 sec.
> Total compilation time:         2.059 sec.
> Total execution time:           779.175 sec.
> Number of compiled Spark inst:  0.
> Number of executed Spark inst:  0.
> Cache hits (Mem, WB, FS, HDFS): 27904/0/0/2.
> Cache writes (WB, FS, HDFS):    3134/0/1.
> Cache times (ACQr/m, RLS, EXP): 9.200/0.022/0.301/0.300 sec.
> HOP DAGs recompiled (PRED, SB): 0/100.
> HOP DAGs recompile time:        0.238 sec.
> Spark ctx create time (lazy):   0.000 sec.
> Spark trans counts (par,bc,col):0/0/0.
> Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
> ParFor loops optimized:         1.
> ParFor optimize time:           1.985 sec.
> ParFor initialize time:         0.007 sec.
> ParFor result merge time:       0.003 sec.
> ParFor total update in-place:   0/0/13900
> Total JIT compile time:         13.542 sec.
> Total JVM GC count:             29.
> Total JVM GC time:              3.49 sec.
> Heavy hitter instructions (name, time, count):
> -- 1)   cm      479.000 sec     2700
> -- 2)   qsort   228.928 sec     900
> -- 3)   qpick   20.598 sec      1800
> -- 4)   rangeReIndex    16.051 sec      2999
> -- 5)   uamean  12.867 sec      900
> -- 6)   uacmax  9.870 sec       1
> -- 7)   ctable  3.158 sec       100
> -- 8)   uamin   2.589 sec       1000
> -- 9)   uamax   2.560 sec       1101
> -- 10)  write   0.300 sec       1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
