[ https://issues.apache.org/jira/browse/SYSTEMML-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias Boehm updated SYSTEMML-1011:
-------------------------------------
    Description:
All algorithms that support the 'intercept' option (e.g., LinregCG, LinregDS, L2SVM, MSVM, Mlogreg, and GLM) append a column of 1s at the beginning of the script. On large sparse data, this append sometimes dominates end-to-end performance. For example, here are the LinregCG results for a 10Mx1K scenario with sparsity 0.01.
{code}
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 6
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 16
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 16
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
{code}
and here is the related -stats output for ict=1.
{code}
Total elapsed time:             16.893 sec.
Total compilation time:         2.412 sec.
Total execution time:           14.480 sec.
Number of compiled Spark inst:  0.
Number of executed Spark inst:  0.
Cache hits (Mem, WB, FS, HDFS): 172/0/0/2.
Cache writes (WB, FS, HDFS):    77/0/1.
Cache times (ACQr/m, RLS, EXP): 1.734/0.003/2.143/0.209 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time:        0.000 sec.
Spark ctx create time (lazy):   0.000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time:         5.357 sec.
Total JVM GC count:             2.
Total JVM GC time:              5.628 sec.
Heavy hitter instructions (name, time, count):
-- 1)  append   8.595 sec   26
-- 2)  mmchain  4.443 sec   8
-- 3)  ba+*     0.537 sec   10
-- 4)  r'       0.411 sec   10
-- 5)  write    0.210 sec   1
-- 6)  -        0.087 sec   20
-- 7)  uak+     0.059 sec   2
-- 8)  tsmm     0.049 sec   11
-- 9)  rand     0.043 sec   5
-- 10) +*       0.007 sec   24
{code}
The large GC time indicates that sparse row re-allocations are a major issue here. We should compute the joint nnz per output row and allocate each output sparse row just once.
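For illustration, here is a minimal sketch of that single-allocation cbind idea. SparseRowStub and SparseCbindSketch are simplified stand-ins introduced only for this example; they are not the actual SystemML SparseRow/MatrixBlock APIs.
{code}
// Sketch only: compute the joint nnz per output row and allocate each
// output sparse row exactly once, instead of growing it on every append.
class SparseRowStub {
    int size = 0;
    final int[] indexes;
    final double[] values;

    SparseRowStub(int capacity) {       // single allocation with exact capacity
        indexes = new int[capacity];
        values  = new double[capacity];
    }

    void append(int col, double val) {  // no resize/re-allocation on append
        indexes[size] = col;
        values[size]  = val;
        size++;
    }
}

class SparseCbindSketch {
    // cbind of two row-aligned sparse inputs; leftCols is the column count of the left input
    static SparseRowStub[] cbind(SparseRowStub[] left, SparseRowStub[] right, int leftCols) {
        SparseRowStub[] out = new SparseRowStub[left.length];
        for (int i = 0; i < left.length; i++) {
            int lnnz = (left[i]  != null) ? left[i].size  : 0;
            int rnnz = (right[i] != null) ? right[i].size : 0;
            if (lnnz + rnnz == 0)
                continue; // keep empty output rows unallocated
            SparseRowStub row = new SparseRowStub(lnnz + rnnz); // joint nnz, allocated once
            for (int j = 0; j < lnnz; j++)
                row.append(left[i].indexes[j], left[i].values[j]);
            for (int j = 0; j < rnnz; j++)
                row.append(leftCols + right[i].indexes[j], right[i].values[j]); // shift column indexes
            out[i] = row;
        }
        return out;
    }
}
{code}
Because each output row is sized to the joint nnz of its two inputs up front, it is allocated exactly once and never resized, avoiding the repeated sparse-row re-allocations (and the associated GC pressure) visible in the -stats output above.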
> Slow sparse append cbind (sparse row re-allocations)
> ----------------------------------------------------
>
>                 Key: SYSTEMML-1011
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1011
>             Project: SystemML
>          Issue Type: Umbrella
>            Reporter: Matthias Boehm
>            Assignee: Matthias Boehm
>


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)