[ https://issues.apache.org/jira/browse/SYSTEMML-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias Boehm updated SYSTEMML-1011:
-------------------------------------
    Description:
All algorithms that support the 'intercept' option (e.g., LinregCG, LinregDS, L2SVM, MSVM, Mlogreg, and GLM) append a column of 1s at the beginning of the script. On large sparse data, this append sometimes dominates end-to-end performance. For example, here are the LinregCG results for a 10Mx1K scenario with sparsity 0.01.
{code}
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 6
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 16
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 16
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
{code}
and here is the related -stats output for ict=1.
{code}
Total elapsed time:             16.893 sec.
Total compilation time:         2.412 sec.
Total execution time:           14.480 sec.
Number of compiled Spark inst:  0.
Number of executed Spark inst:  0.
Cache hits (Mem, WB, FS, HDFS): 172/0/0/2.
Cache writes (WB, FS, HDFS):    77/0/1.
Cache times (ACQr/m, RLS, EXP): 1.734/0.003/2.143/0.209 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time:        0.000 sec.
Spark ctx create time (lazy):   0.000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time:         5.357 sec.
Total JVM GC count:             2.
Total JVM GC time:              5.628 sec.
Heavy hitter instructions (name, time, count):
-- 1)  append   8.595 sec   26
-- 2)  mmchain  4.443 sec   8
-- 3)  ba+*     0.537 sec   10
-- 4)  r'       0.411 sec   10
-- 5)  write    0.210 sec   1
-- 6)  -        0.087 sec   20
-- 7)  uak+     0.059 sec   2
-- 8)  tsmm     0.049 sec   11
-- 9)  rand     0.043 sec   5
-- 10) +*       0.007 sec   24
{code}
The large GC time indicates that sparse row re-allocations are a major issue here. We should compute the joint nnz per output row and allocate each output sparse row just once.
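For illustration, here is a minimal sketch of that single-allocation cbind idea. SparseRowStub and SparseCbindSketch are simplified stand-ins introduced only for this example; they are not the actual SystemML SparseRow/MatrixBlock APIs.
{code}
// Sketch only: compute the joint nnz per output row and allocate each
// output sparse row exactly once, instead of growing it on every append.
class SparseRowStub {
    int size = 0;
    final int[] indexes;
    final double[] values;

    SparseRowStub(int capacity) {       // single allocation with exact capacity
        indexes = new int[capacity];
        values  = new double[capacity];
    }

    void append(int col, double val) {  // no resize/re-allocation on append
        indexes[size] = col;
        values[size]  = val;
        size++;
    }
}

class SparseCbindSketch {
    // cbind of two row-aligned sparse inputs; leftCols is the column count of the left input
    static SparseRowStub[] cbind(SparseRowStub[] left, SparseRowStub[] right, int leftCols) {
        SparseRowStub[] out = new SparseRowStub[left.length];
        for (int i = 0; i < left.length; i++) {
            int lnnz = (left[i]  != null) ? left[i].size  : 0;
            int rnnz = (right[i] != null) ? right[i].size : 0;
            if (lnnz + rnnz == 0)
                continue; // keep empty output rows unallocated
            SparseRowStub row = new SparseRowStub(lnnz + rnnz); // joint nnz, allocated once
            for (int j = 0; j < lnnz; j++)
                row.append(left[i].indexes[j], left[i].values[j]);
            for (int j = 0; j < rnnz; j++)
                row.append(leftCols + right[i].indexes[j], right[i].values[j]); // shift column indexes
            out[i] = row;
        }
        return out;
    }
}
{code}
Because each output row is sized to the joint nnz of its two inputs up front, it is allocated exactly once and never resized, avoiding the repeated sparse-row re-allocations (and the associated GC pressure) visible in the -stats output above.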
> Slow sparse append cbind (sparse row re-allocations)
> ----------------------------------------------------
>
>                 Key: SYSTEMML-1011
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1011
>             Project: SystemML
>          Issue Type: Umbrella
>            Reporter: Matthias Boehm
>            Assignee: Matthias Boehm
>


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)