[ https://issues.apache.org/jira/browse/SYSTEMML-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563789#comment-15563789 ]
Frederick Reiss commented on SYSTEMML-1025:
-------------------------------------------

Thanks for the analysis, Matthias! Would a more robust version of RDD.coalesce() (i.e. keep partitions local when possible and shuffle the remainder) have helped in this case?

> Perftest: Large performance variability on scenario L dense (80GB)
> ------------------------------------------------------------------
>
> Key: SYSTEMML-1025
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1025
> Project: SystemML
> Issue Type: Bug
> Reporter: Matthias Boehm
> Assignee: Matthias Boehm
> Priority: Blocker
> Fix For: SystemML 0.11
>
>
> During many runs of our entire performance testsuite, we've seen quite some
> performance variability, especially for scenario L dense (80GB) where spark
> operations are the dominating factor for end-to-end performance. These issues
> showed up over all algorithms and configurations but especially for
> multinomial classification and parfor scripts.
> Let's take for example Naive Bayes over the dense 10M x 1K input with 20
> classes. Below are the results of 7 consecutive runs:
> {code}
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 67
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 362
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 484
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 64
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 310
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 91
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 68
> {code}
> After a detailed investigation, it seems that imbalance, garbage collection,
> and poor data locality are the reasons:
> * First, we generated the inputs with our Spark backend. Apparently, the rand
> operation caused imbalance due to garbage collection of some nodes. However,
> this is a very realistic scenario as we cannot always assume perfect balance.
> * Second, especially for multinomial classification and parfor scripts, the
> intermediates are not just vectors but larger matrices or there are simply
> more intermediates. This led again to more garbage collection.
> * Third, the scheduler delay of 3s for pending tasks was exceeded due to
> garbage collection, leading to remote execution which significantly slowed
> down the overall execution.
> To resolve these issues, we should make the following two changes:
> * (1) More conservative configuration of spark.locality.wait in SystemML's
> preferred spark configuration, where we did not consider this at all so far.
> * (2) Improvements of reduce-all operations which currently unnecessarily
> create intermediate pair outputs and hence unnecessary Tuple2 and
> MatrixIndexes objects.
> With a scheduler delay of 5s instead of the default 3s as well as
> improved reduce-all for mapmm, groupedagg, tsmm, tsmm2, zipmm, and uagg, we
> got the following promising results (which include spark context creation and
> initial read):
> {code}
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 52
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 45
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 44
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 44
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 51
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 50
> NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 47
> {code}
> cc [~reinwald] [~niketanpansare] [~freiss]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
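
The spread in the two result listings quoted in the issue can be quantified with a short Python sketch. The timings are copied from the issue, which does not state their unit. The `summarize` helper and the choice of coefficient of variation as the spread metric are assumptions made here for illustration; they are not something the SystemML perftest suite is known to report.

```python
from statistics import mean, pstdev

# Timings copied from the two {code} listings in the issue (unit not stated there).
before = [67, 362, 484, 64, 310, 91, 68]   # 7 runs before the changes
after = [52, 45, 44, 44, 51, 50, 47]       # 7 runs with 5s delay + improved reduce-all


def summarize(runs):
    """Return min, max, mean, and coefficient of variation for a list of timings."""
    avg = mean(runs)
    return {
        "min": min(runs),
        "max": max(runs),
        "mean": avg,
        "cv": pstdev(runs) / avg,  # population standard deviation relative to the mean
    }


for label, runs in (("before", before), ("after", after)):
    s = summarize(runs)
    print(f"{label}: min={s['min']} max={s['max']} "
          f"mean={s['mean']:.1f} cv={s['cv']:.2f}")
```

The first listing ranges from 64 to 484, while the second stays between 44 and 52, so both the mean and the run-to-run spread drop with the two changes.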