GitHub user sethah opened a pull request: https://github.com/apache/spark/pull/15593
[SPARK-18060][ML] Avoid unnecessary computation for MLOR

## What changes were proposed in this pull request?

Before this patch, the gradient updates for multinomial logistic regression were computed by an outer loop over the number of classes and an inner loop over the number of features. Inside the inner loop, we standardized the feature value (`value / featuresStd(index)`), which means we performed that computation `numFeatures * numClasses` times, even though it only needs to be performed `numFeatures` times. Swapping the inner and outer loops would avoid the repeated work, but we would then lose sequential memory access.

In this patch, we instead lay out the coefficients in column-major order while we train, so that we can avoid the extra computation and retain sequential memory access. We convert back to row-major order when we create the model, since the vector-matrix multiply required by predict will access the coefficients in row-major order.

## How was this patch tested?

This is an implementation detail only, so the original behavior should be maintained. All tests pass. I ran some performance tests to verify speedups. The results are below and show significant speedups.
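To make the loop reordering described under "What changes were proposed" concrete, here is a minimal sketch in plain Python. It is not the Spark code (which is Scala); the function names, the `multipliers` argument (standing in for the per-class gradient multipliers), and the `divisions` counter are all hypothetical, and the sketch ignores sparse vectors, intercepts, instance weights, and the loss computation. It only shows how the coefficient layout changes the number of standardization divisions while keeping the inner loop's memory access sequential.

```python
def grad_row_major(features, features_std, multipliers):
    """Outer loop over classes, inner loop over features (pre-patch ordering).

    Gradient entry for class k, feature j lives at index k * num_features + j.
    The standardization division runs once per (class, feature) pair.
    """
    num_features, num_classes = len(features), len(multipliers)
    grad = [0.0] * (num_classes * num_features)
    divisions = 0  # hypothetical counter, only here to show the saved work
    for k in range(num_classes):
        for j in range(num_features):
            if features_std[j] != 0.0:
                std_value = features[j] / features_std[j]
                divisions += 1
                grad[k * num_features + j] += multipliers[k] * std_value
    return grad, divisions


def grad_col_major(features, features_std, multipliers):
    """Outer loop over features, inner loop over classes (column-major layout).

    Gradient entry for class k, feature j lives at index j * num_classes + k,
    so the inner loop still walks memory sequentially, and the division runs
    once per feature.
    """
    num_features, num_classes = len(features), len(multipliers)
    grad = [0.0] * (num_classes * num_features)
    divisions = 0
    for j in range(num_features):
        if features_std[j] == 0.0:
            continue
        std_value = features[j] / features_std[j]
        divisions += 1
        for k in range(num_classes):
            grad[j * num_classes + k] += multipliers[k] * std_value
    return grad, divisions


def col_to_row_major(values, num_classes, num_features):
    """Reorder a column-major array into row-major order (class-major)."""
    return [values[j * num_classes + k]
            for k in range(num_classes)
            for j in range(num_features)]
```

With 3 features and 2 classes, the first function performs 6 divisions and the second performs 3, and `col_to_row_major` applied to the second result reproduces the first. The same reordering is what the final conversion back to row-major order amounts to when the model is created.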
## Performance Tests

**Setup**

- 3-node bare-metal cluster
- 120 cores total
- 384 GB RAM total

**Results**

|    | numPoints | numFeatures | numClasses | regParam | elasticNetParam | currentMasterTime (sec) | thisPatchTime (sec) | pctSpeedup |
|----|-----------|-------------|------------|----------|-----------------|-------------------------|---------------------|------------|
| 0  | 1e+07     | 100         | 500        | 0.5      | 0               | 90                      | 18                  | 80         |
| 1  | 1e+08     | 100         | 50         | 0.5      | 0               | 90                      | 19                  | 78         |
| 2  | 1e+08     | 100         | 50         | 0.05     | 1               | 72                      | 19                  | 73         |
| 3  | 1e+06     | 100         | 5000       | 0.5      | 0               | 93                      | 53                  | 43         |
| 4  | 1e+07     | 100         | 5000       | 0.5      | 0               | 900                     | 390                 | 56         |
| 5  | 1e+08     | 100         | 500        | 0.5      | 0               | 840                     | 174                 | 79         |
| 6  | 1e+08     | 100         | 200        | 0.5      | 0               | 360                     | 72                  | 80         |

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark MLOR_PERF_COL_MAJOR_COEF

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15593.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15593

----
commit 4c19abebe0b78bcd26fc142ef6787517e1e4482d
Author: sethah <seth.hendrickso...@gmail.com>
Date:   2016-10-21T17:19:50Z

    tests pass except initial model

commit fcab96a3d608ca49d8a8963f79a277163d87ddce
Author: sethah <seth.hendrickso...@gmail.com>
Date:   2016-10-21T19:49:39Z

    initialModel passes

commit 07fd1504136ad7b1ce37f443e26f407b07345991
Author: sethah <seth.hendrickso...@gmail.com>
Date:   2016-10-21T23:01:27Z

    clean up and refactoring exp op in log agg

----