GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/13959
[SPARK-14351] [MLlib] [ML] Optimize findBestSplits method for decision trees (and random forest) ## What changes were proposed in this pull request? The current `findBestSplits` method creates an instance of `ImpurityCalculator` and `ImpurityStats` for every possible split and feature in the search for the bestSplit. Every instance of `ImpurityCalculator` creates an array of size `statsSize` which is unnecessary and take a non-negligible amount of time. This pull request tackles this problem by the following technique. 1. Remove the `impurityCalculator` instantiation for every possible split and feature. Replace this by a `calculateGain` method for each impurity that computes the gain directly from the `allStats` attribute of the `DTStatsAggregator` which holds all the necessary information. 2. Replace returning an instance of `ImpurityStats` for every possible split and feature with just the information gain since the gain is sufficient to calculate the `bestSplit`. Just return an instance of `ImpurityStats` once for the `bestSplit` 3. Remove the not-so-useful `calculateImpurityStats` method. ## How was this patch tested? Since this is a performance improvement, tests are necessary. Here are the improvements for a `RandomForestRegressor` with `maxDepth` set to 30, `subSamplingRate` set to 1 and `maxBins` set to 20 on synthetic data. The timings were calculated locally and the mean of 3 attempts were taken. | n_trees | n_samples | n_features | time in master | total time in this branch | | ------------- |:-------------:| -----------:|---------------:|--------------------------:| | 1 | 10000 | 500 | 8.954 | 7.786 | | 10 | 10000 | 500 | 9.44 | 6.825 | | 100 | 10000 | 500 | 18.457 | 16.498 | | 1 | 500 | 10000 | 8.718 | 6.783 | | 10 | 500 | 10000 | 8.579 | 6.853 | | 100 | 500 | 10000 | 17.593 | 15.905 | | 1 | 1000 | 1000 | 8.323 | 6.456 | | 10 | 1000 | 1000 | 8.841 | 6.633 | | 100 | 1000 | 1000 | 17.834 | 16.077 | | 500 | 1000 | 1000 | 64.3 | 58.94 | You can merge this pull request into a Git repository by running: $ git pull https://github.com/MechCoder/spark again Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13959.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13959 ---- commit 64d066b90b152ceb71b185b7e17313486974ae77 Author: MechCoder <mks...@nyu.edu> Date: 2016-06-27T20:42:44Z Add calculateGain method to all Impurity objects commit f1d8c8950f8adace6ee175cd569b20ed6468bb61 Author: MechCoder <mks...@nyu.edu> Date: 2016-06-27T21:32:56Z Refactor gain calculation for categorical splits commit 6e31e3a7b36981c8ccbf867e013363aa6f784e39 Author: MechCoder <mks...@nyu.edu> Date: 2016-06-27T22:58:10Z Remove impurity calculation to outside the for loop commit ea4a0735c14ff91ad1071fb517da3fd890080354 Author: MechCoder <mks...@nyu.edu> Date: 2016-06-28T00:45:36Z Remove per feature impurityCalculator initialization commit ca8b36088b74cacb7f162fb793070c4d3c6a1a8c Author: MechCoder <mks...@nyu.edu> Date: 2016-06-28T17:27:40Z Get rid of calculateImpurityStats commit 67b401a6a0e59b48a167e4f3036ca9f3f6a5df1f Author: MechCoder <mks...@nyu.edu> Date: 2016-06-28T18:17:32Z where did that come from? commit e8b89141f6cabfef5f582fe9521f4443afa9ec65 Author: MechCoder <mks...@nyu.edu> Date: 2016-06-29T00:01:55Z Add documentation ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org