GitHub user MechCoder opened a pull request:
https://github.com/apache/spark/pull/13959
[SPARK-14351] [MLlib] [ML] Optimize findBestSplits method for decision
trees (and random forest)
## What changes were proposed in this pull request?
The current `findBestSplits` method creates an instance of
`ImpurityCalculator` and `ImpurityStats` for every possible split and feature
in the search for the bestSplit. Every instance of `ImpurityCalculator` creates
an array of size `statsSize` which is unnecessary and take a non-negligible
amount of time. This pull request tackles this problem by the following
technique.
1. Remove the `impurityCalculator` instantiation for every possible split
and feature. Replace this by a `calculateGain` method for each impurity that
computes the gain directly from the `allStats` attribute of the
`DTStatsAggregator` which holds all the necessary information.
2. Replace returning an instance of `ImpurityStats` for every possible
split and feature with just the information gain since the gain is sufficient
to calculate the `bestSplit`. Just return an instance of `ImpurityStats` once
for the `bestSplit`
3. Remove the not-so-useful `calculateImpurityStats` method.
## How was this patch tested?
Since this is a performance improvement, tests are necessary. Here are the
improvements for a `RandomForestRegressor` with `maxDepth` set to 30,
`subSamplingRate` set to 1 and `maxBins` set to 20 on synthetic data. The
timings were calculated locally and the mean of 3 attempts were taken.
| n_trees | n_samples | n_features | time in master | total time
in this branch |
| - |:-:|
---:|---:|--:|
| 1 | 1 | 500 | 8.954 | 7.786
|
| 10| 1 | 500 | 9.44 | 6.825
|
| 100 | 1 | 500 | 18.457 | 16.498
|
| 1 | 500 | 1 | 8.718 | 6.783
|
| 10| 500 | 1 | 8.579 | 6.853
|
| 100 | 500 | 1 | 17.593 | 15.905
|
| 1 | 1000 | 1000| 8.323 | 6.456
|
| 10| 1000 | 1000| 8.841 | 6.633
|
| 100 | 1000 | 1000| 17.834 | 16.077
|
| 500 | 1000 | 1000| 64.3 | 58.94
|
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MechCoder/spark again
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13959.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13959
commit 64d066b90b152ceb71b185b7e17313486974ae77
Author: MechCoder
Date: 2016-06-27T20:42:44Z
Add calculateGain method to all Impurity objects
commit f1d8c8950f8adace6ee175cd569b20ed6468bb61
Author: MechCoder
Date: 2016-06-27T21:32:56Z
Refactor gain calculation for categorical splits
commit 6e31e3a7b36981c8ccbf867e013363aa6f784e39
Author: MechCoder
Date: 2016-06-27T22:58:10Z
Remove impurity calculation to outside the for loop
commit ea4a0735c14ff91ad1071fb517da3fd890080354
Author: MechCoder
Date: 2016-06-28T00:45:36Z
Remove per feature impurityCalculator initialization
commit ca8b36088b74cacb7f162fb793070c4d3c6a1a8c
Author: MechCoder
Date: 2016-06-28T17:27:40Z
Get rid of calculateImpurityStats
commit 67b401a6a0e59b48a167e4f3036ca9f3f6a5df1f
Author: MechCoder
Date: 2016-06-28T18:17:32Z
where did that come from?
commit e8b89141f6cabfef5f582fe9521f4443afa9ec65
Author: MechCoder
Date: 2016-06-29T00:01:55Z
Add documentation
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org