[GitHub] spark pull request #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplit...

MechCoder Tue, 28 Jun 2016 17:30:56 -0700

GitHub user MechCoder opened a pull request:

    https://github.com/apache/spark/pull/13959


    [SPARK-14351] [MLlib] [ML] Optimize findBestSplits method for decision 
trees (and random forest)

    ## What changes were proposed in this pull request?
    
    The current `findBestSplits` method creates an instance of 
`ImpurityCalculator` and `ImpurityStats` for every possible split and feature 
in the search for the bestSplit. Every instance of `ImpurityCalculator` creates 
an array of size `statsSize` which is unnecessary and take a non-negligible 
amount of time. This pull request tackles this problem by the following 
technique.
    
    1. Remove the `impurityCalculator` instantiation for every possible split 
and feature. Replace this by a `calculateGain` method for each impurity that 
computes the gain directly from the `allStats` attribute of the 
`DTStatsAggregator` which holds all the necessary information. 
    2. Replace returning an instance of `ImpurityStats` for every possible 
split and feature with just the information gain since the gain is sufficient 
to calculate the `bestSplit`. Just return an instance of `ImpurityStats` once 
for the `bestSplit` 
    3. Remove the not-so-useful `calculateImpurityStats` method. 
    
    ## How was this patch tested?
    
    Since this is a performance improvement, tests are necessary. Here are the 
improvements for a `RandomForestRegressor` with `maxDepth` set to 30, 
`subSamplingRate` set to 1 and `maxBins` set to 20 on synthetic data. The 
timings were calculated locally and the mean of 3 attempts were taken.
    
    
    | n_trees       | n_samples     | n_features  | time in master | total time 
in this branch |  
    | ------------- |:-------------:| 
-----------:|---------------:|--------------------------:|
    | 1             | 10000         | 500         | 8.954          | 7.786      
               |
    | 10            | 10000         | 500         | 9.44           | 6.825      
               |
    | 100           | 10000         | 500         | 18.457         | 16.498     
               |
    | 1             | 500           | 10000       | 8.718          | 6.783      
               |
    | 10            | 500           | 10000       | 8.579          | 6.853      
               |
    | 100           | 500           | 10000       | 17.593         | 15.905     
               |
    | 1             | 1000          | 1000        | 8.323          | 6.456      
               |
    | 10            | 1000          | 1000        | 8.841          | 6.633      
               |
    | 100           | 1000          | 1000        | 17.834         | 16.077     
               |
    | 500           | 1000          | 1000        | 64.3           | 58.94      
               | 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark again

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13959.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13959
    
----
commit 64d066b90b152ceb71b185b7e17313486974ae77
Author: MechCoder <mks...@nyu.edu>
Date:   2016-06-27T20:42:44Z

    Add calculateGain method to all Impurity objects

commit f1d8c8950f8adace6ee175cd569b20ed6468bb61
Author: MechCoder <mks...@nyu.edu>
Date:   2016-06-27T21:32:56Z

    Refactor gain calculation for categorical splits

commit 6e31e3a7b36981c8ccbf867e013363aa6f784e39
Author: MechCoder <mks...@nyu.edu>
Date:   2016-06-27T22:58:10Z

    Remove impurity calculation to outside the for loop

commit ea4a0735c14ff91ad1071fb517da3fd890080354
Author: MechCoder <mks...@nyu.edu>
Date:   2016-06-28T00:45:36Z

    Remove per feature impurityCalculator initialization

commit ca8b36088b74cacb7f162fb793070c4d3c6a1a8c
Author: MechCoder <mks...@nyu.edu>
Date:   2016-06-28T17:27:40Z

    Get rid of calculateImpurityStats

commit 67b401a6a0e59b48a167e4f3036ca9f3f6a5df1f
Author: MechCoder <mks...@nyu.edu>
Date:   2016-06-28T18:17:32Z

    where did that come from?

commit e8b89141f6cabfef5f582fe9521f4443afa9ec65
Author: MechCoder <mks...@nyu.edu>
Date:   2016-06-29T00:01:55Z

    Add documentation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplit...

Reply via email to