GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/2435
[SPARK-1545] [mllib] Add Random Forests

This PR adds RandomForest to MLlib. The implementation is basic, and future performance optimizations will be important. (Note: RFs = Random Forests.)

# Overview

## RandomForest
* trains multiple trees at once to reduce the number of passes over the data
* allows feature subsets at each node
* uses a queue of nodes instead of fixed groups for each level

This implementation is based on an implementation by @manishamde and the [Alpine Labs Sequoia Forest](https://github.com/AlpineNow/SparkML2) by @codedeft (in particular, the TreePoint, BaggedPoint, and node queue implementations). Thank you for your input!

## Testing

This has been tested for correctness with the test suites and with DecisionTreeRunner on example datasets. It has been performance tested using [this branch of spark-perf](https://github.com/jkbradley/spark-perf/tree/rfs). For training 1 tree, there are small regressions, especially from feature subsampling.

Detailed results are below. These were run on an EC2 cluster with 15 workers, training 1 tree with maxDepth = 5 (= 6 levels). The 2 result columns marked with (numTrees) are results after implementing RFs to train multiple trees at once, using a node queue. The 2 columns marked with (feature subsets) are follow-up results after adding feature subsampling. Speedup values < 1 indicate slowdowns relative to the old DecisionTree implementation.
numInstances | numFeatures | runtime (sec) (numTrees) | speedup (numTrees) | runtime (sec) (feature subsets) | speedup (feature subsets)
---- | ---- | ---- | ---- | ---- | ----
20000 | 100 | 4.051 | 1.044433473 | 4.478 | 0.9448414471
20000 | 500 | 8.472 | 1.104461756 | 9.315 | 1.004508857
20000 | 1500 | 19.354 | 1.05854087 | 20.863 | 0.9819776638
20000 | 3500 | 43.674 | 1.072033704 | 45.887 | 1.020332556
200000 | 100 | 4.196 | 1.171830315 | 4.848 | 1.014232673
200000 | 500 | 8.926 | 1.082791844 | 9.771 | 0.989151571
200000 | 1500 | 20.58 | 1.068415938 | 22.134 | 0.9934038131
200000 | 3500 | 48.043 | 1.075203464 | 52.249 | 0.9886505005
2000000 | 100 | 4.944 | 1.01355178 | 5.796 | 0.8645617667
2000000 | 500 | 11.11 | 1.016831683 | 12.482 | 0.9050632911
2000000 | 1500 | 31.144 | 1.017852556 | 35.274 | 0.8986789136
2000000 | 3500 | 79.981 | 1.085382778 | 101.105 | 0.8586123337
20000000 | 100 | 8.304 | 0.9270231214 | 9.073 | 0.8484514494
20000000 | 500 | 28.174 | 1.083268262 | 34.236 | 0.8914592826
20000000 | 1500 | 143.97 | 0.9579634646 | 159.275 | 0.8659111599

# Details on specific classes

## Changes to DecisionTree
* Main train() method is now in RandomForest.
* findBestSplits() is no longer needed. (It split levels into groups, but we now use a queue of nodes.)
* Many small changes to support RFs. (Note: These methods should be moved to RandomForest.scala in a later PR, but are in DecisionTree.scala to make code comparison easier.)

## RandomForest
* Main train() method is from the old DecisionTree.
* selectNodesToSplit: Note that it selects nodes and feature subsets jointly in order to track memory usage.

## RandomForestModel
* Stores an Array[DecisionTreeModel].
* Prediction: For classification, the most common label; for regression, the mean. We could support other methods later.

## examples/.../DecisionTreeRunner
* This now takes numTrees and featureSubsetStrategy, to support RFs.
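The RandomForestModel prediction rules described above (most common label for classification, mean for regression) can be sketched as follows. This is an illustrative standalone sketch, not the actual MLlib code; the function names are hypothetical, and in MLlib the per-tree predictions would come from the stored Array[DecisionTreeModel].

```scala
// Hypothetical sketch of combining per-tree predictions, matching the
// prediction rules described above. Names are illustrative, not MLlib API.

// Classification: majority vote over the trees' predicted labels.
def predictByVote(treePredictions: Seq[Double]): Double =
  treePredictions.groupBy(identity).maxBy(_._2.size)._1

// Regression: mean of the trees' predicted values.
def predictByMean(treePredictions: Seq[Double]): Double =
  treePredictions.sum / treePredictions.size
```

Other combination methods (e.g. averaging class probabilities rather than hard votes) could slot in behind the same interface, which is presumably what "We could support other methods later" refers to.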
## DTStatsAggregator
* There are 2 types of functionality (with and without subsampling features), and they require different indexing methods. (We could treat both as subsampling, but this is less efficient.)
* DTStatsAggregator is now abstract, and 2 child classes implement these 2 types of functionality.

## impurities
* These now take instance weights.

## Node
* Some vals changed to vars.
* This is unfortunately a public API change (DeveloperApi). It could be avoided by creating a LearningNode struct, but that would be awkward.

## RandomForestSuite
Please let me know if there are missing tests!

## BaggedPoint
This wraps TreePoint and holds bootstrap weights/counts.

# Design decisions

* BaggedPoint: BaggedPoint is separate from TreePoint since it may be useful for other bagging algorithms later on.
* RandomForest public API: What options should be easily supported by the train* methods? Should ALL options be in the Java-friendly constructors? Should there be a constructor taking Strategy?
* Feature subsampling options: What options should be supported? scikit-learn supports the same options, except for "onethird." One option would be to allow users to specify fractions ("0.1"): the current options could be supported, and any unrecognized values would be parsed as Doubles in [0,1].
* Splits and bins are computed before bootstrapping, so all trees use the same discretization.
* One queue, instead of one queue per tree.

CC: @mengxr @manishamde @codedeft @chouqin Please let me know if you have suggestions---thanks!
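To illustrate the BaggedPoint idea (wrapping a data point with per-tree bootstrap weights/counts), here is a minimal standalone sketch. It approximates bootstrap sampling with independent Poisson(1) counts per tree, a common trick for bagging over distributed data since each point can be processed without global coordination. The names and the Poisson approximation are assumptions of this sketch, not necessarily what this PR implements.

```scala
import scala.util.Random

// Sketch: each data point carries, for every tree in the forest, a count of
// how many times it appears in that tree's bootstrap sample. (Hypothetical
// shape, loosely modeled on the BaggedPoint wrapper described above.)
case class BaggedPoint[T](datum: T, subsampleCounts: Array[Int])

// Draw from Poisson(lambda) using Knuth's method (fine for small lambda).
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0
  var p = 1.0
  while (p > limit) { k += 1; p *= rng.nextDouble() }
  k - 1
}

// Wrap one data point with independent Poisson(1) counts, one per tree,
// approximating sampling with replacement from the full dataset.
def bag[T](datum: T, numTrees: Int, rng: Random): BaggedPoint[T] =
  BaggedPoint(datum, Array.fill(numTrees)(poisson(1.0, rng)))
```

Because the counts are drawn per point and per tree, one pass over the data can accumulate weighted sufficient statistics for all trees at once, which is what makes training multiple trees per pass cheap.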
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark rfs-new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2435.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2435

----

commit ac4237808090237fe4c562da8c88c55c330d451f
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T03:17:58Z

    add min info gain and min instances per node parameters in decision tree

commit ff34845c8e43f5b9755dd1fdf428be8b2284c68b
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T04:29:12Z

    separate calculation of predict of node from calculation of info gain

commit 987cbf4b177f29e232bf2ba2ca595ea7015694da
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T04:30:01Z

    fix bug

commit f195e830a94097e5d6d42f22c67c32ca8900d848
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T06:04:20Z

    fix style

commit 845c6fa58c00bfba426e56e71eb46a6f8c3f5985
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T06:05:37Z

    fix style

commit e72c7e4d0ad015fdf25ea2959bdbf524056e38ca
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T06:52:24Z

    add comments

commit 46b891fd7f30b9f2d439134931b35dab387fe2b1
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T08:09:34Z

    fix bug

commit cadd569cf64d6eb7b9c9979a5066a2f63f15fed9
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T08:48:51Z

    add api docs

commit bb465cabc804ca53ef5005f6793b58aa2e4a5274
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T09:09:14Z

    Merge branch 'master' of https://github.com/apache/spark into dt-preprune

    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala

commit 6728fad304511030611c61592b1a590214e7f434
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T09:16:27Z

    minor fix: remove empty lines

commit 10b801269864cda2c00159518688942b1985061b
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T10:10:24Z

    fix style

commit efcc7369f7f52de2810446c6fb976ab1743a63cf
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T12:33:37Z

    fix bug

commit 2ab763b2ca1bbc8977777ab898b28965dce5a8a3
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-09T17:42:46Z

    Simplifications to DecisionTree code:
    * No longer pre-allocate parentImpurities array in main train() method;
      parentImpurities values are now stored in individual nodes (in Node.stats.impurity).
    * No longer using Node.build since tree structure is constructed on-the-fly.
      Did not eliminate it since it is public (Developer) API.
    Also: Updated DecisionTreeSuite test "Second level node building with vs. without groups"
    * generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.

commit d593ec70d70b633b72e260c38e89d87ab14fcd69
Author: chouqin <liqiping1...@gmail.com>
Date: 2014-09-09T23:57:27Z

    fix docs and change minInstancesPerNode to 1

commit 0278a1198017aae578be3109a8311abc1f9a8e14
Author: chouqin <liqiping1...@gmail.com>
Date: 2014-09-10T02:31:57Z

    remove `noSplit` and set `Predict` private to tree

commit 1a8f0add470e4ed53100ce6cf344e24448a0ba42
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T02:34:55Z

    Eliminated pre-allocated nodes array in main train() method.
    * Nodes are constructed and added to the tree structure as needed during training.
    Moved tree construction from main train() method into findBestSplitsPerGroup()
    since there is no need to keep the (split, gain) array for an entire level of
    nodes. Only one element of that array is needed at a time, so we do not need
    the array.
    findBestSplits() now returns 2 items:
    * rootNode (newly created root node on first iteration, same root node on later iterations)
    * doneTraining (indicating if all nodes at that level were leaves)
    Also:
    * Added Node.deepCopy (private[tree]), used for test suite
    * Updated test suite (same functionality)

commit d4dbb99a50418e0168d85db457458d8d96edc242
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T02:35:06Z

    Merge remote-tracking branch 'upstream/master' into dt-spark-3160

commit d4d786407a9bb5fce14dd7999097b21d6fa1cf5e
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T02:45:30Z

    Marked Node.build as deprecated

commit eaa1dcf6a46501779ae58c746e672583d10ff6c8
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T02:58:27Z

    Added topNode doc in DecisionTree and scalastyle fix

commit 306120fc93021f3d2d86333c77296fe3d36b76e1
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T03:09:02Z

    Fixed typo in DecisionTreeModel.scala doc

commit c6e2dfcc62aaa0d26bff90fb34f5b81526ce71c8
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T04:51:35Z

    Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala
    and to Python API in tree.py

commit 39f9b60907050b4e1c78f7413282df13b7e6552c
Author: chouqin <liqiping1...@gmail.com>
Date: 2014-09-10T14:15:46Z

    change edge `minInstancesPerNode` to 2 and add one more test

commit c7ebaf1721ba414ed1539bfc4721c3bbfd70b77a
Author: chouqin <liqiping1...@gmail.com>
Date: 2014-09-10T14:27:08Z

    fix typo

commit f1d11d15fe519f9ef9d4e1158b309dc6af38864e
Author: chouqin <liqiping1...@gmail.com>
Date: 2014-09-10T14:30:22Z

    fix typo

commit 19b01af035719b7e9b67bc85611b4f04b790797a
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T15:52:14Z

    Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune

commit e2628b605459badb64b8d63059a2821dfff4bd4c
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T23:13:03Z

    Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune

commit 95c479d5a60b166d9c75b9a81cee82e808f23aa0
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T23:52:05Z

    * Fixed typo in tree suite test "do not choose split that does not satisfy
      min instance per node requirements"
    * small style fixes

commit 0dd4d874705643a9d82b9a2a4246a75ba9a7dae9
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-11T01:18:18Z

    Merge remote-tracking branch 'upstream/master' into dt-spark-3160

    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
        mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
        mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

commit 5c4ac3303fcf94bb5cbbc272013a88ff8c4e7749
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-11T01:26:19Z

    Added check in Strategy to make sure minInstancesPerNode >= 1

commit eddd1eb60e1b63079af2883ad5854fdc22ff07ef
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-11T17:54:38Z

    Merge remote-tracking branch 'upstream/master' into rfs-new

----