GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/2435
[SPARK-1545] [mllib] Add Random Forests

This PR adds RandomForest to MLlib. The implementation is basic, and future performance optimizations will be important. (Note: RFs = Random Forests.)

# Overview

## RandomForest
* trains multiple trees at once to reduce the number of passes over the data
* allows feature subsets at each node
* uses a queue of nodes instead of fixed groups for each level

This implementation is based on an implementation by @manishamde and the [Alpine Labs Sequoia Forest](https://github.com/AlpineNow/SparkML2) by @codedeft (in particular, the TreePoint, BaggedPoint, and node queue implementations). Thank you for your input!

## Testing

This has been tested for correctness with the test suites and with DecisionTreeRunner on example datasets. It has been performance tested using [this branch of spark-perf](https://github.com/jkbradley/spark-perf/tree/rfs). For training 1 tree, there are small regressions, especially from feature subsampling.

Detailed results are below. These were run on an EC2 cluster with 15 workers, training 1 tree with maxDepth = 5 (= 6 levels). The 2 result columns marked with (numTrees) are results after implementing RFs to train multiple trees at once, using a node queue. The 2 columns marked with (feature subsets) are follow-up results after adding feature subsampling. Speedup values < 1 indicate slowdowns relative to the old DecisionTree implementation.
numInstances | numFeatures | runtime (sec) (numTrees) | speedup (numTrees) | runtime (sec) (feature subsets) | speedup (feature subsets)
---- | ---- | ---- | ---- | ---- | ----
20000 | 100 | 4.051 | 1.044433473 | 4.478 | 0.9448414471
20000 | 500 | 8.472 | 1.104461756 | 9.315 | 1.004508857
20000 | 1500 | 19.354 | 1.05854087 | 20.863 | 0.9819776638
20000 | 3500 | 43.674 | 1.072033704 | 45.887 | 1.020332556
200000 | 100 | 4.196 | 1.171830315 | 4.848 | 1.014232673
200000 | 500 | 8.926 | 1.082791844 | 9.771 | 0.989151571
200000 | 1500 | 20.58 | 1.068415938 | 22.134 | 0.9934038131
200000 | 3500 | 48.043 | 1.075203464 | 52.249 | 0.9886505005
2000000 | 100 | 4.944 | 1.01355178 | 5.796 | 0.8645617667
2000000 | 500 | 11.11 | 1.016831683 | 12.482 | 0.9050632911
2000000 | 1500 | 31.144 | 1.017852556 | 35.274 | 0.8986789136
2000000 | 3500 | 79.981 | 1.085382778 | 101.105 | 0.8586123337
20000000 | 100 | 8.304 | 0.9270231214 | 9.073 | 0.8484514494
20000000 | 500 | 28.174 | 1.083268262 | 34.236 | 0.8914592826
20000000 | 1500 | 143.97 | 0.9579634646 | 159.275 | 0.8659111599

# Details on specific classes

## Changes to DecisionTree
* Main train() method is now in RandomForest.
* findBestSplits() is no longer needed. (It split levels into groups, but we now use a queue of nodes.)
* Many small changes to support RFs. (Note: These methods should be moved to RandomForest.scala in a later PR, but are in DecisionTree.scala to make code comparison easier.)

## RandomForest
* Main train() method is from the old DecisionTree.
* selectNodesToSplit: Note that it selects nodes and feature subsets jointly in order to track memory usage.

## RandomForestModel
* Stores an Array[DecisionTreeModel].
* Prediction: For classification, the most common label; for regression, the mean. We could support other methods later.

## examples/.../DecisionTreeRunner
* This now takes numTrees and featureSubsetStrategy, to support RFs.
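The RandomForestModel prediction rules described above (most common label for classification, mean for regression) can be sketched as follows. This is an illustrative standalone sketch, not the actual MLlib code; the function names are hypothetical, and in MLlib the per-tree predictions would come from the stored Array[DecisionTreeModel].

```scala
// Hypothetical sketch of combining per-tree predictions, matching the
// prediction rules described above. Names are illustrative, not MLlib API.

// Classification: majority vote over the trees' predicted labels.
def predictByVote(treePredictions: Seq[Double]): Double =
  treePredictions.groupBy(identity).maxBy(_._2.size)._1

// Regression: mean of the trees' predicted values.
def predictByMean(treePredictions: Seq[Double]): Double =
  treePredictions.sum / treePredictions.size
```

Other combination methods (e.g. averaging class probabilities rather than hard votes) could slot in behind the same interface, which is presumably what "We could support other methods later" refers to.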
## DTStatsAggregator
* There are 2 types of functionality (with and without subsampling features), and they require different indexing methods. (We could treat both as subsampling, but this is less efficient.)
* DTStatsAggregator is now abstract, and 2 child classes implement these 2 types of functionality.

## impurities
* These now take instance weights.

## Node
* Some vals changed to vars.
* This is unfortunately a public API change (DeveloperApi). It could be avoided by creating a LearningNode struct, but that would be awkward.

## RandomForestSuite
Please let me know if there are missing tests!

## BaggedPoint
This wraps TreePoint and holds bootstrap weights/counts.

# Design decisions

* BaggedPoint: BaggedPoint is separate from TreePoint since it may be useful for other bagging algorithms later on.
* RandomForest public API: What options should be easily supported by the train* methods? Should ALL options be in the Java-friendly constructors? Should there be a constructor taking Strategy?
* Feature subsampling options: What options should be supported? scikit-learn supports the same options, except for "onethird." One option would be to allow users to specify fractions ("0.1"): the current options could be supported, and any unrecognized values would be parsed as Doubles in [0,1].
* Splits and bins are computed before bootstrapping, so all trees use the same discretization.
* One queue, instead of one queue per tree.

CC: @mengxr @manishamde @codedeft @chouqin Please let me know if you have suggestions---thanks!
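To illustrate the BaggedPoint idea (wrapping a data point with per-tree bootstrap weights/counts), here is a minimal standalone sketch. It approximates bootstrap sampling with independent Poisson(1) counts per tree, a common trick for bagging over distributed data since each point can be processed without global coordination. The names and the Poisson approximation are assumptions of this sketch, not necessarily what this PR implements.

```scala
import scala.util.Random

// Sketch: each data point carries, for every tree in the forest, a count of
// how many times it appears in that tree's bootstrap sample. (Hypothetical
// shape, loosely modeled on the BaggedPoint wrapper described above.)
case class BaggedPoint[T](datum: T, subsampleCounts: Array[Int])

// Draw from Poisson(lambda) using Knuth's method (fine for small lambda).
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0
  var p = 1.0
  while (p > limit) { k += 1; p *= rng.nextDouble() }
  k - 1
}

// Wrap one data point with independent Poisson(1) counts, one per tree,
// approximating sampling with replacement from the full dataset.
def bag[T](datum: T, numTrees: Int, rng: Random): BaggedPoint[T] =
  BaggedPoint(datum, Array.fill(numTrees)(poisson(1.0, rng)))
```

Because the counts are drawn per point and per tree, one pass over the data can accumulate weighted sufficient statistics for all trees at once, which is what makes training multiple trees per pass cheap.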
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark rfs-new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2435.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2435

----

commit ac4237808090237fe4c562da8c88c55c330d451f
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T03:17:58Z

    add min info gain and min instances per node parameters in decision tree

commit ff34845c8e43f5b9755dd1fdf428be8b2284c68b
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T04:29:12Z

    separate calculation of predict of node from calculation of info gain

commit 987cbf4b177f29e232bf2ba2ca595ea7015694da
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T04:30:01Z

    fix bug

commit f195e830a94097e5d6d42f22c67c32ca8900d848
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T06:04:20Z

    fix style

commit 845c6fa58c00bfba426e56e71eb46a6f8c3f5985
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T06:05:37Z

    fix style

commit e72c7e4d0ad015fdf25ea2959bdbf524056e38ca
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T06:52:24Z

    add comments

commit 46b891fd7f30b9f2d439134931b35dab387fe2b1
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T08:09:34Z

    fix bug

commit cadd569cf64d6eb7b9c9979a5066a2f63f15fed9
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T08:48:51Z

    add api docs

commit bb465cabc804ca53ef5005f6793b58aa2e4a5274
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T09:09:14Z

    Merge branch 'master' of https://github.com/apache/spark into dt-preprune

    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala

commit 6728fad304511030611c61592b1a590214e7f434
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T09:16:27Z

    minor fix: remove empty lines

commit 10b801269864cda2c00159518688942b1985061b
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T10:10:24Z

    fix style

commit efcc7369f7f52de2810446c6fb976ab1743a63cf
Author: qiping.lqp <qiping....@alibaba-inc.com>
Date: 2014-09-09T12:33:37Z

    fix bug

commit 2ab763b2ca1bbc8977777ab898b28965dce5a8a3
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-09T17:42:46Z

    Simplifications to DecisionTree code:
    * No longer pre-allocate parentImpurities array in main train() method;
      parentImpurities values are now stored in individual nodes (in Node.stats.impurity).
    * No longer using Node.build since tree structure is constructed on-the-fly.
      Did not eliminate it since it is public (Developer) API.
    Also: Updated DecisionTreeSuite test "Second level node building with vs. without groups"
    * generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.

commit d593ec70d70b633b72e260c38e89d87ab14fcd69
Author: chouqin <liqiping1...@gmail.com>
Date: 2014-09-09T23:57:27Z

    fix docs and change minInstancesPerNode to 1

commit 0278a1198017aae578be3109a8311abc1f9a8e14
Author: chouqin <liqiping1...@gmail.com>
Date: 2014-09-10T02:31:57Z

    remove `noSplit` and set `Predict` private to tree

commit 1a8f0add470e4ed53100ce6cf344e24448a0ba42
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T02:34:55Z

    Eliminated pre-allocated nodes array in main train() method.
    * Nodes are constructed and added to the tree structure as needed during training.
    Moved tree construction from main train() method into findBestSplitsPerGroup()
    since there is no need to keep the (split, gain) array for an entire level of
    nodes. Only one element of that array is needed at a time, so we do not need
    the array.
    findBestSplits() now returns 2 items:
    * rootNode (newly created root node on first iteration, same root node on later iterations)
    * doneTraining (indicating if all nodes at that level were leaves)
    Also:
    * Added Node.deepCopy (private[tree]), used for test suite
    * Updated test suite (same functionality)

commit d4dbb99a50418e0168d85db457458d8d96edc242
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T02:35:06Z

    Merge remote-tracking branch 'upstream/master' into dt-spark-3160

commit d4d786407a9bb5fce14dd7999097b21d6fa1cf5e
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T02:45:30Z

    Marked Node.build as deprecated

commit eaa1dcf6a46501779ae58c746e672583d10ff6c8
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T02:58:27Z

    Added topNode doc in DecisionTree and scalastyle fix

commit 306120fc93021f3d2d86333c77296fe3d36b76e1
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T03:09:02Z

    Fixed typo in DecisionTreeModel.scala doc

commit c6e2dfcc62aaa0d26bff90fb34f5b81526ce71c8
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T04:51:35Z

    Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala
    and to Python API in tree.py

commit 39f9b60907050b4e1c78f7413282df13b7e6552c
Author: chouqin <liqiping1...@gmail.com>
Date: 2014-09-10T14:15:46Z

    change edge `minInstancesPerNode` to 2 and add one more test

commit c7ebaf1721ba414ed1539bfc4721c3bbfd70b77a
Author: chouqin <liqiping1...@gmail.com>
Date: 2014-09-10T14:27:08Z

    fix typo

commit f1d11d15fe519f9ef9d4e1158b309dc6af38864e
Author: chouqin <liqiping1...@gmail.com>
Date: 2014-09-10T14:30:22Z

    fix typo

commit 19b01af035719b7e9b67bc85611b4f04b790797a
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T15:52:14Z

    Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune

commit e2628b605459badb64b8d63059a2821dfff4bd4c
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T23:13:03Z

    Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune

commit 95c479d5a60b166d9c75b9a81cee82e808f23aa0
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-10T23:52:05Z

    * Fixed typo in tree suite test "do not choose split that does not satisfy
      min instance per node requirements"
    * small style fixes

commit 0dd4d874705643a9d82b9a2a4246a75ba9a7dae9
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-11T01:18:18Z

    Merge remote-tracking branch 'upstream/master' into dt-spark-3160

    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
        mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
        mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

commit 5c4ac3303fcf94bb5cbbc272013a88ff8c4e7749
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-11T01:26:19Z

    Added check in Strategy to make sure minInstancesPerNode >= 1

commit eddd1eb60e1b63079af2883ad5854fdc22ff07ef
Author: Joseph K. Bradley <joseph.kurata.brad...@gmail.com>
Date: 2014-09-11T17:54:38Z

    Merge remote-tracking branch 'upstream/master' into rfs-new

----