GitHub user smurching opened a pull request: https://github.com/apache/spark/pull/19433
[SPARK-3162] [MLlib][WIP] Add local tree training for decision tree regressors

## What changes were proposed in this pull request?

#### WIP, DO NOT MERGE

### Overview
This PR adds local tree training for decision tree regressors as a first step for addressing [SPARK-3162](https://issues.apache.org/jira/browse/SPARK-3162) (train decision trees locally when possible). See [this design doc](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit) for a detailed description of the proposed changes.

Distributed training logic has been refactored but only minimally modified; the local tree training implementation leverages existing distributed training logic for computing impurities and splits. This shared logic has been refactored into `...Utils` objects (e.g. `SplitUtils.scala`, `ImpurityUtils.scala`).

### How to Review
Each commit in this PR adds non-overlapping functionality, so the PR should be reviewable commit-by-commit.

Changes introduced by each commit:
1. Adds new data structures for local tree training (`FeatureVector`, `TrainingInfo`) & associated unit tests (`LocalTreeDataSuite`)
2. Adds shared utility methods for computing splits/impurities (`SplitUtils`, `ImpurityUtils`, `AggUpdateUtils`), largely copied from existing distributed training code in `RandomForest.scala`.
3. Unit tests for split/impurity utility methods (`TreeSplitUtilsSuite`)
4. Updates distributed training code in `RandomForest.scala` to depend on the utility methods introduced in 2.
5. Adds local tree training logic (`LocalDecisionTree`)
6. Local tree unit/integration tests (`LocalTreeUnitSuite`, `LocalTreeIntegrationSuite`)

## How was this patch tested?
No existing tests were modified.
The following new tests were added (also described above):
* Unit tests for new data structures specific to local tree training (`LocalTreeDataSuite`, `LocalTreeUtilsSuite`)
* Unit tests for impurity/split utility methods (`TreeSplitUtilsSuite`)
* Unit tests for local tree training logic (`LocalTreeUnitSuite`)
* Integration tests verifying that local & distributed tree training produce the same trees (`LocalTreeIntegrationSuite`)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/smurching/spark pr-splitup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19433.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19433

----
commit 219a12001383017e70f10cd7c785272e70e64b28
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T20:55:35Z

    Add data structures for local tree training & associated tests (in LocalTreeDataSuite):
    * TrainingInfo: primary local tree training data structure, contains all information
      required to describe the state of the algorithm at any point during learning
    * FeatureVector: Stores data for an individual feature as an Array[Int]

commit 710714395c966f664af7f7b62226336675ec2ea7
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T20:57:30Z

    Add utility methods used for impurity and split calculations during both local & distributed training:
    * AggUpdateUtils: Helper methods for updating sufficient stats for a given node
    * ImpurityUtils: Helper methods for impurity-related calculations during node split decisions
    * SplitUtils: Helper methods for choosing splits given sufficient stats

    NOTE: Both ImpurityUtils and SplitUtils primarily contain code taken from
    RandomForest.scala, with slight modifications. Tests for SplitUtils are
    contained in the next commit.

commit 49bf0ae9b275264e757de573f81b816437be77e7
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T21:36:15Z

    Add test suites for utility methods used during best-split computation:
    * TreeSplitUtilsSuite: Test suite for SplitUtils
    * TreeTests: Add utility method (getMetadata) for TreeSplitUtilsSuite

    Also add methods used by these tests in LocalDecisionTree.scala, RandomForest.scala

commit bc54b165849202269b80bbac1a84afb857e87e31
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T21:48:33Z

    Update RandomForest.scala to use new utility methods for impurity/split calculations

commit 6a68a5cc6a6b7087163bbe5681ad41aef5e3fd0a
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T21:51:39Z

    Add local decision tree training logic

commit 9a7174ed4a62033abfe2325dc1a8c5850e07f5f3
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T21:52:06Z

    Add local decision tree unit/integration tests

----
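
For reviewers unfamiliar with the split computation that the commit messages above refer to, here is a minimal sketch of choosing a split for one feature from per-bin sufficient statistics, using variance as the regression impurity. It is an illustration only, not code from this PR: `LocalSplitSketch`, `Stats`, `binStats`, and `bestSplit` are hypothetical names, and the only detail borrowed from the description is that a feature column is held as an `Array[Int]` of binned values.

```scala
// Hypothetical sketch; none of these names or signatures come from the PR.
object LocalSplitSketch {

  /** Sufficient statistics for variance impurity: label count, sum, and sum of squares. */
  final case class Stats(count: Double, sum: Double, sumSq: Double) {
    def +(o: Stats): Stats = Stats(count + o.count, sum + o.sum, sumSq + o.sumSq)
    def -(o: Stats): Stats = Stats(count - o.count, sum - o.sum, sumSq - o.sumSq)

    /** Population variance of the labels; defined as 0 for an empty node. */
    def variance: Double =
      if (count == 0.0) 0.0
      else {
        val mean = sum / count
        math.max(0.0, sumSq / count - mean * mean)
      }
  }

  val Empty: Stats = Stats(0.0, 0.0, 0.0)

  /** Accumulates per-bin label statistics for one column of binned feature values. */
  def binStats(binned: Array[Int], labels: Array[Double], numBins: Int): Array[Stats] = {
    require(binned.length == labels.length, "expected one label per row")
    val stats = Array.fill(numBins)(Empty)
    var i = 0
    while (i < binned.length) {
      val y = labels(i)
      stats(binned(i)) = stats(binned(i)) + Stats(1.0, y, y * y)
      i += 1
    }
    stats
  }

  /**
   * Scans thresholds t for the split "bin <= t" and returns (t, gain) for the
   * largest positive variance reduction, or None if no split improves impurity.
   */
  def bestSplit(stats: Array[Stats]): Option[(Int, Double)] = {
    val total = stats.foldLeft(Empty)(_ + _)
    var left = Empty
    var best: Option[(Int, Double)] = None
    // The last bin is skipped as a threshold: it would leave the right child empty.
    for (t <- 0 until stats.length - 1) {
      left = left + stats(t)
      val right = total - left
      if (left.count > 0.0 && right.count > 0.0) {
        val weighted =
          (left.count * left.variance + right.count * right.variance) / total.count
        val gain = total.variance - weighted
        if (gain > 0.0 && best.forall(_._2 < gain)) best = Some((t, gain))
      }
    }
    best
  }
}
```

Because the right child's statistics are obtained by subtracting the running left total from the node total, each threshold is evaluated in constant time after a single pass over the rows. The same statistics-then-scan shape applies whether the rows sit in one local array or are aggregated across partitions.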