GitHub user smurching opened a pull request:

    https://github.com/apache/spark/pull/19433

    [SPARK-3162] [MLlib][WIP] Add local tree training for decision tree 
regressors

    ## What changes were proposed in this pull request?
    #### WIP, DO NOT MERGE
    
    ### Overview
    This PR adds local tree training for decision tree regressors as a first 
step toward addressing 
[SPARK-3162](https://issues.apache.org/jira/browse/SPARK-3162) (train decision 
trees locally when possible). See [this design 
doc](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit)
 for a detailed description of the proposed changes.
    
    Distributed training logic has been refactored but only minimally modified; 
the local tree training implementation leverages existing distributed training 
logic for computing impurities and splits. This shared logic has been 
refactored into `...Utils` objects (e.g. `SplitUtils.scala`, 
`ImpurityUtils.scala`). 
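
For readers unfamiliar with the shared impurity/split logic described above, here is a minimal sketch of how variance impurity and split gain can be computed from sufficient statistics for a regression tree. All names (`VarStats`, `SplitGainSketch`, `statsOf`, `gain`) are hypothetical and chosen for illustration; this is not the code in `ImpurityUtils.scala` or `SplitUtils.scala`, and it ignores details such as instance weights and minimum-gain thresholds.

```scala
// Hypothetical sketch; not the implementation in this PR.
// Sufficient statistics for variance impurity: label count, sum, and sum of squares.
final case class VarStats(count: Double, sum: Double, sumSq: Double) {
  def +(o: VarStats): VarStats = VarStats(count + o.count, sum + o.sum, sumSq + o.sumSq)
  def -(o: VarStats): VarStats = VarStats(count - o.count, sum - o.sum, sumSq - o.sumSq)

  // Variance = E[y^2] - (E[y])^2, clamped at 0 to absorb rounding error.
  def impurity: Double =
    if (count == 0.0) 0.0
    else {
      val mean = sum / count
      math.max(sumSq / count - mean * mean, 0.0)
    }
}

object SplitGainSketch {
  // Accumulate statistics over a collection of labels.
  def statsOf(labels: Seq[Double]): VarStats =
    labels.foldLeft(VarStats(0.0, 0.0, 0.0)) { (s, y) => s + VarStats(1.0, y, y * y) }

  // Gain of a candidate split: parent impurity minus the count-weighted child impurities.
  // The right child's statistics are recovered by subtraction from the parent's.
  def gain(parent: VarStats, left: VarStats): Double =
    if (parent.count == 0.0) 0.0
    else {
      val right = parent - left
      parent.impurity -
        (left.count / parent.count) * left.impurity -
        (right.count / parent.count) * right.impurity
    }
}
```

For labels 1, 1, 5, 5 the parent variance is 4; splitting into {1, 1} and {5, 5} yields two pure children, so the gain of that split is also 4. Because the statistics are additive, the same aggregation can in principle be reused whether the rows live on one machine or are combined across partitions.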
    
    ### How to Review
    
    Each commit in this PR adds non-overlapping functionality, so the PR should 
be reviewable commit-by-commit.
    
    Changes introduced by each commit:
    1. Adds new data structures for local tree training (`FeatureVector`, 
`TrainingInfo`) & associated unit tests (`LocalTreeDataSuite`)
    2. Adds shared utility methods for computing splits/impurities 
(`SplitUtils`, `ImpurityUtils`, `AggUpdateUtils`), largely copied from existing 
distributed training code in `RandomForest.scala`.
    3. Adds unit tests for split/impurity utility methods (`TreeSplitUtilsSuite`)
    4. Updates distributed training code in `RandomForest.scala` to depend on 
the utility methods introduced in commit 2.
    5. Adds local tree training logic (`LocalDecisionTree`)
    6. Adds local tree unit/integration tests (`LocalTreeUnitSuite`, 
`LocalTreeIntegrationSuite`)
    
    ## How was this patch tested?
    No existing tests were modified. The following new tests were added (also 
described above):
    * Unit tests for new data structures specific to local tree training 
(`LocalTreeDataSuite`, `LocalTreeUtilsSuite`)
    * Unit tests for impurity/split utility methods (`TreeSplitUtilsSuite`)
    * Unit tests for local tree training logic (`LocalTreeUnitSuite`)
    * Integration tests verifying that local & distributed tree training 
produce the same trees (`LocalTreeIntegrationSuite`)
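
As an illustration of what "produce the same trees" can mean in the integration tests listed above, here is a small sketch of a structural tree comparison. The node types (`SketchNode`, `SketchLeaf`, `SketchInternal`) and the `sameTree` helper are hypothetical stand-ins, not Spark's node classes or the checks used in `LocalTreeIntegrationSuite`; the tolerance on thresholds and predictions is likewise an assumption.

```scala
// Hypothetical sketch; not the comparison used in LocalTreeIntegrationSuite.
sealed trait SketchNode
final case class SketchLeaf(prediction: Double) extends SketchNode
final case class SketchInternal(
    featureIndex: Int,
    threshold: Double,
    left: SketchNode,
    right: SketchNode) extends SketchNode

object TreeEquality {
  // Two trees match if they have the same shape, split on the same features,
  // and agree on thresholds and leaf predictions up to a small tolerance.
  def sameTree(a: SketchNode, b: SketchNode, tol: Double = 1e-9): Boolean = (a, b) match {
    case (SketchLeaf(p), SketchLeaf(q)) =>
      math.abs(p - q) <= tol
    case (SketchInternal(f1, t1, l1, r1), SketchInternal(f2, t2, l2, r2)) =>
      f1 == f2 && math.abs(t1 - t2) <= tol && sameTree(l1, l2, tol) && sameTree(r1, r2, tol)
    case _ =>
      false // a leaf never matches an internal node
  }
}
```

A test in this style would train one tree locally and one with the distributed code path on the same data, then assert `TreeEquality.sameTree(localRoot, distributedRoot)`.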
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/smurching/spark pr-splitup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19433.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19433
    
----
commit 219a12001383017e70f10cd7c785272e70e64b28
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T20:55:35Z

    Add data structures for local tree training & associated tests (in 
LocalTreeDataSuite):
        * TrainingInfo: primary local tree training data structure, contains 
all information required to describe the state of the
        algorithm at any point during learning
        * FeatureVector: Stores data for an individual feature as an Array[Int]

commit 710714395c966f664af7f7b62226336675ec2ea7
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T20:57:30Z

    Add utility methods used for impurity and split calculations during both 
local & distributed training:
     * AggUpdateUtils: Helper methods for updating sufficient stats for a given 
node
     * ImpurityUtils: Helper methods for impurity-related calculations during 
node split decisions
     * SplitUtils: Helper methods for choosing splits given sufficient stats
    
    NOTE: Both ImpurityUtils and SplitUtils primarily contain code taken from 
RandomForest.scala, with slight modifications.
    Tests for SplitUtils are contained in the next commit.

commit 49bf0ae9b275264e757de573f81b816437be77e7
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T21:36:15Z

    Add test suites for utility methods used during best-split computation:
     * TreeSplitUtilsSuite: Test suite for SplitUtils
     * TreeTests: Add utility method (getMetadata) for TreeSplitUtilsSuite
    
     Also add methods used by these tests in LocalDecisionTree.scala, 
RandomForest.scala

commit bc54b165849202269b80bbac1a84afb857e87e31
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T21:48:33Z

     Update RandomForest.scala to use new utility methods for impurity/split 
calculations

commit 6a68a5cc6a6b7087163bbe5681ad41aef5e3fd0a
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T21:51:39Z

    Add local decision tree training logic

commit 9a7174ed4a62033abfe2325dc1a8c5850e07f5f3
Author: Sid Murching <sid.murch...@databricks.com>
Date:   2017-10-04T21:52:06Z

    Add local decision tree unit/integration tests

----

