Github user vlad17 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14547#discussion_r78482842

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala ---
@@ -38,25 +38,35 @@ import org.apache.spark.sql.{DataFrame, Dataset}
 import org.apache.spark.sql.functions._
 
 /**
- * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]]
+ * Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting)
  * learning algorithm for regression.
  * It supports both continuous and categorical features.
  *
- * The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.
+ * The implementation offers both Stochastic Gradient Boosting, as in J.H. Friedman (1999),
+ * "Stochastic Gradient Boosting", and TreeBoost, as in Friedman (1999),
+ * "Greedy Function Approximation: A Gradient Boosting Machine".
  *
- * Notes on Gradient Boosting vs. TreeBoost:
- * - This implementation is for Stochastic Gradient Boosting, not for TreeBoost.
+ * Notes on Stochastic Gradient Boosting (SGB) vs. TreeBoost:
+ * - TreeBoost algorithms are a subset of SGB algorithms.
  * - Both algorithms learn tree ensembles by minimizing loss functions.
- * - TreeBoost (Friedman, 1999) additionally modifies the outputs at tree leaf nodes
- *   based on the loss function, whereas the original gradient boosting method does not.
- * - When the loss is SquaredError, these methods give the same result, but they could differ
- *   for other loss functions.
- * - We expect to implement TreeBoost in the future:
- *   [https://issues.apache.org/jira/browse/SPARK-4240]
+ * - TreeBoost has two additional properties that general SGB trees don't:
+ *   - The loss function gradients are directly used as an approximate impurity measure.
+ *   - The value reported at a leaf is given by optimizing the loss function on that leaf
+ *     node's partition of the data, rather than just being the mean.
+ * - In the case of squared error loss, variance impurity and mean leaf estimates happen
+ *   to make the SGB and TreeBoost algorithms identical.
+ *
+ * [[GBTRegressor]] will use the usual `"variance"` impurity by default, conforming to
+ * SGB behavior. For TreeBoost, set impurity to `"loss-based"`. Note TreeBoost is currently
+ * incompatible with absolute error.
+ *
+ * Currently, however, even TreeBoost behavior uses variance impurity for split selection,
+ * for simplicity and speed; only leaf value selection is aligned with theory. This is the
+ * approach `R`'s
--- End diff --

done
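For context, a minimal usage sketch of the two modes the doc above describes. This assumes the `"loss-based"` impurity value proposed in this PR; it is not part of released Spark, where `"variance"` is the only supported impurity for GBTs:

    import org.apache.spark.ml.regression.GBTRegressor

    // Default behavior: Stochastic Gradient Boosting, with variance impurity
    // used for both split selection and leaf values.
    val sgb = new GBTRegressor()
      .setLossType("squared")
      .setMaxIter(50)

    // TreeBoost behavior as proposed by this diff: leaf values are obtained by
    // optimizing the loss on each leaf's partition of the data. "loss-based" is
    // the setting introduced by this PR, and per the doc it is currently
    // incompatible with the absolute error loss.
    val treeBoost = new GBTRegressor()
      .setLossType("squared")
      .setImpurity("loss-based")
      .setMaxIter(50)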