Github user vlad17 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14547#discussion_r78481687
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
    @@ -42,18 +42,30 @@ import org.apache.spark.sql.types.DoubleType
     /**
      * Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting)
      * learning algorithm for classification.
    - * It supports binary labels, as well as both continuous and categorical features.
      * Note: Multiclass labels are not currently supported.
    + * It supports both continuous and categorical features.
      *
    - * The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.
    + * The implementation offers both Stochastic Gradient Boosting, as in J.H. Friedman 1999,
    + * "Stochastic Gradient Boosting", and TreeBoost, as in Friedman 1999,
    + * "Greedy Function Approximation: A Gradient Boosting Machine".
      *
    - * Notes on Gradient Boosting vs. TreeBoost:
    - *  - This implementation is for Stochastic Gradient Boosting, not for TreeBoost.
    + * Notes on Stochastic Gradient Boosting (SGB) vs. TreeBoost:
    + *  - TreeBoost algorithms are a subset of SGB algorithms.
      *  - Both algorithms learn tree ensembles by minimizing loss functions.
    - *  - TreeBoost (Friedman, 1999) additionally modifies the outputs at tree leaf nodes
    - *    based on the loss function, whereas the original gradient boosting method does not.
    - *  - We expect to implement TreeBoost in the future:
    - *    [https://issues.apache.org/jira/browse/SPARK-4240]
    + *  - TreeBoost has two additional properties that general SGB trees lack:
    + *     - The loss function gradients are directly used as an approximate impurity measure.
    + *     - The value reported at a leaf is obtained by optimizing the loss function
    + *       on that leaf node's partition of the data, rather than just being the mean.
    + *  - In the case of squared error loss, the loss-based impurity reduces to variance
    + *    and the optimized leaf values reduce to means, so the SGB and TreeBoost
    + *    algorithms coincide.
    + *
    + * [[GBTClassifier]] uses the `"loss-based"` impurity by default, conforming to
    + * TreeBoost behavior. For SGB behavior, set impurity to `"variance"`.
    + *
    + * Currently, however, even TreeBoost behavior uses variance impurity for split
    + * selection, for ease and speed. Leaf selection is aligned with the theory. This is
    + * the approach `R`'s
    --- End diff --
    
    done
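
As a quick sanity check of the squared-error equivalence noted in the doc change above (standard gradient-boosting algebra, not specific to this PR): for squared error, the optimal constant prediction on a leaf's partition R is the mean, so TreeBoost's optimized leaf value equals SGB's mean leaf value,

    \gamma^* = \arg\min_{\gamma} \sum_{x_i \in R} (y_i - \gamma)^2
             = \frac{1}{|R|} \sum_{x_i \in R} y_i ,

and the loss-based impurity of R is then, up to scaling, exactly the variance
\frac{1}{|R|} \sum_{x_i \in R} (y_i - \bar{y})^2, so the two algorithms make identical splits and leaf predictions.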
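For reference, a minimal sketch of how the two modes documented above would be selected from user code. This assumes the `impurity` Param value `"loss-based"` proposed in this PR, which is not part of any released Spark API; treat it as illustrative only.

    import org.apache.spark.ml.classification.GBTClassifier

    // TreeBoost behavior (the documented default above): loss-based impurity,
    // with leaf values optimized against the loss on each leaf's partition.
    // NOTE: "loss-based" is the value this PR proposes, not a released API.
    val treeBoost = new GBTClassifier()
      .setImpurity("loss-based")
      .setLossType("logistic")
      .setMaxIter(50)

    // Plain stochastic gradient boosting: variance impurity for splits,
    // mean leaf values.
    val sgb = new GBTClassifier()
      .setImpurity("variance")
      .setLossType("logistic")
      .setMaxIter(50)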

