Github user vlad17 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14547#discussion_r78482842

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala ---
@@ -38,25 +38,35 @@ import org.apache.spark.sql.{DataFrame, Dataset}
 import org.apache.spark.sql.functions._
 
 /**
- * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]]
+ * Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting)
  * learning algorithm for regression.
  * It supports both continuous and categorical features.
  *
- * The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.
+ * The implementation offers both Stochastic Gradient Boosting, as in J.H. Friedman (1999),
+ * "Stochastic Gradient Boosting", and TreeBoost, as in Friedman (1999),
+ * "Greedy Function Approximation: A Gradient Boosting Machine".
  *
- * Notes on Gradient Boosting vs. TreeBoost:
- * - This implementation is for Stochastic Gradient Boosting, not for TreeBoost.
+ * Notes on Stochastic Gradient Boosting (SGB) vs. TreeBoost:
+ * - TreeBoost algorithms are a subset of SGB algorithms.
  * - Both algorithms learn tree ensembles by minimizing loss functions.
- * - TreeBoost (Friedman, 1999) additionally modifies the outputs at tree leaf nodes
- *   based on the loss function, whereas the original gradient boosting method does not.
- * - When the loss is SquaredError, these methods give the same result, but they could differ
- *   for other loss functions.
- * - We expect to implement TreeBoost in the future:
- *   [https://issues.apache.org/jira/browse/SPARK-4240]
+ * - TreeBoost has two additional properties that general SGB trees don't:
+ *   - The loss function gradients are directly used as an approximate impurity measure.
+ *   - The value reported at a leaf is given by optimizing the loss function on that leaf
+ *     node's partition of the data, rather than just being the mean.
+ * - In the case of squared error loss, variance impurity and mean leaf estimates happen
+ *   to make the SGB and TreeBoost algorithms identical.
+ *
+ * [[GBTRegressor]] will use the usual `"variance"` impurity by default, conforming to
+ * SGB behavior. For TreeBoost, set impurity to `"loss-based"`. Note TreeBoost is currently
+ * incompatible with absolute error.
+ *
+ * Currently, however, even TreeBoost behavior uses variance impurity for split selection,
+ * for simplicity and speed; only leaf value selection is aligned with theory. This is the
+ * approach `R`'s
--- End diff --

done
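For context, a minimal usage sketch of the two modes the doc above describes. This assumes the `"loss-based"` impurity value proposed in this PR; it is not part of released Spark, where `"variance"` is the only supported impurity for GBTs:

    import org.apache.spark.ml.regression.GBTRegressor

    // Default behavior: Stochastic Gradient Boosting, with variance impurity
    // used for both split selection and leaf values.
    val sgb = new GBTRegressor()
      .setLossType("squared")
      .setMaxIter(50)

    // TreeBoost behavior as proposed by this diff: leaf values are obtained by
    // optimizing the loss on each leaf's partition of the data. "loss-based" is
    // the setting introduced by this PR, and per the doc it is currently
    // incompatible with the absolute error loss.
    val treeBoost = new GBTRegressor()
      .setLossType("squared")
      .setImpurity("loss-based")
      .setMaxIter(50)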