[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

manishamde Sun, 30 Nov 2014 19:21:07 -0800

Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3461#discussion_r21068117
  
    --- Diff: docs/mllib-gbt.md ---
    @@ -0,0 +1,308 @@
    +---
    +layout: global
    +title: Gradient-Boosted Trees - MLlib
    +displayTitle: <a href="mllib-guide.html">MLlib</a> - Gradient-Boosted Trees
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +[Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting)
    +are ensembles of [decision trees](mllib-decision-tree.html).
    +GBTs iteratively train decision trees in order to minimize a loss function.
    +Like decision trees, GBTs handle categorical features,
    +extend to the multiclass classification setting, do not require
    +feature scaling, and are able to capture non-linearities and feature interactions.
    +
    +MLlib supports GBTs for binary classification and for regression,
    +using both continuous and categorical features.
    +MLlib implements GBTs using the existing [decision tree](mllib-decision-tree.html) implementation.  Please see the decision tree guide for more information on trees.
    +
    +*Note*: GBTs do not yet support multiclass classification.  For multiclass problems, please use
    +[decision trees](mllib-decision-tree.html) or [Random Forests](mllib-random-forest.html).
    +
    +## Basic algorithm
    +
    +Gradient boosting iteratively trains a sequence of decision trees.
    +On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label.  The dataset is re-labeled to put more weight on training instances with poor predictions.  Thus, in the next iteration, the decision tree will help correct for previous mistakes.
    +
    +The specific weight mechanism is defined by a loss function (discussed below).  With each iteration, GBTs further reduce this loss function on the training data.
    +
    +### Comparison with Random Forests
    +
    +Both GBTs and [Random Forests](mllib-random-forest.html) are algorithms for learning ensembles of trees, but the training processes are different.  There are several practical trade-offs:
    +
    + * GBTs may be able to achieve the same accuracy using fewer trees, so the model produced may be smaller (faster for test time prediction).
    + * GBTs train one tree at a time, so they can take longer to train than random forests.  Random Forests can train multiple trees in parallel.
    +   * On the other hand, it is often reasonable to use smaller trees with GBTs than with Random Forests, and training smaller trees takes less time.
    + * Random Forests can be less prone to overfitting.  Training more trees in a Random Forest reduces the likelihood of overfitting, but training more trees with GBTs increases the likelihood of overfitting.
    +
    +In short, both algorithms can be effective.  GBTs may be more useful if test time prediction speed is important.  Random Forests are arguably more successful in industry.
    --- End diff --
    
    I think we should avoid this last claim lest people start quoting us in the future. :-) There are a lot of results comparing RF and Boosting, and they are mixed.
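    For anyone following the "Basic algorithm" text quoted in the diff, here is a minimal sketch of the boosting loop it describes. It is not MLlib's implementation. It assumes squared-error loss (so the "re-labeling" step is just the residual of the current ensemble), a single numeric feature, and one-split trees (stumps). The names `fit_stump`, `fit_gbt`, `predict_gbt`, and `mse`, and the `learning_rate` parameter, are made up for illustration.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]        # (threshold, left_value, right_value)
Model = Tuple[float, float, List[Stump]]  # (base_prediction, learning_rate, stumps)


def fit_stump(xs: List[float], targets: List[float]) -> Stump:
    """Fit a one-split regression tree on one feature by least squares."""
    best, best_err = None, float("inf")
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lv = sum(left) / len(left)                   # left is never empty
        rv = sum(right) / len(right) if right else lv
        err = sum((r - (lv if x <= t else rv)) ** 2 for x, r in zip(xs, targets))
        if err < best_err:
            best, best_err = (t, lv, rv), err
    return best


def predict_stump(stump: Stump, x: float) -> float:
    t, lv, rv = stump
    return lv if x <= t else rv


def fit_gbt(xs: List[float], ys: List[float],
            num_iterations: int = 20, learning_rate: float = 0.5) -> Model:
    """Train stumps one at a time, each on the current ensemble's residuals."""
    base = sum(ys) / len(ys)              # start from the mean label
    preds = [base] * len(xs)
    stumps: List[Stump] = []
    for _ in range(num_iterations):
        # Re-label: for squared error, the new targets are the residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add the new tree's (shrunken) correction to the ensemble prediction.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbt(model: Model, x: float) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model: Model, xs: List[float], ys: List[float]) -> float:
    return sum((y - predict_gbt(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

    Under these assumptions, each extra iteration can only lower the training error, which matches the "further reduce this loss function on the training data" wording. It says nothing about test error, which is where the overfitting trade-off with Random Forests comes in.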



