[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

jkbradley Mon, 01 Dec 2014 11:45:58 -0800

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3461#discussion_r21114104
  
    --- Diff: docs/mllib-gbt.md ---
    @@ -0,0 +1,308 @@
    +---
    +layout: global
    +title: Gradient-Boosted Trees - MLlib
    +displayTitle: <a href="mllib-guide.html">MLlib</a> - Gradient-Boosted Trees
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +[Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting)
    +are ensembles of [decision trees](mllib-decision-tree.html).
    +GBTs iteratively train decision trees in order to minimize a loss function.
    +Like decision trees, GBTs handle categorical features,
    +extend to the multiclass classification setting, do not require
    +feature scaling, and are able to capture non-linearities and feature interactions.
    +
    +MLlib supports GBTs for binary classification and for regression,
    +using both continuous and categorical features.
    +MLlib implements GBTs using the existing [decision tree](mllib-decision-tree.html) implementation.  Please see the decision tree guide for more information on trees.
    +
    +*Note*: GBTs do not yet support multiclass classification.  For multiclass problems, please use
    +[decision trees](mllib-decision-tree.html) or [Random Forests](mllib-random-forest.html).
    +
    +## Basic algorithm
    +
    +Gradient boosting iteratively trains a sequence of decision trees.
    +On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label.  The dataset is re-labeled to put more weight on training instances with poor predictions.  Thus, in the next iteration, the decision tree will help correct for previous mistakes.
    +
    +The specific weight mechanism is defined by a loss function (discussed below).  With each iteration, GBTs further reduce this loss function on the training data.
    +
    +### Comparison with Random Forests
    --- End diff --
    
    Should we have a new Ensembles section in the guide?  It might be pretty short.  (And I haven't seen experimental results in the guide; would they belong elsewhere?)
    
    Eventually, I could imagine either (a) an Ensembles section once we have more ensemble algs or (b) a section in the guide covering all algorithms and how to choose between them.
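
    For readers of this thread, the re-labeling loop described in the quoted "Basic algorithm" section can be sketched outside MLlib roughly as follows. This is only a minimal illustration under stated assumptions: squared-error loss, a single numeric feature, and depth-1 trees (stumps). The names `fit_stump`, `fit_gbt`, `predict_gbt`, and `mean_squared_error` are hypothetical and are not MLlib APIs.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation of a depth-1 tree: (threshold, left_value, right_value).
Stump = Tuple[float, float, float]


def fit_stump(xs: Sequence[float], targets: Sequence[float]) -> Stump:
    """Fit a one-split regression tree by minimizing squared error."""
    best: Stump = (0.0, 0.0, 0.0)
    best_err = float("inf")
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        # The largest threshold leaves the right side empty; reuse the left mean.
        right_mean = sum(right) / len(right) if right else left_mean
        err = sum((t - left_mean) ** 2 for t in left)
        err += sum((t - right_mean) ** 2 for t in right)
        if err < best_err:
            best, best_err = (threshold, left_mean, right_mean), err
    return best


def predict_stump(stump: Stump, x: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gbt(xs: Sequence[float], ys: Sequence[float],
            num_iterations: int = 50, learning_rate: float = 0.1) -> List[Stump]:
    """Train stumps sequentially; each one is fit to the current residuals."""
    ensemble: List[Stump] = []
    predictions = [0.0] * len(xs)
    for _ in range(num_iterations):
        # "Re-label" the data: for squared error, the new labels are the residuals,
        # so instances with poor predictions get the largest targets.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        ensemble.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return ensemble


def predict_gbt(ensemble: Sequence[Stump], x: float, learning_rate: float = 0.1) -> float:
    """Sum the shrunken predictions of every stump in the ensemble."""
    return sum(learning_rate * predict_stump(s, x) for s in ensemble)


def mean_squared_error(ensemble: Sequence[Stump], xs: Sequence[float],
                       ys: Sequence[float], learning_rate: float = 0.1) -> float:
    return sum((y - predict_gbt(ensemble, x, learning_rate)) ** 2
               for x, y in zip(xs, ys)) / len(xs)


# Toy data (made up for illustration): two plateaus with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

ensemble = fit_gbt(xs, ys, num_iterations=50, learning_rate=0.1)
for n in (1, 10, 50):
    # Training loss using only the first n trees of the ensemble.
    print(n, mean_squared_error(ensemble[:n], xs, ys, 0.1))
```

    Evaluating the loss on prefixes of the ensemble shows the training error shrinking as trees are added, which is the behavior the quoted text describes. Other losses would change how the residuals are computed; that part is not shown here.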


