Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3461#discussion_r21113406
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are 
considered.
     
     ### Stopping rule
     
    -The recursive tree construction is stopped at a node when one of the two 
conditions is met:
    +The recursive tree construction is stopped at a node when one of the 
following conditions is met:
     
     1. The node depth is equal to the `maxDepth` training parameter.
    -2. No split candidate leads to an information gain at the node.
    +2. No split candidate leads to an information gain greater than 
`minInfoGain`.
    +3. No split candidate produces child nodes which each have at least 
`minInstancesPerNode` training instances.
    +
    +## Usage tips
    +
    +We include a few guidelines for using decision trees by discussing the various parameters.
    +The parameters are listed below roughly in order of descending importance.  New users should mainly consider the "Problem specification parameters" section below and the `maxDepth` parameter.
    +
    +### Problem specification parameters
    +
    +These parameters describe the problem you want to solve and your dataset.
    +They should be specified and do not require tuning.
    +
    +* **`algo`**: `Classification` or `Regression`
    +
    +* **`numClasses`**: Number of classes (for `Classification` only)
    +
    +* **`categoricalFeaturesInfo`**: Specifies which features are categorical 
and how many categorical values each of those features can take.  This is given 
as a map from feature indices to feature arity (number of categories).  Any 
features not in this map are treated as continuous.
    +  * E.g., `Map(1 -> 2, 4 -> 10)` specifies that feature `1` is binary 
(taking values `0` or `1`) and that feature `4` has 10 categories (values `{0, 
1, ..., 9}`).  Note that feature indices are 0-based: features `1` and `4` are 
the 2nd and 5th elements of an instance's feature vector.
    +  * Note that you do not have to specify `categoricalFeaturesInfo`.  The 
algorithm will still run and may get reasonable results.  However, performance 
should be better if categorical features are properly designated.
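    +
    +A minimal sketch of passing this information to `DecisionTree.trainClassifier` (here `data` stands in for an `RDD[LabeledPoint]` of training instances, and the other values are just examples):
    +
    +```scala
    +import org.apache.spark.mllib.tree.DecisionTree
    +
    +// Feature 1 is binary; feature 4 has 10 categories; all other features are continuous.
    +val categoricalFeaturesInfo = Map(1 -> 2, 4 -> 10)
    +
    +val model = DecisionTree.trainClassifier(
    +  data,                     // RDD[LabeledPoint] (assumed already loaded)
    +  2,                        // numClasses
    +  categoricalFeaturesInfo,
    +  "gini",                   // impurity
    +  5,                        // maxDepth
    +  32)                       // maxBins
    +```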
    +
    +### Stopping criteria
    +
    +These parameters determine when the tree stops building (adding new nodes).
    +These parameters may be tuned.  Be careful to validate on held-out test 
data when tuning in order to avoid overfitting.
    +
    +* **`maxDepth`**: Maximum depth of a tree.  Deeper trees are more 
expressive (potentially allowing higher accuracy), but deeper trees are also 
more costly to train and are more likely to overfit.
    +
    +* **`minInstancesPerNode`**: For a node to be split further, each of its 
children must receive at least this number of training instances.  This is 
commonly used with [Random Forests](mllib-random-forest.html) since those are 
often trained deeper than individual trees.
    +
    +* **`minInfoGain`**: For a node to be split further, the split must 
improve at least this much (in terms of information gain).
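    +
    +As a rough sketch, these stopping criteria can be set via the `Strategy` configuration class and passed to `DecisionTree.train` (parameter names follow the current Scala API and may differ slightly across versions; `data` again stands in for an `RDD[LabeledPoint]`):
    +
    +```scala
    +import org.apache.spark.mllib.tree.DecisionTree
    +import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
    +import org.apache.spark.mllib.tree.impurity.Gini
    +
    +val strategy = new Strategy(
    +  algo = Algo.Classification,
    +  impurity = Gini,
    +  maxDepth = 5,                      // stop growing below depth 5
    +  numClassesForClassification = 2,
    +  minInstancesPerNode = 10,          // each child must receive at least 10 instances
    +  minInfoGain = 0.01)                // each split must improve gain by at least 0.01
    +
    +val model = DecisionTree.train(data, strategy)  // data: RDD[LabeledPoint] (placeholder)
    +```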
    +
    +### Tunable parameters
     
    -## Implementation details
    +These parameters may be tuned.  Be careful to validate on held-out test 
data when tuning in order to avoid overfitting.
     
    -### Max memory requirements
    +* **`maxBins`**: Number of bins used when discretizing continuous features.
    +  * Increasing `maxBins` allows the algorithm to consider more split 
candidates and make fine-grained split decisions.  However, it also increases 
computation and communication.
    +  * Note that the `maxBins` parameter must be at least the maximum number 
of categories `$M$` for any categorical feature.
     
    -For faster processing, the decision tree algorithm performs simultaneous 
histogram computations for
    -all nodes at each level of the tree. This could lead to high memory 
requirements at deeper levels
    -of the tree, potentially leading to memory overflow errors. To alleviate 
this problem, a `maxMemoryInMB`
    -training parameter specifies the maximum amount of memory at the workers 
(twice as much at the
    -master) to be allocated to the histogram computation. The default value is 
conservatively chosen to
    -be 256 MB to allow the decision algorithm to work in most scenarios. Once 
the memory requirements
    -for a level-wise computation cross the `maxMemoryInMB` threshold, the node 
training tasks at each
    -subsequent level are split into smaller tasks.
    +* **`maxMemoryInMB`**: Amount of memory to be used for collecting 
sufficient statistics.
    +  * The default value is conservatively chosen to be 256 MB to allow the 
decision algorithm to work in most scenarios.  Increasing `maxMemoryInMB` can 
lead to faster training (if the memory is available) by allowing fewer passes 
over the data.  However, there may be decreasing returns as `maxMemoryInMB` 
grows since the amount of communication on each iteration can be proportional 
to `maxMemoryInMB`.
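    +
    +A sketch of adjusting these via the same `Strategy` class (same caveats and placeholders as in the previous sketch):
    +
    +```scala
    +// (imports and `data` as in the previous sketch)
    +val strategy = new Strategy(
    +  algo = Algo.Classification,
    +  impurity = Gini,
    +  maxDepth = 5,
    +  numClassesForClassification = 2,
    +  maxBins = 64,            // consider more split candidates per continuous feature
    +  maxMemoryInMB = 512)     // collect statistics for more nodes per pass over the data
    +
    +val model = DecisionTree.train(data, strategy)
    +```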
    --- End diff ---
    
    We communicate a fixed amount of data for every node in the tree which we 
eventually train.  Increasing maxMemoryInMB will increase the number of nodes 
we train in each iteration, but it doesn't change the overall amount of 
communication required (ignoring overhead for very small maxMemoryInMB).  Once 
we're training a fair number of nodes in each iteration, the overhead (making a 
pass over the data + latency in communication) becomes small relative to other 
costs (computing stats + communicating stats).  Once the overhead is small, 
then training more nodes per iteration doesn't help much.  This is what I've 
observed in tests, at least.
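    
    To make that concrete, here's a rough back-of-envelope sketch (illustrative numbers only, not measurements): for classification, the sufficient statistics for one node are roughly numFeatures * numBins * numClasses values, so maxMemoryInMB mainly determines how many nodes we can train per iteration:
    
    ```scala
    // Rough illustration only; assumes ~8 bytes per statistic.
    val numFeatures = 1000
    val numBins = 32
    val numClasses = 2
    val bytesPerNode = numFeatures.toLong * numBins * numClasses * 8            // ~500 KB per node
    val maxMemoryInMB = 256
    val nodesPerIteration = (maxMemoryInMB.toLong * 1024 * 1024) / bytesPerNode // ~500 nodes
    ```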

