[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

jkbradley Mon, 01 Dec 2014 11:29:02 -0800

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3461#discussion_r21112912
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are 
considered.
     
     ### Stopping rule
     
    -The recursive tree construction is stopped at a node when one of the two 
conditions is met:
    +The recursive tree construction is stopped at a node when one of the 
following conditions is met:
     
     1. The node depth is equal to the `maxDepth` training parameter.
    -2. No split candidate leads to an information gain at the node.
    +2. No split candidate leads to an information gain greater than 
`minInfoGain`.
    +3. No split candidate produces child nodes which each have at least 
`minInstancesPerNode` training instances.
    +
    +## Usage tips
    +
    +We include a few guidelines for using decision trees by discussing the 
various parameters.
    +There are many parameters, put in order here with the most important first. 
 New users should mainly consider the "Problem specification parameters" 
section below and the `maxDepth` parameter.
    +
    +### Problem specification parameters
    +
    +These parameters describe the problem you want to solve and your dataset.
    +They should be specified and do not require tuning.
    +
    +* **`algo`**: `Classification` or `Regression`
    +
    +* **`numClasses`**: Number of classes (for `Classification` only)
    +
    +* **`categoricalFeaturesInfo`**: Specifies which features are categorical 
and how many categorical values each of those features can take.  This is given 
as a map from feature indices to feature arity (number of categories).  Any 
features not in this map are treated as continuous.
    +  * E.g., `Map(1 -> 2, 4 -> 10)` specifies that feature `1` is binary 
(taking values `0` or `1`) and that feature `4` has 10 categories (values `{0, 
1, ..., 9}`).  Note that feature indices are 0-based: features `1` and `4` are 
the 2nd and 5th elements of an instance's feature vector.
    +  * Note that you do not have to specify `categoricalFeaturesInfo`.  The 
algorithm will still run and may get reasonable results.  However, performance 
should be better if categorical features are properly designated.
    +
    +### Stopping criteria
    --- End diff --
    
    I have mixed feelings about that.  Do default parameters belong here, or 
just in the API docs?  (It's easy for these docs to get out of date if defaults 
change.)  @mengxr @codedeft If you have thoughts, let me know.
    I'd be OK either way.
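
For readers following the quoted diff, here is a minimal Scala sketch of how the three stopping conditions and the `categoricalFeaturesInfo` map could be read. It is illustrative only and is not the MLlib implementation: `StoppingRuleSketch`, `SplitCandidate`, `describeFeature`, and `shouldStop` are hypothetical names introduced for this example, and the three conditions are checked independently, exactly as the doc text lists them.

```scala
// Illustrative sketch only; none of these names are part of the MLlib API.
object StoppingRuleSketch {

  // Hypothetical summary of one split candidate at a node:
  // its information gain and the instance counts of its two children.
  final case class SplitCandidate(infoGain: Double, leftCount: Long, rightCount: Long)

  // Features absent from categoricalFeaturesInfo are treated as continuous;
  // a present feature with arity k takes values 0, 1, ..., k - 1.
  def describeFeature(categoricalFeaturesInfo: Map[Int, Int], featureIndex: Int): String =
    categoricalFeaturesInfo.get(featureIndex) match {
      case Some(arity) => s"categorical with $arity categories (values 0 to ${arity - 1})"
      case None        => "continuous"
    }

  // Returns true when any one of the three documented conditions holds.
  def shouldStop(
      depth: Int,
      candidates: Seq[SplitCandidate],
      maxDepth: Int,
      minInfoGain: Double,
      minInstancesPerNode: Int): Boolean = {
    // 1. The node depth is equal to maxDepth.
    val atMaxDepth = depth == maxDepth
    // 2. No split candidate has information gain greater than minInfoGain.
    val noUsefulGain = !candidates.exists(_.infoGain > minInfoGain)
    // 3. No split candidate gives both children at least minInstancesPerNode instances.
    val noLargeEnoughChildren = !candidates.exists { c =>
      c.leftCount >= minInstancesPerNode && c.rightCount >= minInstancesPerNode
    }
    atMaxDepth || noUsefulGain || noLargeEnoughChildren
  }
}
```

With the map from the doc, `describeFeature(Map(1 -> 2, 4 -> 10), 4)` reports a 10-category feature, while feature `0` falls through to continuous. The parameter values passed to `shouldStop` are left to the caller, so the sketch does not assume any defaults.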



