[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

mengxr Thu, 25 Sep 2014 01:02:03 -0700

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2435#discussion_r18019000
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -128,13 +139,34 @@ private[tree] object DecisionTreeMetadata {
           }
         }
     
    +    // Set number of features to use per node (for random forests).
    +    val _featureSubsetStrategy = featureSubsetStrategy match {
    +      case "auto" => if (numTrees == 1) "all" else "sqrt"
    +      case _ => featureSubsetStrategy
    +    }
    +    val numFeaturesPerNode: Int = _featureSubsetStrategy match {
    +      case "all" => numFeatures
    +      case "sqrt" => math.sqrt(numFeatures).ceil.toInt
    +      case "log2" => math.max(1, (math.log(numFeatures) / math.log(2)).ceil.toInt)
    --- End diff --
    
    The `log2` option is from Breiman's paper: http://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
    
    From R's randomForest doc:
    > Note that the default values are different for classification (sqrt(p) where p is number of
    > variables in x) and regression (p/3)
    
    From http://www.stat.berkeley.edu/~breiman/Using_random_forests_V3.1.pdf
    > this is the only parameter that requires some judgment to set, but
    forests isn't too sensitive to its value as long as it's in the right ball
    park. I have found that setting mtry equal to the square root of
    mdim gives generally near optimum results. My advice is to begin
    with this value and try a value twice as high and half as low
    monitoring the results by setting look=1 and checking the internal
    test set error for a small number of trees. With many noise
    variables present, mtry has to be set higher.
    
    Let's set the default to `sqrt`, keep `log2` and `onethird`, and mention the references in the doc or comments.
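
    A rough sketch of how the strategy resolution could look with `onethird` added. This is only an illustration, not the code in this PR: the object name `FeatureSubsetSketch` and the helper signature are hypothetical, and rounding `onethird` up (with a floor of 1) is an assumption made to match how `sqrt` and `log2` are rounded in the diff above.

```scala
object FeatureSubsetSketch {
  // Hypothetical helper (not part of the PR): maps a strategy name to the
  // number of features considered at each node.
  def numFeaturesPerNode(strategy: String, numFeatures: Int, numTrees: Int): Int = {
    require(numFeatures > 0, "numFeatures must be positive")
    // "auto" keeps the behavior from the diff: all features for a single
    // tree, sqrt for a forest.
    val resolved = strategy match {
      case "auto" => if (numTrees == 1) "all" else "sqrt"
      case other => other
    }
    resolved match {
      case "all" => numFeatures
      // sqrt(p): the classification default in R's randomForest.
      case "sqrt" => math.sqrt(numFeatures).ceil.toInt
      // log2(p): from Breiman's 2001 paper; at least one feature.
      case "log2" => math.max(1, (math.log(numFeatures) / math.log(2)).ceil.toInt)
      // p/3: the regression default in R's randomForest.
      // Assumption: round up and keep at least one feature.
      case "onethird" => math.max(1, (numFeatures / 3.0).ceil.toInt)
      case other =>
        throw new IllegalArgumentException(s"Unsupported featureSubsetStrategy: $other")
    }
  }
}
```

    With 100 features and more than one tree, this would give 10 for `sqrt`, 7 for `log2`, and 34 for `onethird`.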



