Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218208123
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -670,14 +670,32 @@ private[spark] object RandomForest extends Logging {
         val leftImpurity = leftImpurityCalculator.calculate() // Note: This 
equals 0 if count = 0
         val rightImpurity = rightImpurityCalculator.calculate()
     
    -    val leftWeight = leftCount / totalCount.toDouble
    -    val rightWeight = rightCount / totalCount.toDouble
    +    val gain = metadata.impurity match {
    +      case imp if (imp.isTestStatistic) =>
    +        // For split quality measures based on a test-statistic, run the 
test on the
    +        // left and right sub-populations to get a p-value for the null 
hypothesis
    +        val pval = imp.calculate(leftImpurityCalculator, 
rightImpurityCalculator)
    +        // Transform the test statistic p-val into a larger-is-better gain 
value
    +        Impurity.pValToGain(pval)
    +
    +      case _ =>
    +        // Default purity-gain logic:
    +        // measure the weighted decrease in impurity from parent to the 
left and right
    +        val leftWeight = leftCount / totalCount.toDouble
    +        val rightWeight = rightCount / totalCount.toDouble
    +
    +        impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
    +    }
     
    -    val gain = impurity - leftWeight * leftImpurity - rightWeight * 
rightImpurity
    +    // If the impurity being used is a test statistic p-val, apply a 
standard transform into
    +    // a larger-is-better gain value for the minimum-gain threshold
    +    val minGain =
    +      if (metadata.impurity.isTestStatistic) 
Impurity.pValToGain(metadata.minInfoGain)
    +      else metadata.minInfoGain
    --- End diff --
    
    Kind of a design question here... right now the caller has to switch logic 
based on what's inside metadata. Can methods like metadata.minInfoGain just 
implement different logic when the impurity is a test statistic, and so on? 
push this down towards the impurity implementation? I wonder if 
`isTestStatistic` can go away with the right API, but I am not familiar with 
the details of what that requires.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to