GitHub user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20632#discussion_r169732559
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala ---
    @@ -287,6 +291,34 @@ private[tree] class LearningNode(
         }
       }
     
    +  /**
    +   * @return true iff the node is a leaf.
    +   */
    +  private def isLeafNode(): Boolean = leftChild.isEmpty && rightChild.isEmpty
    +
    +  // the set of (leaf) predictions appearing in the subtree rooted at the given node.
    +  private lazy val leafPredictions: Set[Double] = {
    --- End diff --
    
    It's only stored during training, though. I agree it could turn into a
    problem, but I wondered how many distinct predictions there could be. In
    the case of regression, maybe a lot, hm.
    
    Doesn't this only prune cases where a node has two leaf-node children with
    the same prediction?
    
    We should be able to prune much more than that. You could do so in multiple
    passes. The other way is roughly what's done here: remember the predictions
    in each learning node.
    
    Hm, yeah now I'm wondering about the regression case here.
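
    To make the second approach concrete, here is a minimal sketch of pruning
    by remembering the leaf predictions under each node. It is not the code in
    this PR: `Tree`, `Leaf`, `Internal` and `PruneSketch.prune` are hypothetical
    stand-ins for illustration, not Spark's `LearningNode` or `Node` API.

```scala
// Hypothetical, simplified tree types; NOT Spark's LearningNode or Node classes.
sealed trait Tree
final case class Leaf(prediction: Double) extends Tree
final case class Internal(left: Tree, right: Tree) extends Tree

object PruneSketch {
  /**
   * Prunes bottom-up in a single traversal. Returns the pruned subtree together
   * with the set of distinct leaf predictions found beneath it.
   */
  def prune(tree: Tree): (Tree, Set[Double]) = tree match {
    case leaf @ Leaf(p) =>
      (leaf, Set(p))
    case Internal(l, r) =>
      val (prunedLeft, leftPreds) = prune(l)
      val (prunedRight, rightPreds) = prune(r)
      val preds = leftPreds ++ rightPreds
      // If every leaf below predicts the same value, the whole subtree
      // can be replaced by a single leaf, whatever its depth.
      if (preds.size == 1) (Leaf(preds.head), preds)
      else (Internal(prunedLeft, prunedRight), preds)
  }
}
```

    In this sketch a subtree of any depth collapses in one traversal when all
    of its leaves agree, rather than only a node with two identical leaf
    children. The `Set[Double]` carried up from each node is the part that
    could get large in the regression case.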

