I think the simple pruning you have in mind was just never implemented. That sort of pruning wouldn't help much if the nodes maintained a full distribution over classes, since those distributions are rarely identical, but the nodes only keep a single class prediction. After training, I see no value in keeping those nodes: whatever impurity gain the split achieved on the training data is 'lost' once the prediction is collapsed to a single class anyway.

Whether it's easy to implement in the code I don't know, but it's straightforward conceptually.
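As a rough illustration of such a bottom-up pass (a sketch only, not Spark's actual API: the SimpleNode/Leaf/Internal classes and the prune helper are made up for the example, since as far as I can tell the real org.apache.spark.ml.tree.LeafNode/InternalNode constructors are private[ml], so a real implementation would have to live inside that package or happen at training time):

  // Simplified stand-ins for Spark's node classes, for illustration only.
  object PruneSketch extends App {
    sealed trait SimpleNode { def prediction: Double }
    case class Leaf(prediction: Double) extends SimpleNode
    case class Internal(prediction: Double, left: SimpleNode, right: SimpleNode)
      extends SimpleNode

    // Bottom-up pass: once both children reduce to leaves with the same
    // prediction, the parent itself becomes a leaf with that prediction.
    def prune(node: SimpleNode): SimpleNode = node match {
      case leaf: Leaf => leaf
      case Internal(pred, left, right) =>
        (prune(left), prune(right)) match {
          case (Leaf(p1), Leaf(p2)) if p1 == p2 => Leaf(p1)
          case (l, r)                           => Internal(pred, l, r)
        }
    }

    // The leftmost subtree of the example quoted below collapses to one leaf.
    val subtree = Internal(0.0,
      Internal(0.0, Leaf(0.0), Leaf(0.0)),
      Internal(0.0, Leaf(0.0), Leaf(0.0)))
    assert(prune(subtree) == Leaf(0.0))
  }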
On Tue, Feb 13, 2018 at 4:21 AM Alessandro Solimando
<alessandro.solima...@gmail.com> wrote:

> Hello Nick,
> thanks for the pointer, that's interesting.
>
> However, there seems to be a major difference with what I was discussing.
>
> The JIRA issue relates to overfitting and considerations on information
> gain, while what I propose is a much simpler "syntactic" pruning.
>
> Consider a fragment of the example above, the leftmost subtree in
> particular:
>
>>  If (feature 1 <= 0.5)
>>   If (feature 2 <= 0.5)
>>    If (feature 0 <= 0.5)
>>     Predict: 0.0
>>    Else (feature 0 > 0.5)
>>     Predict: 0.0
>>   Else (feature 2 > 0.5)
>>    If (feature 0 <= 0.5)
>>     Predict: 0.0
>>    Else (feature 0 > 0.5)
>>     Predict: 0.0
>
> This corresponds to the following "objects":
>
>> -InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split = org.apache.spark.ml.tree.ContinuousSplit@fdf00000)
>> --InternalNode(prediction = 0.0, impurity = 0.345679012345679, split = org.apache.spark.ml.tree.ContinuousSplit@ffe00000)
>> ---InternalNode(prediction = 0.0, impurity = 0.4444444444444445, split = org.apache.spark.ml.tree.ContinuousSplit@3fe00000)
>> ----LeafNode(prediction = 0.0, impurity = -1.0)
>> ----LeafNode(prediction = 0.0, impurity = 0.0)
>> ---InternalNode(prediction = 0.0, impurity = 0.2777777777777777, split = org.apache.spark.ml.tree.ContinuousSplit@3fe00000)
>> ----LeafNode(prediction = 0.0, impurity = 0.0)
>> ----LeafNode(prediction = 0.0, impurity = -1.0)
>
> For sure, a more comprehensive node-splitting policy based on impurity
> might prevent this situation (by splitting node "ffe00000" you get an
> impurity gain on one child and a loss on the other), but independently of
> this, once the tree is built I would cut the redundant subtree and obtain
> the following:
>
>> -InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split = org.apache.spark.ml.tree.ContinuousSplit@fdf00000)
>> --LeafNode(prediction = 0.0, impurity = ...)
>
> I cannot say whether this is relevant for all the tree ensemble methods,
> but it certainly is for RF, even more than for DT, as the leverage effect
> will be even higher (and the code generating them is the same: DT calls RF
> with numTrees = 1, as far as I can see).
>
> Being an optimization aimed at reducing the model's memory footprint and
> invocation time, it is independent of the statistical considerations about
> overfitting that your reply seems to refer to.
>
> Am I missing something?
>
> Best regards,
> Alessandro
>
> On 13 February 2018 at 10:57, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> There is a long-outstanding JIRA issue about it:
>> https://issues.apache.org/jira/browse/SPARK-3155.
>>
>> It is probably still a useful feature to have for trees, but the
>> priority is not that high, since it may not be that useful for the tree
>> ensemble models.
>>
>> On Tue, 13 Feb 2018 at 11:52 Alessandro Solimando
>> <alessandro.solima...@gmail.com> wrote:
>>
>>> Hello community,
>>> I have recently manually inspected some decision trees computed with
>>> Spark (2.2.1, but the behavior is the same with the latest code on the
>>> repo).
>>>
>>> I have observed that the trees are always complete, even if an entire
>>> subtree leads to the same prediction in its different leaves.
>>>
>>> In such a case, the root of the subtree, instead of being an
>>> InternalNode, could simply be a LeafNode with the (shared) prediction.
>>>
>>> I know that decision trees computed by scikit-learn share the same
>>> characteristic, and I understand that this is needed by construction,
>>> because the redundancy only becomes apparent at the end.
>>>
>>> So my question is: why is this "post-pruning" missing?
>>>
>>> Three hypotheses:
>>>
>>> 1) It is not suitable (for a reason I fail to see)
>>> 2) Such an addition to the code is considered not worth it (in terms
>>> of code complexity, maybe)
>>> 3) It has been overlooked, but could be a favorable addition
>>>
>>> For clarity, I have managed to isolate a small case to reproduce this,
>>> in what follows.
>>>
>>> This is the dataset:
>>>
>>>> +-----+-------------+
>>>> |label|features     |
>>>> +-----+-------------+
>>>> |1.0  |[1.0,0.0,1.0]|
>>>> |1.0  |[0.0,1.0,0.0]|
>>>> |1.0  |[1.0,1.0,0.0]|
>>>> |0.0  |[0.0,0.0,0.0]|
>>>> |1.0  |[1.0,1.0,0.0]|
>>>> |0.0  |[0.0,1.0,1.0]|
>>>> |1.0  |[0.0,0.0,0.0]|
>>>> |0.0  |[0.0,1.0,1.0]|
>>>> |1.0  |[0.0,1.0,1.0]|
>>>> |0.0  |[1.0,0.0,0.0]|
>>>> |0.0  |[1.0,0.0,1.0]|
>>>> |1.0  |[0.0,1.0,1.0]|
>>>> |0.0  |[0.0,0.0,1.0]|
>>>> |0.0  |[1.0,0.0,1.0]|
>>>> |0.0  |[0.0,0.0,1.0]|
>>>> |0.0  |[1.0,1.0,1.0]|
>>>> |0.0  |[1.0,1.0,0.0]|
>>>> |1.0  |[1.0,1.0,1.0]|
>>>> |0.0  |[1.0,0.0,1.0]|
>>>> +-----+-------------+
>>>
>>> This generates the following model:
>>>
>>>> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
>>>>  If (feature 1 <= 0.5)
>>>>   If (feature 2 <= 0.5)
>>>>    If (feature 0 <= 0.5)
>>>>     Predict: 0.0
>>>>    Else (feature 0 > 0.5)
>>>>     Predict: 0.0
>>>>   Else (feature 2 > 0.5)
>>>>    If (feature 0 <= 0.5)
>>>>     Predict: 0.0
>>>>    Else (feature 0 > 0.5)
>>>>     Predict: 0.0
>>>>  Else (feature 1 > 0.5)
>>>>   If (feature 2 <= 0.5)
>>>>    If (feature 0 <= 0.5)
>>>>     Predict: 1.0
>>>>    Else (feature 0 > 0.5)
>>>>     Predict: 1.0
>>>>   Else (feature 2 > 0.5)
>>>>    If (feature 0 <= 0.5)
>>>>     Predict: 0.0
>>>>    Else (feature 0 > 0.5)
>>>>     Predict: 0.0
>>>
>>> As you can see, the following model would be equivalent, but smaller
>>> and faster to evaluate:
>>>
>>>> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2 with 5 nodes
>>>>  If (feature 1 <= 0.5)
>>>>   Predict: 0.0
>>>>  Else (feature 1 > 0.5)
>>>>   If (feature 2 <= 0.5)
>>>>    Predict: 1.0
>>>>   Else (feature 2 > 0.5)
>>>>    Predict: 0.0
>>>
>>> This happens pretty often in real cases, and although the gain of the
>>> "optimized" version for a single model invocation is small, it can
>>> become non-negligible when the number of calls is massive, as one can
>>> expect in a Big Data context.
>>>
>>> I would appreciate your opinion on this matter (whether it is relevant
>>> for a PR or not, pros/cons, etc.).
>>>
>>> Best regards,
>>> Alessandro
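As an addendum to the thread above: to get a feel for how common such redundant subtrees are in a fitted model, one can walk the tree through the public node classes. A rough sketch for spark-shell (leafPredictions/countRedundant are ad-hoc helper names, and `model` stands for any fitted DecisionTreeClassificationModel, e.g. the one from the example above):

  import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

  // Distinct predictions reachable from the leaves under a node.
  def leafPredictions(node: Node): Set[Double] = node match {
    case leaf: LeafNode   => Set(leaf.prediction)
    case in: InternalNode =>
      leafPredictions(in.leftChild) ++ leafPredictions(in.rightChild)
  }

  // Count internal nodes whose entire subtree predicts a single class and
  // could therefore be collapsed into one leaf.
  def countRedundant(node: Node): Int = node match {
    case _: LeafNode      => 0
    case in: InternalNode =>
      (if (leafPredictions(in).size == 1) 1 else 0) +
        countRedundant(in.leftChild) + countRedundant(in.rightChild)
  }

  // Example usage on a trained model:
  // println(s"redundant subtrees: ${countRedundant(model.rootNode)}")

Note that nested redundant nodes are counted separately: for the toy model above this should report 5, even though only 3 collapses are needed to reach the pruned tree shown.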