[ 
https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17115343#comment-17115343
 ] 

xujiajin commented on SPARK-3159:
---------------------------------

Is it possible to control the prune parameter when training a decision tree 
model?

I need the probability values produced by the decision tree. According to the 
source code, the prune parameter controls whether sibling leaf nodes with the 
same prediction are merged into their parent. Although pruning does not affect 
the prediction, it does affect the probability. The default value of the prune 
parameter is true and cannot be changed. Below is the (decompiled) source code:
{code:java}
public Node toNode(boolean prune) {
    // Leaf case: no children. An impurity of -1.0 marks invalid stats.
    if (this.leftChild().isEmpty() && this.rightChild().isEmpty()) {
        return this.stats().valid()
            ? new LeafNode(this.stats().impurityCalculator().predict(),
                           this.stats().impurity(), this.stats().impurityCalculator())
            : new LeafNode(this.stats().impurityCalculator().predict(),
                           -1.0D, this.stats().impurityCalculator());
    }

    assert this.leftChild().nonEmpty() && this.rightChild().nonEmpty()
        && this.split().nonEmpty() && this.stats() != null
        : "Unknown error during Decision Tree learning.  Could not convert LearningNode to Node.";

    Node l = ((LearningNode) this.leftChild().get()).toNode(prune);
    Node r = ((LearningNode) this.rightChild().get()).toNode(prune);

    // When prune is true, two leaf children with the same prediction are
    // collapsed into a single leaf and the split is discarded.
    if (prune && l instanceof LeafNode && r instanceof LeafNode
            && ((LeafNode) l).prediction() == ((LeafNode) r).prediction()) {
        return new LeafNode(((LeafNode) l).prediction(),
                            this.stats().impurity(), this.stats().impurityCalculator());
    }

    return new InternalNode(this.stats().impurityCalculator().predict(),
                            this.stats().impurity(), this.stats().gain(),
                            l, r, (Split) this.split().get(),
                            this.stats().impurityCalculator());
}
{code}
The following is an example of the effect of the prune parameter on the 
probability. The graph below shows the tree structure when minInstancesPerNode 
is 29; when minInstancesPerNode is 30, the "feature2 <= 6.15" node is deleted 
because all of its children produce the same prediction. The prediction is 
unchanged, but the resulting probability changes considerably.

!image-2020-05-24-23-00-38-419.png!
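A small numeric sketch of why the merge matters for probabilities. The class counts below are hypothetical (not taken from the screenshot above): both sibling leaves predict class 0, so with prune=true they are collapsed into their parent, and the parent's pooled counts yield a different probability than either leaf would have reported.
{code:java}
public class PruneProbabilityDemo {
    // Probability of class 0 from a leaf's class-count vector, as an
    // ImpurityCalculator would derive it (count of class 0 / total count).
    public static double probabilityOfClass0(double[] counts) {
        double total = 0.0;
        for (double c : counts) total += c;
        return counts[0] / total;
    }

    public static void main(String[] args) {
        // Two sibling leaves that both predict class 0 (the majority class),
        // so prune=true merges them into one leaf with the pooled counts.
        double[] leftLeaf  = {30.0, 10.0};
        double[] rightLeaf = {20.0, 15.0};
        double[] merged    = {50.0, 25.0};

        System.out.println(probabilityOfClass0(leftLeaf));   // 0.75
        System.out.println(probabilityOfClass0(rightLeaf));  // ~0.571
        System.out.println(probabilityOfClass0(merged));     // ~0.667
    }
}
{code}
So a sample routed to the left leaf would get P(class 0) = 0.75 with prune=false, but 0.667 after the merge, even though the predicted class is identical.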

> Check for reducible DecisionTree
> --------------------------------
>
>                 Key: SPARK-3159
>                 URL: https://issues.apache.org/jira/browse/SPARK-3159
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Alessandro Solimando
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> Improvement: test-time computation
> Currently, pairs of leaf nodes with the same parent can both output the same 
> prediction.  This happens since the splitting criterion (e.g., Gini) is not 
> the same as prediction accuracy/MSE; the splitting criterion can sometimes be 
> improved even when both children would still output the same prediction 
> (e.g., based on the majority label for classification).
> We could check the tree and reduce it if possible after training.
> Note: This happens with scikit-learn as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
