Julian King updated SPARK-34591:
--------------------------------
    Attachment: Reproducible example of Spark bug.pdf

Pyspark undertakes pruning of decision trees and random forests outside the
control of the user, leading to undesirable and unexpected outcomes that are
challenging to diagnose and impossible to correct
------------------------------------------------

                 Key: SPARK-34591
                 URL: https://issues.apache.org/jira/browse/SPARK-34591
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.4.0, 2.4.4, 3.1.1
            Reporter: Julian King
            Priority: Major
              Labels: pyspark
         Attachments: Reproducible example of Spark bug.pdf

*History of the issue*

SPARK-3159 implemented a method designed to reduce the computational burden of
making predictions from decision trees and random forests by pruning the tree
after fitting: branches whose child leaves all produce the same classification
prediction are merged.

This was implemented via a PR: https://github.com/apache/spark/pull/20632

The feature is controlled by a "prune" parameter in the Scala version of the
code, which defaults to true. However, this parameter is not exposed in the
Pyspark API, so the pruning:
* Always occurs, even when the user does not want it to
* Is not documented in the ML documentation, leading to decision tree
behaviour that may conflict with what the user expects to happen

*Why is this a problem?*

+Problem 1: Inaccurate probabilities+

Because the decision to prune is based on the classification prediction from
the tree (not the probability prediction from the node), it introduces
additional bias compared to the situation where no pruning is done. The impact
may be severe in some cases.

+Problem 2: Completely unacceptable behaviour in some circumstances and for
some hyper-parameters+

My colleagues and I encountered this bug in a scenario where we could not get
a decision tree classifier (or a random forest classifier with a single tree)
to split a single node, despite the data plainly supporting a split. This
renders the decision trees and random forests completely unusable. A minimal
reproduction sketch follows.
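Below is a minimal sketch reproducing Problem 2 (and the probability
distortion of Problem 1). It is illustrative only and is not the attached
PDF's example: the data, local session setup, and parameter values are
assumptions. A single binary feature separates two groups with different
positive rates (10% vs 40%) but the same majority class, so the split is
informative for probability estimation, yet both candidate leaves share the
same hard prediction and are merged away.

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.master("local[1]").appName("prune-repro").getOrCreate()

# x=0: 10% positive; x=1: 40% positive. Both groups have majority class 0,
# so splitting on x changes the probabilities but not the hard predictions.
rows = ([(0.0, 0.0)] * 90 + [(0.0, 1.0)] * 10 +
        [(1.0, 0.0)] * 60 + [(1.0, 1.0)] * 40)
df = spark.createDataFrame(rows, ["x", "label"])
df = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

model = DecisionTreeClassifier(maxDepth=5, seed=1).fit(df)

# Without pruning one would expect depth 1 and leaf probabilities of
# 0.10 and 0.40; with the always-on pruning the tree collapses to a
# single root leaf whose probability is the pooled rate of 0.25.
print(model.depth, model.numNodes)
print(model.toDebugString)
{code}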
+Problem 3: Outcomes are highly sensitive to the chosen hyper-parameters and
to how they interact with the data+

Small changes in the hyper-parameters should ideally produce small changes in
the built trees. Here, however, small changes in the hyper-parameters lead to
large and unpredictable changes in the resultant trees as a result of this
pruning.

In principle, this high degree of instability means that re-training the same
model, with the same hyper-parameter settings, on slightly different data may
lead to large variations in the tree structure simply as a result of the
pruning.

+Problem 4: The problems above are much worse for unbalanced data sets+

Probability estimation on unbalanced data sets using trees should be
supported, but this pruning method makes it very difficult: when the minority
class rarely achieves a majority in any leaf, sibling leaves almost always
share the same hard prediction and are merged, in the extreme case collapsing
the tree to a single node.

+Problem 5: This pruning method is a substantial variation from the
description of the decision tree algorithm in the MLLib documents and is not
documented+

This made it extremely confusing for us to work out why we were seeing
certain behaviours; we had to trace back through all of the detailed Spark
release notes to identify where the problem might have been introduced.

*Proposed solutions*

+Option 1 (much easier):+
* Set the default pruning behaviour to False rather than True, bringing the
default behaviour back into alignment with the documentation whilst avoiding
the issues described above

+Option 2 (more involved):+
* Set the default pruning behaviour to False, as in Option 1
* Expand the Pyspark API to expose the pruning behaviour as a
user-controllable option (a sketch of such a parameter follows below)
* Document the change to the API
* Document the change to the tree-building behaviour at the appropriate
points in the Spark ML and Spark MLLib documentation

We recommend that the default behaviour be set to False because always-on
pruning is not the generally understood approach to building decision trees,
where pruning is a separate and user-controllable step.
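To make Option 2 concrete, here is a purely hypothetical sketch of what the
user-facing parameter could look like on the Python side. None of these names
(HasPrune, the "prune" Param, its setter/getter) exist in the current Pyspark
API, and a real implementation would also need to expose the existing internal
Scala "prune" flag through the JVM estimator; this only illustrates the
intended user contract.

{code:python}
from pyspark.ml.param import Param, Params, TypeConverters

class HasPrune(Params):
    """Hypothetical mixin; none of these names exist in Pyspark today."""

    prune = Param(
        Params._dummy(), "prune",
        "Whether to merge sibling leaves that yield the same class "
        "prediction after the tree has been grown. Default: False.",
        typeConverter=TypeConverters.toBoolean)

    def __init__(self):
        super(HasPrune, self).__init__()
        # Default to False, per the recommendation above: pruning stays
        # opt-in rather than silently always-on.
        self._setDefault(prune=False)

    def setPrune(self, value):
        return self._set(prune=value)

    def getPrune(self):
        return self.getOrDefault(self.prune)
{code}

A tree estimator inheriting such a mixin would then accept, for example,
DecisionTreeClassifier(maxDepth=5, prune=True), keeping parity with the
Scala-side flag while leaving the default behaviour unpruned.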