[ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian King updated SPARK-34591:
--------------------------------
    Attachment: Reproducible example of Spark bug.pdf

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34591
>                 URL: https://issues.apache.org/jira/browse/SPARK-34591
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.0, 2.4.4, 3.1.1
>            Reporter: Julian King
>            Priority: Major
>              Labels: pyspark
>         Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable via a "prune" parameter in the Scala version of 
> the code, which defaults to True. However, this parameter is not exposed in 
> the Pyspark API, resulting in the pruning:
>  * Always occurring, even when the user does not want it to occur
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
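> To make the merge rule concrete, the following is a minimal pure-Python 
> sketch (not Spark's actual implementation) of the behaviour described above, 
> assuming each leaf carries a vector of class counts: sibling leaves whose 
> majority-class predictions agree are collapsed into their parent.

```python
# Sketch of the leaf-merge rule described in SPARK-3159 (illustrative only,
# not Spark's code): sibling leaves whose argmax class predictions agree are
# collapsed into their parent, which becomes a leaf with the pooled counts.

def argmax(counts):
    return max(range(len(counts)), key=lambda i: counts[i])

def prune(node):
    """node is either a leaf {'counts': [...]} or an internal node
    {'left': ..., 'right': ...}; returns the pruned tree."""
    if 'counts' in node:  # already a leaf
        return node
    left, right = prune(node['left']), prune(node['right'])
    if ('counts' in left and 'counts' in right
            and argmax(left['counts']) == argmax(right['counts'])):
        pooled = [a + b for a, b in zip(left['counts'], right['counts'])]
        return {'counts': pooled}  # the split disappears
    return {'left': left, 'right': right}

# Both children predict class 0, so the split is removed even though their
# class-0 probabilities differ substantially (0.90 vs 0.60):
tree = {'left': {'counts': [90, 10]}, 'right': {'counts': [60, 40]}}
print(prune(tree))  # {'counts': [150, 50]}
```

> Note that the rule keys only on the argmax prediction, never on the 
> probabilities, which is the root of the problems below.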
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), pruning introduces 
> additional bias compared to leaving the tree unpruned. The impact may be 
> severe in some cases.
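> A worked numeric example of this bias (illustrative figures, not taken from 
> Spark):

```python
# Two sibling leaves that both predict class 0, but with very different
# confidence (0.90 vs 0.60 for class 0). Counts are illustrative only.
left, right = [90, 10], [60, 40]
p_left = left[0] / sum(left)        # 0.90
p_right = right[0] / sum(right)     # 0.60
# After pruning, the two leaves pool into one, so every row that would have
# landed in either leaf now receives the same blended probability:
pooled = [a + b for a, b in zip(left, right)]
p_pooled = pooled[0] / sum(pooled)  # 150 / 200 = 0.75
print(p_left, p_right, p_pooled)
```

> Rows that would have scored 0.90 are pulled down to 0.75 and rows that would 
> have scored 0.60 are pushed up to 0.75, even though no classification 
> prediction changes.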
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning.
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult.
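> For example (illustrative counts), under a strong class imbalance every leaf 
> may predict the majority class even when the leaf probabilities differ 
> substantially, so pruning keyed on the argmax prediction can collapse the 
> entire tree:

```python
# With a heavy class imbalance, leaves with very different class-0
# probabilities (0.95, 0.80, 0.60) all still predict class 0, so any pruning
# rule keyed on the argmax prediction will merge every split above them away,
# potentially leaving a single root leaf.
leaves = [[950, 50], [800, 200], [600, 400]]  # class counts per leaf
preds = [max(range(2), key=counts.__getitem__) for counts in leaves]
print(preds)  # [0, 0, 0]: all leaves agree on the majority class
```

> The more unbalanced the data, the more leaves agree on the majority class, 
> and the more aggressively the tree is merged away.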
> +Problem 5: This pruning method is a substantial, undocumented variation from 
> the description of the decision tree algorithm in the MLLib documentation+
> This made it extremely confusing for us to work out why we were seeing 
> certain behaviours - we had to trace back through all of the detailed Spark 
> release notes to identify where the problem might have been introduced.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False (as in Option 1)
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because automatic 
> pruning is not the generally understood approach to building decision trees, 
> where pruning is a separate and user-controllable step.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
