[jira] [Created] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to diagnose and impossible to correct

Julian King (Jira) Tue, 02 Mar 2021 00:10:04 -0800

Julian King created SPARK-34591:
-----------------------------------

             Summary: Pyspark undertakes pruning of decision trees and random 
forests outside the control of the user, leading to undesirable and unexpected 
outcomes that are challenging to diagnose and impossible to correct
                 Key: SPARK-34591
                 URL: https://issues.apache.org/jira/browse/SPARK-34591
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.4.0
            Reporter: Julian King



*History of the issue*

SPARK-3159 implemented a method designed to reduce the computational burden for 
predictions from decision trees and random forests by pruning the tree after 
fitting. This is done in such a way that branches where child leaves all 
produce the same classification prediction are merged.

This was implemented via a PR: [https://github.com/apache/spark/pull/20632]

This feature is controllable by a "prune" parameter in the Scala version of the 
code, which is set to True as the default behaviour. However, this parameter is 
not exposed in the Pyspark API, resulting in the behaviour above:
 * Occurring always (despite the user may not wanting it to occur)
 * Not being documented in the ML documentation, leading to decision tree 
behavoiur that may be in conflict with what the user expects to happen

*Why is this a problem?*

+Problem 1: Inaccurate probabilities+

Because the decision to prune is based on the classification prediction from 
the tree (not the probability prediction from the node), this introduces 
additional bias compared to the situation where the pruning is not done. The 
impact here may be severe in some cases

+Problem 2: Leads to completely unacceptable behaviours in some circumstances 
and for some hyper-parameters+

My colleagues and I encountered this bug in a scenario where we could not get a 
decision tree classifier (or random forest classifier with a single tree) to 
split a single node, despite this being eminently supported by the data. This 
renders the decision trees and random forests complete unusable

+Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
how they interact with the data+

Small changes in the hyper-parameters should ideally produce small changes in 
the built trees. However, here we have found that small changes in the 
hyper-parameters lead to large and unpredictable changes in the resultant trees 
as a result of this pruning.

In principle, this high degree of instability means that re-training the same 
model, with the same hyper-parameter settings, on slightly different data may 
lead to large variations in the tree structure simply as a result of the pruning

+Problem 4: The problems above are much worse for unbalanced data sets+

Probability estimation on unbalanced data sets using trees should be supported, 
but the pruning method described will make this very difficult

+Problem 5: This pruning method is a substantial variation from the description 
of the decision tree algorithm in the MLLib documents and is not described+

This made it extremely confusing for us in working out why we were seeing 
certain behaviours - we had to trace back through all of the Spark detailed 
release notes to identify where the problem might.

*Proposed solutions*

+Option 1 (much easier):

+

The proposed solution here is:
 * Set the default pruning behaviour to False rather than True, thereby 
bringing the default behaviour back into alignment with the documentation 
whilst avoiding the issues described above

+Option 2 (more involved):+

The proposed solution here is:
 * Leave the default pruning behaviour set to False
 * Expand the pyspark API to expose the pruning behaviour as a 
user-controllable option
 * Document the change to the API
 * Document the change to the tree building behaviour at appropriate points in 
the Spark ML and Spark MLLib documentation

We recommend that the default behaviour be set to False because this approach 
is not the generally understood approach for building decision trees, where 
pruning is decided a separate and user-controllable step.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to diagnose and impossible to correct

Reply via email to