[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492839#comment-14492839 ]

Max Kaznady commented on SPARK-3727:

I implemented the same thing but for PySpark. Since there is no existing function, should I just call the function predict_proba like in sklearn? Also, does it make sense to open a new ticket for this, since it's so specific?

Thanks,
Max

DecisionTree, RandomForest: More prediction functionality

Key: SPARK-3727
URL: https://issues.apache.org/jira/browse/SPARK-3727
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley

DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful.

For classification: estimated probability of each possible label
For regression: variance of estimate

RandomForest could also create aggregate predictions in multiple ways:
* Predict mean or median value for regression.
* Compute variance of estimates (across all trees) for both classification and regression.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
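For illustration only, the aggregate regression outputs listed in the issue description (mean, median, and variance across trees) could be computed roughly as follows. This is a minimal plain-Python sketch, not Spark code: the function name aggregate_regression and its list-of-floats input are assumptions made for the example, and the choice of population variance over sample variance is also an assumption, since the ticket does not say which is intended.

```python
from statistics import mean, median, pvariance


def aggregate_regression(tree_predictions):
    """Summarize the per-tree predictions for a single input row.

    tree_predictions: one float per tree in the forest (hypothetical input;
    a real implementation would obtain these from the trained ensemble).
    """
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    return {
        "mean": mean(tree_predictions),
        "median": median(tree_predictions),
        # Population variance of the estimates across all trees.
        "variance": pvariance(tree_predictions),
    }


# Four trees disagreeing slightly about one row.
summary = aggregate_regression([2.0, 4.0, 4.0, 6.0])
```

A large variance relative to the mean would suggest that the trees disagree on that row, which is the kind of extra prediction information the ticket asks for.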
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492887#comment-14492887 ]

Joseph K. Bradley commented on SPARK-3727:

Thanks for your initial work on this ticket! The main issue with this extension is API stability: modifying the existing classes would also require us to update model save/load versioning, default constructors to ensure binary compatibility, etc.

I just linked a JIRA which discusses updating the tree and ensemble APIs under the spark.ml package, which will permit us to redesign the APIs (and make it easier to specify class probabilities or stats for regression). What I'd like to do is get the tree API updates in (this week), and then we could work together to make the class probabilities available under the new API. Does that sound good?

Also, if you're new to contributing to Spark, please make sure to check out: [https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark]

Thanks!
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492906#comment-14492906 ]

Max Kaznady commented on SPARK-3727:

Yes, probabilities have to be added to other models too, like LogisticRegression. Right now they are hardcoded in two places but not output in PySpark.

I think it makes sense to split the work into PySpark, then classification, then probabilities, and then to group the different types of algorithms that output probabilities: Logistic Regression, Random Forest, etc.

We can also add probabilities for trees by counting the number of 1's and 0's in each leaf.

What do you think?
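The leaf-counting idea mentioned above could look roughly like this. It is a hedged plain-Python sketch, not the PySpark implementation discussed in the thread: the function name leaf_probabilities and the dictionary of per-label training counts are hypothetical, chosen only to show the arithmetic.

```python
def leaf_probabilities(label_counts):
    """Turn the training-label counts stored in one leaf into probabilities.

    label_counts: dict mapping label -> number of training samples with that
    label that ended up in this leaf (hypothetical representation).
    """
    total = sum(label_counts.values())
    if total == 0:
        raise ValueError("leaf has no training samples")
    # Each probability is the fraction of the leaf's samples with that label.
    return {label: count / total for label, count in label_counts.items()}


# A leaf that received three samples labeled 0 and one sample labeled 1.
probs = leaf_probabilities({0: 3, 1: 1})
```

For that leaf the sketch yields 0.75 for label 0 and 0.25 for label 1, so the leaf would still predict label 0 but now also reports how mixed it was.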
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492871#comment-14492871 ]

Max Kaznady commented on SPARK-3727:

I thought it would be more fitting to separate this: https://issues.apache.org/jira/browse/SPARK-6884
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491919#comment-14491919 ]

Michael Kuhlen commented on SPARK-3727:

Hello!

I've implemented predictWithProbabilities() methods for DecisionTreeModel and treeEnsembleModels in Scala. These methods return both the most likely class and the probabilities of each class. As in scikit-learn, the probabilities are defined as "the mean predicted class probabilities of the trees in the forest[, where the] class probability of a single tree is the fraction of samples of the same class in a leaf." ([sklearn.ensemble.RandomForestClassifier.predict_proba|http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba])

My approach was to modify the Predict class to hold the class probabilities for all classes (as opposed to just the majority class), and I use these probabilities to compute the means over all trees. I believe this should work for GBTrees as well, since I'm taking care to weight the probabilities by the weight of each tree (=1.0 for RandomForest).

Here's a [link to my fork|https://github.com/apache/spark/compare/master...mqk:master] showing my modifications. I would be happy to issue a pull request for these changes, if that would be of interest to the community.

Although I haven't done so yet, I believe it should be straightforward to extend this to also calculate the variance of estimates for regression algorithms, as suggested in this ticket.

Best,
Mike
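As a rough illustration of the averaging described in this comment, the sketch below computes the weighted mean of per-tree class probabilities and picks the most likely class. It is plain Python written for this example, not the Scala code from the linked fork: the function name predict_with_probabilities only echoes the method name in the comment, and the list-based inputs and the default weight of 1.0 per tree are assumptions.

```python
def predict_with_probabilities(tree_probs, tree_weights=None):
    """Combine per-tree class probabilities into one ensemble prediction.

    tree_probs: one list of class probabilities per tree, all the same length
    (hypothetical input; each list would come from the leaf a row falls into).
    tree_weights: one weight per tree; defaults to 1.0 for every tree, which
    matches the RandomForest case mentioned in the comment.
    """
    if not tree_probs:
        raise ValueError("need at least one tree")
    if tree_weights is None:
        tree_weights = [1.0] * len(tree_probs)
    if len(tree_weights) != len(tree_probs):
        raise ValueError("need exactly one weight per tree")

    total_weight = sum(tree_weights)
    num_classes = len(tree_probs[0])
    averaged = [0.0] * num_classes
    # Accumulate the weighted probability of every class over all trees.
    for probs, weight in zip(tree_probs, tree_weights):
        for k, p in enumerate(probs):
            averaged[k] += weight * p
    averaged = [p / total_weight for p in averaged]

    # The predicted label is the class with the highest averaged probability.
    best = max(range(num_classes), key=averaged.__getitem__)
    return best, averaged


# Three equally weighted trees voting on a two-class problem.
label, probs = predict_with_probabilities(
    [[0.75, 0.25], [0.25, 0.75], [0.0, 1.0]]
)
```

With these inputs class 1 wins with an averaged probability of about 0.67. Whether the same weighted average is a meaningful probability for gradient-boosted trees is the comment author's stated belief, and the sketch does not verify it.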