[jira] [Commented] (SPARK-3727) Trees and ensembles: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651520#comment-14651520 ]

Max Kaznady commented on SPARK-3727:

I am currently away from the office and will respond to your email on Wednesday, August 5th. For urgent requests, please contact my manager, Steven Yuan.

Trees and ensembles: More prediction functionality

Key: SPARK-3727
URL: https://issues.apache.org/jira/browse/SPARK-3727
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley

DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful.
For classification: estimated probability of each possible label
For regression: variance of estimate
RandomForest could also create aggregate predictions in multiple ways:
* Predict mean or median value for regression.
* Compute variance of estimates (across all trees) for both classification and regression.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
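The aggregate predictions proposed in the SPARK-3727 description (mean or median for regression, plus the variance of the estimates across trees) could look roughly like the sketch below. It is plain Python and not the MLlib API: the function name `aggregate_regression` and the idea of passing one list of per-tree outputs for a single input row are assumptions made for illustration.

```python
from statistics import mean, median, pvariance


def aggregate_regression(tree_predictions):
    """Summarize the per-tree regression outputs for one input row.

    tree_predictions is a hypothetical list holding one numeric prediction
    per tree in the ensemble; it is not an MLlib data structure.
    """
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    return {
        "mean": mean(tree_predictions),          # what the forest predicts today
        "median": median(tree_predictions),      # alternative that resists outlier trees
        "variance": pvariance(tree_predictions), # spread of the estimates across trees
    }


summary = aggregate_regression([2.0, 3.0, 4.0, 7.0])
print(summary)
```

The population variance is used here because the trees are treated as the whole ensemble rather than a sample; a sample variance would be an equally defensible choice.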
[jira] [Commented] (SPARK-6884) Random forest: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646013#comment-14646013 ]

Max Kaznady commented on SPARK-6884:

Sorry, I've been trying to set up a work environment to push the change. The problem is security at my workplace: I can't push out any code that I developed there, so I would have to redevelop it from scratch at home and push the change in.

Random forest: predict class probabilities

Key: SPARK-6884
URL: https://issues.apache.org/jira/browse/SPARK-6884
Project: Spark
Issue Type: Sub-task
Components: ML
Reporter: Max Kaznady
Labels: prediction, probability, randomforest, tree
Original Estimate: 72h
Remaining Estimate: 72h

Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from individual trees and adding up their votes for 1 and then dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master.
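The vote-counting approach described in the SPARK-6884 ticket can be sketched as follows. This is not the reporter's code, which was never attached to the ticket; the function name `vote_probabilities` and the assumption that each tree's vote is an integer class label in `range(num_classes)` are illustrative only.

```python
from collections import Counter


def vote_probabilities(tree_votes, num_classes=2):
    """Estimate class probabilities as the fraction of trees voting for each label.

    tree_votes is a hypothetical list with one predicted label per tree.
    """
    total = len(tree_votes)
    if total == 0:
        raise ValueError("need at least one tree vote")
    counts = Counter(tree_votes)
    # Share of the votes received by each label, in label order.
    return [counts.get(label, 0) / total for label in range(num_classes)]


# Three of four trees vote for class 1.
probs = vote_probabilities([1, 0, 1, 1])
print(probs)  # [0.25, 0.75]
```

For the binary case in the ticket, the second entry is exactly "votes for 1 divided by the total number of votes".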
[jira] [Commented] (SPARK-6884) Random forest: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646014#comment-14646014 ]

Max Kaznady commented on SPARK-6884:

Thanks, this is a better way going forward.
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492839#comment-14492839 ]

Max Kaznady commented on SPARK-3727:

I implemented the same thing but for PySpark. Since there is no existing function, should I just call the function predict_proba like in sklearn? Also, does it make sense to open a new ticket for this, since it's so specific?

Thanks,
Max

DecisionTree, RandomForest: More prediction functionality

Key: SPARK-3727
URL: https://issues.apache.org/jira/browse/SPARK-3727
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley
[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs
[ https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492959#comment-14492959 ]

Max Kaznady commented on SPARK-6113:

[~josephkb] Is it possible to host the API Design doc on something other than Google Docs? My (and most other) corporate policies forbid access to Google Docs, so I cannot download the file.

Stabilize DecisionTree and ensembles APIs

Key: SPARK-6113
URL: https://issues.apache.org/jira/browse/SPARK-6113
Project: Spark
Issue Type: Sub-task
Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

*Issue*: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design.

*Proposal*: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details.

[Design doc | https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]: This outlines current issues and the proposed API.
[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs
[ https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492989#comment-14492989 ]

Max Kaznady commented on SPARK-6113:

Other places need serious improvement as well; LogisticRegressionWithLBFGS is another example. All LogisticRegression classifiers need a logistic function. I found this ticket, but I’m not sure why it’s closed: https://issues.apache.org/jira/browse/SPARK-3585

I think LogisticRegression and RandomForest should use the same name for the predict_proba function. I would just call it that, since then at least PySpark is consistent with the sklearn library.

Internally, the logistic function should be implemented as a single function, not hard-coded in multiple places the way it is now. That’s another ticket.

Aside: I haven’t looked at LogisticRegressionWithSGD in detail, but it sometimes fails horribly: the algorithm either diverges or gets stuck in local minima.
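The suggestion above, one shared logistic function behind a consistently named `predict_proba`, might be sketched like this. It is a standalone Python illustration, not MLlib or PySpark code: the helper names `logistic` and `predict_proba`, and the plain-list representation of weights and features, are assumptions.

```python
import math


def logistic(margin):
    """Single shared logistic (sigmoid) function, written to avoid overflow."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    # For negative margins exp(margin) is small, so this form cannot overflow.
    e = math.exp(margin)
    return e / (1.0 + e)


def predict_proba(weights, intercept, features):
    """Return [P(label=0), P(label=1)] for a binary linear model.

    The name mirrors sklearn's predict_proba; the signature is hypothetical.
    """
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    p1 = logistic(margin)
    return [1.0 - p1, p1]


print(predict_proba([1.0, -1.0], 0.0, [2.0, 2.0]))  # [0.5, 0.5]
```

Keeping the sigmoid in one place means every classifier that exposes probabilities shares the same numerical behavior, which is the point made in the comment.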
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492906#comment-14492906 ]

Max Kaznady commented on SPARK-3727:

Yes, probabilities have to be added to other models too, like LogisticRegression. Right now they are hardcoded in two places but not output in PySpark.

I think it makes sense to split the work into PySpark, then classification, then probabilities, and then group the different types of algorithms that output probabilities: Logistic Regression, Random Forest, etc.

We can also add probabilities for trees by counting the number of 1's and 0's in each leaf.

What do you think?
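The idea of deriving tree probabilities from the 1's and 0's counted in a leaf can be illustrated with a toy tree. The `Node` class, its fields, and `leaf_probabilities` are hypothetical names for this sketch and do not correspond to MLlib's tree model classes; the sketch also assumes every leaf saw at least one training sample.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Toy binary decision-tree node; leaves carry per-class training counts."""
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    class_counts: Optional[List[int]] = None  # set on leaves only


def leaf_probabilities(node, features):
    """Walk to the leaf for this row and normalize its class counts."""
    while node.class_counts is None:
        go_left = features[node.feature] <= node.threshold
        node = node.left if go_left else node.right
    total = sum(node.class_counts)
    return [count / total for count in node.class_counts]


# Left leaf saw three 0's and one 1; right leaf saw one 0 and four 1's.
tree = Node(feature=0, threshold=0.5,
            left=Node(class_counts=[3, 1]),
            right=Node(class_counts=[1, 4]))
print(leaf_probabilities(tree, [0.2]))  # [0.75, 0.25]
```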
[jira] [Created] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)
Max Kaznady created SPARK-6884:

Summary: random forest predict probabilities functionality (like in sklearn)
Key: SPARK-6884
URL: https://issues.apache.org/jira/browse/SPARK-6884
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.4.0
Environment: cross-platform
Reporter: Max Kaznady

Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from individual trees and adding up their votes for 1 and then dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master.
[jira] [Commented] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492868#comment-14492868 ]

Max Kaznady commented on SPARK-6884:

Implemented a prototype; now testing the MapReduce code.

random forest predict probabilities functionality (like in sklearn)

Key: SPARK-6884
URL: https://issues.apache.org/jira/browse/SPARK-6884
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.4.0
Environment: cross-platform
Reporter: Max Kaznady
Labels: prediction, probability, randomforest, tree
Original Estimate: 72h
Remaining Estimate: 72h
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492871#comment-14492871 ]

Max Kaznady commented on SPARK-3727:

I thought it would be more fitting to separate this: https://issues.apache.org/jira/browse/SPARK-6884