[jira] [Commented] (SPARK-3727) Trees and ensembles: More prediction functionality

2015-08-03 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651520#comment-14651520
 ] 

Max Kaznady commented on SPARK-3727:


I am currently away from the office and will respond to your email on 
Wednesday, August 5th.

For urgent requests, please contact my manager, Steven Yuan.



 Trees and ensembles: More prediction functionality
 --

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.
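
The aggregate predictions listed above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the MLlib API: the function name summarize_tree_predictions is hypothetical, and it assumes the per-tree predictions for a single input row have already been collected into a non-empty list.

```python
from statistics import mean, median, pvariance


def summarize_tree_predictions(tree_predictions):
    """Summarize the outputs of all trees for a single input row.

    tree_predictions: non-empty list with one numeric prediction per tree
    (hypothetical input; how the per-tree outputs are obtained is not shown).
    """
    return {
        # Current behaviour for regression: average over the trees.
        "mean": mean(tree_predictions),
        # Alternative aggregate that is less sensitive to outlier trees.
        "median": median(tree_predictions),
        # Spread of the estimates across all trees (population variance).
        "variance": pvariance(tree_predictions),
    }


summary = summarize_tree_predictions([1.0, 2.0, 3.0, 6.0])
print(summary)
```

For classification, the same variance computation could be applied to the 0/1 votes of the trees, which is one possible reading of "variance of estimates" above.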



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6884) Random forest: predict class probabilities

2015-07-29 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646013#comment-14646013
 ] 

Max Kaznady commented on SPARK-6884:


Sorry, I've been trying to set up a work environment to push the change.

The problem is security at my workplace: I can't push out any code that 
I've developed there. So I would have to redevelop it from scratch at home 
and push the change from there.

 Random forest: predict class probabilities
 --

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor that counts 
 the votes of the individual trees for class 1 and divides by the total 
 number of votes.
 I opened this ticket to keep track of changes. Will update once I push my 
 code to master.






[jira] [Commented] (SPARK-6884) Random forest: predict class probabilities

2015-07-29 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646014#comment-14646014
 ] 

Max Kaznady commented on SPARK-6884:


Thanks, this is a better way going forward.

 Random forest: predict class probabilities
 --

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor that counts 
 the votes of the individual trees for class 1 and divides by the total 
 number of votes.
 I opened this ticket to keep track of changes. Will update once I push my 
 code to master.






[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492839#comment-14492839
 ] 

Max Kaznady commented on SPARK-3727:


I implemented the same thing but for PySpark. Since there is no existing 
function, should I just call the function predict_proba like in sklearn? 

Also, does it make sense to open a new ticket for this, since it's so specific?

Thanks,
Max

 DecisionTree, RandomForest: More prediction functionality
 -

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.






[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492959#comment-14492959
 ] 

Max Kaznady commented on SPARK-6113:


[~josephkb] Is it possible to host the API design doc somewhere other than 
Google Docs? My employer's policy (like most other corporate policies) 
forbids access to Google Docs, so I cannot download the file.

 Stabilize DecisionTree and ensembles APIs
 -

 Key: SPARK-6113
 URL: https://issues.apache.org/jira/browse/SPARK-6113
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 *Issue*: The APIs for DecisionTree and ensembles (RandomForests and 
 GradientBoostedTrees) have been experimental for a long time.  The API has 
 become very convoluted because trees and ensembles have many, many variants, 
 some of which we have added incrementally without a long-term design.
 *Proposal*: This JIRA is for discussing changes required to finalize the 
 APIs.  After we discuss, I will make a PR to update the APIs and make them 
 non-Experimental.  This will require making many breaking changes; see the 
 design doc for details.
 [Design doc | 
 https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]:
  This outlines current issues and the proposed API.






[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492989#comment-14492989
 ] 

Max Kaznady commented on SPARK-6113:


Other places need serious improvement as well; LogisticRegressionWithLBFGS is 
another example.
 
All LogisticRegression classifiers need a logistic function. I found this 
ticket, but I’m not sure why it’s closed:
https://issues.apache.org/jira/browse/SPARK-3585
 
I think LogisticRegression and RandomForest should use the same name for the 
probability prediction function. I would just call it predict_proba, since 
then at least PySpark is consistent with the sklearn library.
 
Internally, the logistic function should be implemented as a single function, 
not hard-coded in multiple places the way it is now. That’s another ticket.
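
As a rough illustration of what a single shared logistic function could look like, here is a minimal plain-Python sketch. The name logistic and its placement are assumptions made for this example; it is not the existing MLlib code. The two branches only avoid overflow in exp for large negative inputs.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function 1 / (1 + exp(-z)), written once.

    Hypothetical shared helper: every classifier that needs a probability
    from a margin z would call this instead of re-implementing the formula.
    """
    if z >= 0:
        # exp(-z) <= 1 here, so this cannot overflow.
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(z) < 1, so use the algebraically equivalent form.
    e = math.exp(z)
    return e / (1.0 + e)


print(logistic(0.0))
print(logistic(2.0) + logistic(-2.0))
```

Callers would then map a raw margin to a probability in one place, so a fix or a threshold change only has to be made once.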
 
Aside: I haven’t looked into LogisticRegressionWithSGD, but it sometimes 
fails badly: the algorithm either diverges or gets stuck in local minima.


 Stabilize DecisionTree and ensembles APIs
 -

 Key: SPARK-6113
 URL: https://issues.apache.org/jira/browse/SPARK-6113
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 *Issue*: The APIs for DecisionTree and ensembles (RandomForests and 
 GradientBoostedTrees) have been experimental for a long time.  The API has 
 become very convoluted because trees and ensembles have many, many variants, 
 some of which we have added incrementally without a long-term design.
 *Proposal*: This JIRA is for discussing changes required to finalize the 
 APIs.  After we discuss, I will make a PR to update the APIs and make them 
 non-Experimental.  This will require making many breaking changes; see the 
 design doc for details.
 [Design doc | 
 https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]:
  This outlines current issues and the proposed API.






[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492906#comment-14492906
 ] 

Max Kaznady commented on SPARK-3727:


Yes, probabilities have to be added to other models too, like 
LogisticRegression. Right now they are hardcoded in two places but not 
exposed in PySpark.

I think it makes sense to split this into PySpark, then classification, then 
probabilities, and then group the different types of algorithms that output 
probabilities: Logistic Regression, Random Forest, etc.

We can also add probabilities for trees by counting the number of 1's and 
0's in each leaf.
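
The leaf-counting idea for a single tree could look roughly like this. It is a minimal plain-Python sketch under stated assumptions: the Leaf class and its fields are hypothetical and only stand in for whatever per-leaf label counts a trained tree keeps.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    """Hypothetical leaf node holding training-label counts (binary case)."""
    count_zero: int  # training rows with label 0 that ended up in this leaf
    count_one: int   # training rows with label 1 that ended up in this leaf


def leaf_probability_of_one(leaf):
    """Estimate P(label = 1) for any row routed to this leaf."""
    total = leaf.count_zero + leaf.count_one
    if total == 0:
        raise ValueError("leaf has no training rows")
    return leaf.count_one / total


print(leaf_probability_of_one(Leaf(count_zero=3, count_one=1)))
```

A row's probability is then simply the value for the leaf it is routed to, instead of only the majority label of that leaf.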

What do you think?

 DecisionTree, RandomForest: More prediction functionality
 -

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.






[jira] [Created] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Max Kaznady (JIRA)
Max Kaznady created SPARK-6884:
--

 Summary: random forest predict probabilities functionality (like 
in sklearn)
 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
 Environment: cross-platform
Reporter: Max Kaznady


Currently, there is no way to extract the class probabilities from the 
RandomForest classifier. I implemented a probability predictor that counts 
the votes of the individual trees for class 1 and divides by the total 
number of votes.

I opened this ticket to keep track of changes. Will update once I push my code 
to master.
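
The vote-counting approach described above can be sketched as follows. This is a minimal plain-Python illustration, not the code referred to in the ticket: the function name predict_proba_by_votes is hypothetical, and it assumes binary labels and that each tree's predicted label for one input row is already available in a list.

```python
def predict_proba_by_votes(tree_votes):
    """Estimate P(label = 1) as the fraction of trees voting for class 1.

    tree_votes: list with one predicted label (0 or 1) per tree for a
    single input row (hypothetical input for this sketch).
    """
    if not tree_votes:
        raise ValueError("need at least one tree vote")
    votes_for_one = sum(1 for vote in tree_votes if vote == 1)
    return votes_for_one / len(tree_votes)


print(predict_proba_by_votes([1, 0, 1, 1]))
```

With four trees of which three vote for class 1, the estimate is 3 / 4. The probability of class 0 is one minus that value.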






[jira] [Commented] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492868#comment-14492868
 ] 

Max Kaznady commented on SPARK-6884:


Implemented a prototype; now testing the MapReduce code.

 random forest predict probabilities functionality (like in sklearn)
 ---

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
 Environment: cross-platform
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor that counts 
 the votes of the individual trees for class 1 and divides by the total 
 number of votes.
 I opened this ticket to keep track of changes. Will update once I push my 
 code to master.






[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492871#comment-14492871
 ] 

Max Kaznady commented on SPARK-3727:


I thought it would be more fitting to separate this: 
https://issues.apache.org/jira/browse/SPARK-6884

 DecisionTree, RandomForest: More prediction functionality
 -

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.


