[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647741#comment-14647741 ]

Yanbo Liang commented on SPARK-6885:

[~josephkb] I created a new version of InformationGainStats called ImpurityStats. It stores the information gain, the impurity, and the prediction-related data in one data structure, which simplifies LearningNode. It also simplifies and optimizes the binsToBestSplit function. In a way, the change amounts to a code refactoring. I will fix the remaining trivial issues after your review.

> Decision trees: predict class probabilities
> ---
>
> Key: SPARK-6885
> URL: https://issues.apache.org/jira/browse/SPARK-6885
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Assignee: Yanbo Liang
>
> Under spark.ml, have DecisionTreeClassifier (currently being added) extend
> ProbabilisticClassifier.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644908#comment-14644908 ]

Joseph K. Bradley commented on SPARK-6885:

I'd prefer a variant of #1. I think it would be nice if LearningNodes stored something like an ImpurityCalculator (from the mllib.tree implementation), which can store label counts for classification and other stats for regression. (That way, we can add probabilistic predictions for regression in a later PR.) So, rather than PredictionStats storing something specific to classification, it could store an abstract object usable for either classification or regression.

We can keep all of these representations as private API, so I'm OK with creating a new version of InformationGainStats if it's helpful. (I hope we can lazily migrate those classes to spark.ml anyway.)

As for where we store stats, I'd prefer we store them at all LearningNodes for simplicity. We can make it more efficient later on.

I actually did a bit of implementation on this a while ago; please check it out and see if anything is useful to you: [https://github.com/apache/spark/compare/master...jkbradley:dt-pred-prob]
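For readers following the thread, the idea of one abstract stats object per LearningNode, usable for either classification or regression, can be sketched roughly as below. This is only an illustration under stated assumptions: NodeStats, LabelCounts, MomentStats, and LearningNodeSketch are hypothetical names, not the mllib ImpurityCalculator API or the code in the linked branch.

```scala
// Hypothetical sketch; none of these names come from Spark itself.
// Abstract per-node statistics that work for classification or regression.
sealed trait NodeStats extends Serializable {
  def count: Double   // number of (weighted) training instances at the node
  def predict: Double // point prediction for the node
}

// Classification: one raw count per label. Assumes a non-empty counts array.
final case class LabelCounts(counts: Array[Double]) extends NodeStats {
  def count: Double = counts.sum
  // Predict the label index with the largest count.
  def predict: Double = counts.indexOf(counts.max).toDouble
}

// Regression: sufficient statistics for the mean and variance of the labels.
final case class MomentStats(n: Double, sum: Double, sumSq: Double) extends NodeStats {
  def count: Double = n
  def predict: Double = if (n == 0.0) 0.0 else sum / n
  def variance: Double = if (n == 0.0) 0.0 else sumSq / n - predict * predict
}

// Every node, internal or leaf, carries the same abstract stats object.
final class LearningNodeSketch(val id: Int, val stats: NodeStats)
```

Because the node only sees NodeStats, a probabilistic prediction for regression could later be added to MomentStats without touching the node class.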
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644197#comment-14644197 ]

Yanbo Liang commented on SPARK-6885:

[~josephkb] Thanks for your comments. After a survey, I found that we have two candidate plans:

#1 We record the raw counts for each label in an Array[Double] at every LearningNode. That is, we need to implement a new class, PredictionStats, which stores the "counts" array:

    class PredictionStats(
        val predict: Double,
        val counts: Array[Double]) extends Serializable {
    }

Compared with the old Predict class, we just add more prediction statistics:

    class Predict(
        val predict: Double,
        val prob: Double = 0.0) extends Serializable {
    }

We would also need to make corresponding changes to InformationGainStats and calculatePredictionStats(), and may need a new InformationGainStats so that the old mllib code is not affected.

#2 We record the raw counts for each label only at the leaf nodes. That is, we need to implement two kinds of LearningNode (InternalLearningNode and LeafLearningNode).

I prefer #1 and look forward to your comments.
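As a rough illustration of plan #1, the sketch below shows how a single "counts" array could yield both the predicted label and a per-class probability, which the old Predict class (one prob value) cannot provide. The class shape follows the proposal above; the total and prob methods and the fromCounts helper are hypothetical additions for illustration only.

```scala
// Sketch of plan #1: keep the raw label counts and derive the rest on demand.
// `total`, `prob`, and `fromCounts` are hypothetical, not part of the proposal.
class PredictionStats(
    val predict: Double,
    val counts: Array[Double]) extends Serializable {

  // Total (weighted) number of instances behind this prediction.
  def total: Double = counts.sum

  // Estimated probability of one label; 0.0 when there are no instances.
  def prob(label: Int): Double =
    if (total == 0.0) 0.0 else counts(label) / total
}

object PredictionStats {
  // Build the stats from counts alone, predicting the most frequent label.
  def fromCounts(counts: Array[Double]): PredictionStats = {
    require(counts.nonEmpty, "need at least one class")
    val best = counts.indices.maxBy(i => counts(i))
    new PredictionStats(best.toDouble, counts)
  }
}
```

For example, `PredictionStats.fromCounts(Array(2.0, 6.0, 2.0))` predicts label 1, and `prob` can then be queried for any of the three labels.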
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643720#comment-14643720 ]

Joseph K. Bradley commented on SPARK-6885:

I was thinking it might be nice to return the raw counts for predictRaw since I could imagine users wanting to know counts in addition to probabilities. Would you be OK with changing rawPrediction to have counts?

numClasses can be set pretty easily from the data. The user should not need to specify numClasses; that should be available in the metadata (from StringIndexer), or from a scan over the data if no metadata are available. It should not be a Param.
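The two suggestions above can be sketched as follows, using plain arrays as a stand-in for Spark's Vector type. This is an assumption-laden illustration rather than the Spark implementation: the RawCountsSketch object and inferNumClasses are hypothetical, and raw2probabilityInPlace here only borrows the name used in this thread. Labels are assumed to be 0-based indices, as StringIndexer would produce.

```scala
// Hypothetical sketch on plain arrays; Spark's own methods operate on Vectors.
object RawCountsSketch {

  // Infer the number of classes by scanning the labels, assuming they are
  // 0-based indices (0.0, 1.0, ...). Used only when no metadata is available.
  def inferNumClasses(labels: Seq[Double]): Int = {
    require(labels.nonEmpty, "cannot infer numClasses from no data")
    labels.max.toInt + 1
  }

  // Treat the raw prediction as label counts and normalize it in place.
  // An all-zero vector is left unchanged to avoid dividing by zero.
  def raw2probabilityInPlace(raw: Array[Double]): Array[Double] = {
    val total = raw.sum
    if (total != 0.0) {
      var i = 0
      while (i < raw.length) {
        raw(i) /= total
        i += 1
      }
    }
    raw
  }
}
```

With counts as the raw prediction, users who want the counts read rawPrediction directly, and the probabilities are a cheap normalization on top.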
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642568#comment-14642568 ]

Yanbo Liang commented on SPARK-6885:

[~josephkb] I referred to the old DecisionTree API and the sklearn API (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L593) to implement the class-probability prediction function. I made DecisionTreeClassificationModel inherit from ProbabilisticClassificationModel, made predictRaw return the probabilities, and made raw2probabilityInPlace just return the rawPrediction. Any comments?

Another issue is the "numClasses" variable, which is not handled appropriately at present. I think numClasses should become one of the ClassifierParams so that it can be set by the Classifier. I will address this issue after collecting comments.
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642565#comment-14642565 ]

Apache Spark commented on SPARK-6885:

User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7694
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637316#comment-14637316 ]

Joseph K. Bradley commented on SPARK-6885:

We can resume this work. Do you think you'd have time to finish it by the end of this week? Sorry for the rush, but the code cutoff for the next release is in ~9 days. If you don't have time right now, I can send a patch instead. Thanks!
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615494#comment-14615494 ]

Joseph K. Bradley commented on SPARK-6885:

This is still blocked on other JIRAs, so it cannot be done yet. (Please see the parent JIRA for details.)
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614629#comment-14614629 ]

Venkata Vineel commented on SPARK-6885:

[~josephkb] Can I work on this? Can you please assign it to me?