[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644908#comment-14644908 ]
Joseph K. Bradley commented on SPARK-6885:
------------------------------------------

I'd prefer a variant of #1. I think it will be nice if LearningNodes store something like an ImpurityCalculator (from the mllib.tree implementation), which can store label counts for classification and other stats for regression. (That way, we can add probabilistic predictions for regression in a later PR.) So, rather than PredictionStats storing something specific to classification, it could store an abstract object usable for either classification or regression.

We can keep all of these representations as private API, so I'm OK with creating a new version of InformationGainStats if it's helpful. (I hope we can lazily migrate those classes to spark.ml anyways.)

As far as where we store stats, I'd prefer we store them at all LearningNodes for simplicity. We can make it more efficient later on.

I actually did a bit of implementation on this a while ago; please check it out and see if anything is useful to you: [https://github.com/apache/spark/compare/master...jkbradley:dt-pred-prob]

> Decision trees: predict class probabilities
> -------------------------------------------
>
>                 Key: SPARK-6885
>                 URL: https://issues.apache.org/jira/browse/SPARK-6885
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Yanbo Liang
>
> Under spark.ml, have DecisionTreeClassifier (currently being added) extend
> ProbabilisticClassifier.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
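
For illustration, the idea in the comment above (each node holding one abstract stats object that works for either classification or regression) could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: `NodeStats`, `ClassCounts`, `RegressionStats`, and `SketchNode` are hypothetical names invented here. They are not the actual `ImpurityCalculator`, `PredictionStats`, or `LearningNode` classes in Spark, and they are not taken from the linked branch.

```scala
// Hypothetical, simplified stand-ins for illustration only; these are NOT Spark classes.

// Abstract per-node statistics, usable for classification or regression.
sealed trait NodeStats {
  def count: Double    // total (weighted) number of training instances at the node
  def predict: Double  // point prediction derived from the stored statistics
}

// Classification: store per-class label counts, so probabilities can be derived later.
final case class ClassCounts(counts: Vector[Double]) extends NodeStats {
  require(counts.nonEmpty, "need at least one class")
  def count: Double = counts.sum
  // Predicted class = index of the largest count (first index wins on ties).
  def predict: Double = counts.indexOf(counts.max).toDouble
  // Normalized counts; returned unnormalized if the node saw no instances.
  def probabilities: Vector[Double] = {
    val total = count
    if (total == 0.0) counts else counts.map(_ / total)
  }
}

// Regression: store sufficient statistics (n, sum of labels, sum of squared labels).
final case class RegressionStats(n: Double, sum: Double, sumSq: Double) extends NodeStats {
  def count: Double = n
  def predict: Double = if (n == 0.0) 0.0 else sum / n
  // Population variance of the labels, clamped at zero against rounding error.
  def variance: Double =
    if (n == 0.0) 0.0 else math.max(sumSq / n - predict * predict, 0.0)
}

// Every node, internal or leaf, carries stats (simple first; optimize later).
final case class SketchNode(
    stats: NodeStats,
    left: Option[SketchNode] = None,
    right: Option[SketchNode] = None)
```

With this shape, a classifier can read `probabilities` from whichever node an instance lands in, and a regressor could later expose `variance` alongside its point prediction without changing the node type.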