[ https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yanbo Liang updated SPARK-18348:
--------------------------------
    Issue Type: Improvement  (was: Bug)

> Improve tree ensemble model summary
> -----------------------------------
>
>                 Key: SPARK-18348
>                 URL: https://issues.apache.org/jira/browse/SPARK-18348
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (e.g. Random Forest, GBT), it was
> discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations,
> loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
>     method = "class")
>   n= 81
>           CP nsplit rel error    xerror      xstd
> 1 0.17647059      0 1.0000000 1.0000000 0.2155872
> 2 0.01960784      1 0.8235294 0.9411765 0.2107780
> 3 0.01000000      4 0.7647059 1.0588235 0.2200975
> Variable importance
>  Start    Age Number
>     64     24     12
> Node number 1: 81 observations,    complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
>     class counts:    64    17
>    probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>       Start  < 8.5  to the right, improve=6.762330, (0 missing)
>       Number < 5.5  to the left,  improve=2.866795, (0 missing)
>       Age    < 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>       Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
>     class counts:    56     6
>    probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>       Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>       Age    < 55   to the left,  improve=0.6848635, (0 missing)
>       Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>       Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>       Age    < 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
>     class counts:     8    11
>    probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
>     class counts:    29     0
>    probabilities: 1.000 0.000
> Node number 5: 33 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
>     class counts:    27     6
>    probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>       Age    < 55   to the left,  improve=1.2467530, (0 missing)
>       Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>       Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>       Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>       Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
>     class counts:    12     0
>    probabilities: 1.000 0.000
> Node number 11: 21 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
>     class counts:    15     6
>    probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>       Age    < 111  to the right, improve=1.71428600, (0 missing)
>       Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>       Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
>     class counts:    12     2
>    probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
>     class counts:     3     4
>    probabilities: 0.429 0.571
> {code}

--
This message was sent by
Atlassian JIRA (v6.3.4#6332)
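
For readers following the request above, here is a minimal sketch of how the per-node fields in the quoted rpart output (observations, predicted class, expected loss, P(node), probabilities) can be derived from a node's class counts. It assumes 0/1 misclassification loss and priors proportional to the data, which is what the quoted numbers are consistent with. The names NodeSummary, summarize_node and format_node are hypothetical and are not part of Spark or SparkR; this is an illustration of the arithmetic, not a proposed implementation.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class NodeSummary:
    """Hypothetical per-node summary record (not an existing Spark class)."""
    node_id: int
    observations: int
    predicted_class: str
    expected_loss: float
    node_probability: float
    probabilities: Tuple[float, ...]


def summarize_node(node_id: int,
                   class_counts: Sequence[int],
                   labels: Sequence[str],
                   total_observations: int) -> NodeSummary:
    """Derive summary fields from class counts, assuming 0/1 loss."""
    observations = sum(class_counts)
    if observations == 0 or total_observations <= 0:
        raise ValueError("node and training set must be non-empty")
    # The majority class is the prediction; ties go to the first class.
    best = max(range(len(class_counts)), key=lambda i: class_counts[i])
    return NodeSummary(
        node_id=node_id,
        observations=observations,
        predicted_class=labels[best],
        # Fraction of observations in the node that are misclassified.
        expected_loss=(observations - class_counts[best]) / observations,
        # Share of all training observations that reach this node.
        node_probability=observations / total_observations,
        probabilities=tuple(c / observations for c in class_counts),
    )


def format_node(s: NodeSummary) -> str:
    """Render a summary in a layout loosely modeled on the rpart output."""
    probs = " ".join(f"{p:.3f}" for p in s.probabilities)
    return (
        f"Node number {s.node_id}: {s.observations} observations\n"
        f"  predicted class={s.predicted_class}  "
        f"expected loss={s.expected_loss:.7g}  "
        f"P(node) ={s.node_probability:.7g}\n"
        f"   probabilities: {probs}"
    )


if __name__ == "__main__":
    # Class counts for nodes 1 and 3 from the quoted kyphosis example (n = 81).
    labels = ["absent", "present"]
    for node_id, counts in [(1, [64, 17]), (3, [8, 11])]:
        print(format_node(summarize_node(node_id, counts, labels, 81)))
```

A language binding could build records like these once per tree and let each front end (R, Python, Scala) decide how to print them, which is the kind of sharing the issue asks for.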