[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary

2018-01-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314976#comment-16314976
 ] 

Joseph K. Bradley commented on SPARK-18348:
---

I'll remove the target, but please re-add as needed

> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (eg. Random Forest, GBT) it is 
> discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations, 
> loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language 
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
> method = "class")
>   n= 81
>   CP nsplit rel errorxerror  xstd
> 1 0.17647059  0 1.000 1.000 0.2155872
> 2 0.01960784  1 0.8235294 0.9411765 0.2107780
> 3 0.0100  4 0.7647059 1.0588235 0.2200975
> Variable importance
>  StartAge Number
> 64 24 12
> Node number 1: 81 observations,complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
> class counts:6417
>probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>   Start  < 8.5  to the right, improve=6.762330, (0 missing)
>   Number < 5.5  to the left,  improve=2.866795, (0 missing)
>   Age< 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>   Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
> class counts:56 6
>probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>   Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>   Age< 55   to the left,  improve=0.6848635, (0 missing)
>   Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>   Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>   Age< 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
> class counts: 811
>probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
> class counts:29 0
>probabilities: 1.000 0.000
> Node number 5: 33 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
> class counts:27 6
>probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>   Age< 55   to the left,  improve=1.2467530, (0 missing)
>   Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>   Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>   Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>   Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
> class counts:12 0
>probabilities: 1.000 0.000
> Node number 11: 21 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
> class counts:15 6
>probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>   Age< 111  to the right, improve=1.71428600, (0 missing)
>   Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>   Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
> class counts:12 2
>probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
> class counts: 3 4
>probabilities: 0.429 0.571
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary

2017-04-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987594#comment-15987594
 ] 

Joseph K. Bradley commented on SPARK-18348:
---

Retargeting since 2.2 has been cut

> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (eg. Random Forest, GBT) it is 
> discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations, 
> loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language 
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
> method = "class")
>   n= 81
>   CP nsplit rel errorxerror  xstd
> 1 0.17647059  0 1.000 1.000 0.2155872
> 2 0.01960784  1 0.8235294 0.9411765 0.2107780
> 3 0.0100  4 0.7647059 1.0588235 0.2200975
> Variable importance
>  StartAge Number
> 64 24 12
> Node number 1: 81 observations,complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
> class counts:6417
>probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>   Start  < 8.5  to the right, improve=6.762330, (0 missing)
>   Number < 5.5  to the left,  improve=2.866795, (0 missing)
>   Age< 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>   Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
> class counts:56 6
>probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>   Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>   Age< 55   to the left,  improve=0.6848635, (0 missing)
>   Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>   Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>   Age< 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
> class counts: 811
>probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
> class counts:29 0
>probabilities: 1.000 0.000
> Node number 5: 33 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
> class counts:27 6
>probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>   Age< 55   to the left,  improve=1.2467530, (0 missing)
>   Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>   Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>   Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>   Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
> class counts:12 0
>probabilities: 1.000 0.000
> Node number 11: 21 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
> class counts:15 6
>probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>   Age< 111  to the right, improve=1.71428600, (0 missing)
>   Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>   Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
> class counts:12 2
>probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
> class counts: 3 4
>probabilities: 0.429 0.571
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary

2017-01-18 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828278#comment-15828278
 ] 

Yanbo Liang commented on SPARK-18348:
-

[~josephkb] [~felixcheung] I can shepherd this. Thanks.

> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (eg. Random Forest, GBT) it is 
> discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations, 
> loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language 
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
> method = "class")
>   n= 81
>   CP nsplit rel errorxerror  xstd
> 1 0.17647059  0 1.000 1.000 0.2155872
> 2 0.01960784  1 0.8235294 0.9411765 0.2107780
> 3 0.0100  4 0.7647059 1.0588235 0.2200975
> Variable importance
>  StartAge Number
> 64 24 12
> Node number 1: 81 observations,complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
> class counts:6417
>probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>   Start  < 8.5  to the right, improve=6.762330, (0 missing)
>   Number < 5.5  to the left,  improve=2.866795, (0 missing)
>   Age< 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>   Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
> class counts:56 6
>probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>   Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>   Age< 55   to the left,  improve=0.6848635, (0 missing)
>   Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>   Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>   Age< 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
> class counts: 811
>probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
> class counts:29 0
>probabilities: 1.000 0.000
> Node number 5: 33 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
> class counts:27 6
>probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>   Age< 55   to the left,  improve=1.2467530, (0 missing)
>   Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>   Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>   Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>   Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
> class counts:12 0
>probabilities: 1.000 0.000
> Node number 11: 21 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
> class counts:15 6
>probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>   Age< 111  to the right, improve=1.71428600, (0 missing)
>   Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>   Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
> class counts:12 2
>probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
> class counts: 3 4
>probabilities: 0.429 0.571
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary

2017-01-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827474#comment-15827474
 ] 

Felix Cheung commented on SPARK-18348:
--

[~yanboliang] would you like to run with this? you had a few great points when 
we discussed this earlier.

> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (eg. Random Forest, GBT) it is 
> discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations, 
> loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language 
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
> method = "class")
>   n= 81
>   CP nsplit rel errorxerror  xstd
> 1 0.17647059  0 1.000 1.000 0.2155872
> 2 0.01960784  1 0.8235294 0.9411765 0.2107780
> 3 0.0100  4 0.7647059 1.0588235 0.2200975
> Variable importance
>  StartAge Number
> 64 24 12
> Node number 1: 81 observations,complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
> class counts:6417
>probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>   Start  < 8.5  to the right, improve=6.762330, (0 missing)
>   Number < 5.5  to the left,  improve=2.866795, (0 missing)
>   Age< 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>   Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
> class counts:56 6
>probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>   Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>   Age< 55   to the left,  improve=0.6848635, (0 missing)
>   Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>   Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>   Age< 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
> class counts: 811
>probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
> class counts:29 0
>probabilities: 1.000 0.000
> Node number 5: 33 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
> class counts:27 6
>probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>   Age< 55   to the left,  improve=1.2467530, (0 missing)
>   Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>   Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>   Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>   Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
> class counts:12 0
>probabilities: 1.000 0.000
> Node number 11: 21 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
> class counts:15 6
>probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>   Age< 111  to the right, improve=1.71428600, (0 missing)
>   Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>   Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
> class counts:12 2
>probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
> class counts: 3 4
>probabilities: 0.429 0.571
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827023#comment-15827023
 ] 

Joseph K. Bradley commented on SPARK-18348:
---

I'm doing a general pass to enforce the Shepherd requirement from the roadmap 
process in [SPARK-18813].  Note that the process is a new proposal and can be 
updated as needed.

Is someone interested in shepherding this issue, or shall I remove the target 
version?


> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (eg. Random Forest, GBT) it is 
> discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations, 
> loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language 
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
> method = "class")
>   n= 81
>   CP nsplit rel errorxerror  xstd
> 1 0.17647059  0 1.000 1.000 0.2155872
> 2 0.01960784  1 0.8235294 0.9411765 0.2107780
> 3 0.0100  4 0.7647059 1.0588235 0.2200975
> Variable importance
>  StartAge Number
> 64 24 12
> Node number 1: 81 observations,complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
> class counts:6417
>probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>   Start  < 8.5  to the right, improve=6.762330, (0 missing)
>   Number < 5.5  to the left,  improve=2.866795, (0 missing)
>   Age< 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>   Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
> class counts:56 6
>probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>   Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>   Age< 55   to the left,  improve=0.6848635, (0 missing)
>   Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>   Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>   Age< 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
> class counts: 811
>probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
> class counts:29 0
>probabilities: 1.000 0.000
> Node number 5: 33 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
> class counts:27 6
>probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>   Age< 55   to the left,  improve=1.2467530, (0 missing)
>   Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>   Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>   Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>   Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
> class counts:12 0
>probabilities: 1.000 0.000
> Node number 11: 21 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
> class counts:15 6
>probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>   Age< 111  to the right, improve=1.71428600, (0 missing)
>   Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>   Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
> class counts:12 2
>probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
> class counts: 3 4
>probabilities: 0.429 0.571
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary

2016-11-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648312#comment-15648312
 ] 

Felix Cheung commented on SPARK-18348:
--

yes, thanks

> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (eg. Random Forest, GBT) it is 
> discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations, 
> loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language 
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
> method = "class")
>   n= 81
>   CP nsplit rel errorxerror  xstd
> 1 0.17647059  0 1.000 1.000 0.2155872
> 2 0.01960784  1 0.8235294 0.9411765 0.2107780
> 3 0.0100  4 0.7647059 1.0588235 0.2200975
> Variable importance
>  StartAge Number
> 64 24 12
> Node number 1: 81 observations,complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
> class counts:6417
>probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>   Start  < 8.5  to the right, improve=6.762330, (0 missing)
>   Number < 5.5  to the left,  improve=2.866795, (0 missing)
>   Age< 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>   Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
> class counts:56 6
>probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>   Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>   Age< 55   to the left,  improve=0.6848635, (0 missing)
>   Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>   Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>   Age< 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
> class counts: 811
>probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
> class counts:29 0
>probabilities: 1.000 0.000
> Node number 5: 33 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
> class counts:27 6
>probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>   Age< 55   to the left,  improve=1.2467530, (0 missing)
>   Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>   Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>   Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>   Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
> class counts:12 0
>probabilities: 1.000 0.000
> Node number 11: 21 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
> class counts:15 6
>probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>   Age< 111  to the right, improve=1.71428600, (0 missing)
>   Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>   Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
> class counts:12 2
>probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
> class counts: 3 4
>probabilities: 0.429 0.571
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary

2016-11-08 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15647574#comment-15647574
 ] 

Yanbo Liang commented on SPARK-18348:
-

Should the type of this task as improvement rather than bug? I converted it to 
improvement.

> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (eg. Random Forest, GBT) it is 
> discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations, 
> loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language 
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
> method = "class")
>   n= 81
>   CP nsplit rel errorxerror  xstd
> 1 0.17647059  0 1.000 1.000 0.2155872
> 2 0.01960784  1 0.8235294 0.9411765 0.2107780
> 3 0.0100  4 0.7647059 1.0588235 0.2200975
> Variable importance
>  StartAge Number
> 64 24 12
> Node number 1: 81 observations,complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
> class counts:6417
>probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>   Start  < 8.5  to the right, improve=6.762330, (0 missing)
>   Number < 5.5  to the left,  improve=2.866795, (0 missing)
>   Age< 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>   Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
> class counts:56 6
>probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>   Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>   Age< 55   to the left,  improve=0.6848635, (0 missing)
>   Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>   Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>   Age< 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
> class counts: 811
>probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
> class counts:29 0
>probabilities: 1.000 0.000
> Node number 5: 33 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
> class counts:27 6
>probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>   Age< 55   to the left,  improve=1.2467530, (0 missing)
>   Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>   Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>   Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>   Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
> class counts:12 0
>probabilities: 1.000 0.000
> Node number 11: 21 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
> class counts:15 6
>probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>   Age< 111  to the right, improve=1.71428600, (0 missing)
>   Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>   Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
> class counts:12 2
>probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
> class counts: 3 4
>probabilities: 0.429 0.571
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org