[jira] [Commented] (SPARK-9478) Add class weights to Random Forest

2016-06-22 Thread Yuewei Na (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345447#comment-15345447
 ] 

Yuewei Na commented on SPARK-9478:
--

Hi [~sethah]. Actually, the code in my PR has been used at our company for some
time, and we recently decided to open-source it. We wrote this implementation
because the current version has no class weight support and we have a practical
need for it. Compared to sample weights, our version uses less memory because it
does not need an extra column to store sample weights.
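
To illustrate the memory point, here is a minimal sketch (not the code from the
PR) of how class weights can be applied to per-node class counts when computing
impurity, so that no per-row weight column is needed. The function name
weighted_gini and the dict-based weight format are assumptions made for this
example only.

```python
from typing import Mapping, Sequence


def weighted_gini(class_counts: Sequence[int],
                  class_weights: Mapping[int, float]) -> float:
    """Gini impurity with each class count scaled by its class weight.

    class_counts[i] is the number of rows of class i in a node.
    Classes missing from class_weights default to a weight of 1.0.
    """
    weighted = [count * class_weights.get(label, 1.0)
                for label, count in enumerate(class_counts)]
    total = sum(weighted)
    if total == 0:
        return 0.0  # empty node: treat as pure
    return 1.0 - sum((w / total) ** 2 for w in weighted)


counts = [90, 10]  # imbalanced node: 90 rows of class 0, 10 of class 1
unweighted = weighted_gini(counts, {})
rebalanced = weighted_gini(counts, {0: 1.0, 1: 9.0})
print(unweighted, rebalanced)
```

Because the weight depends only on the label, the weight map has one entry per
class, whatever the number of rows.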

At the same time, I browsed the APIs and implementations of the ensemble
methods in scikit-learn. It's true that class weights are integrated with
sample weights there. Given that, and the need for sample weights in various
other models, I agree that functionality that supports sample weights is the
better choice. So now I have some thoughts on this problem:
  1. I agree with you on implementing a mechanism to support class weights. I
think it will reduce the effort users need to achieve their goal.
  2. Since my PR is a lightweight version that has been tested and used at our
company for some time, we could review and merge it to the master branch first
to make it available to users who need it. We can remove it once nothing blocks
the instance weight version, while preserving the same interface for setting
the class weights. We could also create a new JIRA that separates the problem
'adding class weights' from the problem 'adding instance weights'. At the very
least, the title of the current JIRA should be changed or a new JIRA should be
created.
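
The relationship between the two kinds of weights can be sketched as follows.
This is only an illustration of why a class-weight interface could later be
backed by instance weights; the helper names class_to_instance_weights and
weighted_class_totals are hypothetical and not part of Spark or of the PR.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Sequence


def class_to_instance_weights(labels: Sequence[int],
                              class_weights: Mapping[int, float],
                              default: float = 1.0) -> List[float]:
    """Expand a per-class weight map into one weight per row."""
    return [class_weights.get(y, default) for y in labels]


def weighted_class_totals(labels: Sequence[int],
                          instance_weights: Sequence[float]) -> Dict[int, float]:
    """Sum the instance weights of the rows in each class."""
    totals: Dict[int, float] = defaultdict(float)
    for y, w in zip(labels, instance_weights):
        totals[y] += w
    return dict(totals)


labels = [0, 0, 0, 1]
weights = class_to_instance_weights(labels, {1: 3.0})
totals = weighted_class_totals(labels, weights)
print(weights)  # one weight per row
print(totals)   # both classes now carry the same total weight
```

Class weights are the special case where every row of a class shares one
weight, so the user-facing setter could stay the same if the internals switch
to instance weights.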

 

> Add class weights to Random Forest
> --
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9478) Add class weights to Random Forest

2016-06-22 Thread Yuewei Na (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345105#comment-15345105
 ] 

Yuewei Na edited comment on SPARK-9478 at 6/22/16 8:28 PM:
---

Hi [~sethah], thanks a lot for your comment on my PR and your continued
attention to this problem. Sorry for not commenting before I made this PR. As
you said, the main reason I made another PR is the title of this JIRA.

I implemented this class weight version instead of sticking to instance weights
because:
1. Existing implementations in other languages or packages, e.g. rpart in R and
sklearn in Python, all support class weights instead of instance weights.
Indeed, instance weights also make weighting possible in regression. But the
main application of handling imbalanced datasets is classification. If one does
need such a feature, it could be done by downsampling or upsampling the whole
dataset. In the materials that I've read, including the book 'Elements of
Statistical Learning', rpart's documentation
(https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
and some professors' lecture slides, I've never seen a use case for handling
imbalanced datasets in regression problems using Random Forest. I would be very
happy if someone could tell me under what circumstances it's needed.

2. As you commented in the first PR, the instance weight implementation causes
trouble for the 'minInstancesPerNode' feature. The class weight implementation
has no such issue, which makes the code more stable because very few internal
modifications are needed.
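
One possible reading of point 2, sketched under assumptions: with class
weights, a size check like 'minInstancesPerNode' can keep using raw row counts,
while only the impurity statistics use weighted counts. The helpers
split_is_valid and weighted_total are hypothetical and show the distinction
only; they are not taken from either PR.

```python
from typing import Mapping, Sequence


def split_is_valid(left_counts: Sequence[int],
                   right_counts: Sequence[int],
                   min_instances_per_node: int) -> bool:
    """Check the node-size constraint on raw (unweighted) row counts."""
    return (sum(left_counts) >= min_instances_per_node
            and sum(right_counts) >= min_instances_per_node)


def weighted_total(class_counts: Sequence[int],
                   class_weights: Mapping[int, float]) -> float:
    """Total weight of a node when each class count is scaled by its weight."""
    return sum(count * class_weights.get(label, 1.0)
               for label, count in enumerate(class_counts))


class_weights = {1: 9.0}          # up-weight the minority class
left, right = [3, 1], [87, 9]     # per-class row counts in each child

# The left child holds 4 rows but a weighted total of 12.0, so a check on
# weighted totals and a check on raw counts would disagree for a minimum of 5.
print(sum(left), weighted_total(left, class_weights))
print(split_is_valid(left, right, min_instances_per_node=5))
```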





[jira] [Commented] (SPARK-9478) Add class weights to Random Forest

2016-06-22 Thread Yuewei Na (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345105#comment-15345105
 ] 

Yuewei Na commented on SPARK-9478:
--

Hi [~sethah], thanks a lot for your comment on my PR and your continued
attention to this problem. Sorry for not commenting before I made this PR. As
you said, the main reason I made another PR is the title of this JIRA.

I implemented this class weight version instead of sticking to instance weights
because:
1. Existing implementations in other languages or packages, e.g. rpart in R and
sklearn in Python, all support class weights instead of instance weights.
Indeed, instance weights also make weighting possible in regression. But the
main application of handling imbalanced datasets is classification. If one does
need such a feature, it could be done by downsampling or upsampling the whole
dataset. In the materials that I've read, including the book 'Elements of
Statistical Learning', rpart's documentation
(https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
and some professors' lecture slides, I've never seen a use case for handling
imbalanced datasets in regression problems using Random Forest. I would be very
happy if someone could tell me when it's needed.

2. As you commented in the first PR, the instance weight implementation causes
trouble for the 'minInstancesPerNode' feature. The class weight implementation
has no such issue, which makes the code more stable because very few internal
modifications are needed.




[jira] [Created] (SPARK-15936) CLONE - Add class weights to Random Forest

2016-06-13 Thread Yuewei Na (JIRA)
Yuewei Na created SPARK-15936:
-

 Summary: CLONE - Add class weights to Random Forest
 Key: SPARK-15936
 URL: https://issues.apache.org/jira/browse/SPARK-15936
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.4.1
Reporter: Yuewei Na


Currently, this implementation of random forest does not support class weights. 
Class weights are important when there is imbalanced training data or the 
evaluation metric of a classifier is imbalanced (e.g. true positive rate at 
some false positive threshold). 


