[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15344950#comment-15344950
 ] 

Seth Hendrickson commented on SPARK-9478:
-----------------------------------------

I think there has been some confusion regarding this JIRA. [~pcrenshaw], 
please correct me if I'm wrong, but the JIRA is really about adding a 
mechanism to handle imbalanced classification datasets. This could be done 
through class weighting or through instance weighting; the implementation is 
up for debate. 

The confusion has likely grown because an initial PR used instance weighting, 
and there is now a second PR that adds class weighting. I think instance 
weighting is the best approach here: it lets users handle imbalanced outcome 
classes in their data, and it also supports instance weighting in general, 
which has a broad range of use cases and is not specific to classification. 
It is also how the other ML algorithms have handled this so far, and it would 
allow forests and trees to conform to the same API as Logistic/Linear 
regression, for example. 
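
To illustrate why instance weighting covers the class-weighting case, here is a 
minimal sketch in plain Python (not Spark code, and not taken from either PR). 
The function names are hypothetical, and the "balanced" formula n / (k * count) 
is just one common heuristic that I am assuming for the example: 

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n / (k * count), so rarer classes weigh more.

    Assumed heuristic for illustration only; any class -> weight map works.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def to_instance_weights(labels, class_weights):
    """Express class weights as one weight per training row."""
    return [class_weights[y] for y in labels]


# Toy imbalanced dataset: eight rows of class 0, two rows of class 1.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
class_weights = balanced_class_weights(labels)
instance_weights = to_instance_weights(labels, class_weights)
```

Any class weighting can be written as a per-row weight column this way, while 
the reverse is not true: per-row weights can also encode things like sample 
importance that have nothing to do with the label. 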

I vote to change this JIRA title to "Add instance weights for Random Forest and 
Decision Trees" and proceed accordingly, but I'm open to other opinions. If we 
want to pursue class weights, we can do it in a separate JIRA. And again, I 
have a PR ready for this, which I have not submitted because (a) there are 
other blocking issues and (b) Spark 2.0 QA takes review precedence for the 
time being. 

I look forward to others' thoughts. Ping [~josephkb] (I cannot ping n-triple-a 
because I don't know the JIRA username).

> Add class weights to Random Forest
> ----------------------------------
>
>                 Key: SPARK-9478
>                 URL: https://issues.apache.org/jira/browse/SPARK-9478
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.4.1
>            Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
