[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345208#comment-15345208 ]
Seth Hendrickson commented on SPARK-9478:
-----------------------------------------

Thanks for your timely feedback!

There are many broadly applicable use cases for sample weights in machine learning algorithms. In regression, it is common to use sample weights to account for changing variance in the data generation process. Sample weights can also be used in both classification and regression to give more weight to recent data points that may be more reflective of the data generation model. Handling imbalanced datasets with class weights can be seen as a special case of sample weights. Using upsampling/downsampling can cause unnecessary duplication of the input data and also makes it more difficult to assign arbitrary weights to samples. Furthermore, implementing weighted boosting algorithms such as AdaBoost and LogitBoost will not be possible without sample weights.

Scikit-learn does indeed support sample weights, as you can see [here|http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit], and in fact the algorithms simply convert class weights into sample weights.

With this in mind, I think we should support sample weights. We might also want to implement a mechanism to support class weights in the API so that users don't have to manually convert class weights to sample weights; we can open a new JIRA to discuss it. [There is an ongoing effort in MLlib to support instance weighting|https://issues.apache.org/jira/browse/SPARK-9610] in the various algorithms, so I think it is beneficial to add it to trees and forests.

> Add class weights to Random Forest
> ----------------------------------
>
>                 Key: SPARK-9478
>                 URL: https://issues.apache.org/jira/browse/SPARK-9478
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.4.1
>            Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class weights.
> Class weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold).
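
The comment above notes that class weights can be treated as a special case of sample weights, and that weighting avoids the data duplication caused by upsampling. A minimal Python sketch of that idea follows. The function names `class_to_sample_weights` and `weighted_gini` are hypothetical and are not part of the Spark or scikit-learn APIs, and Gini impurity is used only as a familiar example of a tree-splitting criterion.

```python
from collections import Counter


def class_to_sample_weights(labels, class_weights):
    """Expand a {label: weight} mapping into one weight per sample.

    Hypothetical helper for illustration only.
    """
    return [class_weights[y] for y in labels]


def weighted_gini(labels, weights):
    """Gini impurity where each sample counts in proportion to its weight."""
    total = sum(weights)
    per_class = Counter()
    for y, w in zip(labels, weights):
        per_class[y] += w
    return 1.0 - sum((w / total) ** 2 for w in per_class.values())


# Imbalanced toy data: three samples of class 0, one sample of class 1.
labels = [0, 0, 0, 1]
class_weights = {0: 1.0, 1: 3.0}

# Each sample inherits the weight of its class.
sample_weights = class_to_sample_weights(labels, class_weights)

# Upsampling the minority class 3x with unit weights, for comparison.
upsampled = [0, 0, 0, 1, 1, 1]

print(weighted_gini(labels, sample_weights))
print(weighted_gini(upsampled, [1.0] * len(upsampled)))
```

In this toy case both calls yield the same impurity, so the weighted version reproduces the effect of upsampling without copying any rows. It also accepts arbitrary non-integer weights, which duplication cannot express.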