[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870490#comment-15870490
 ] 

Camilo Lamus commented on SPARK-9478:
-------------------------------------

[~sethah] It is very exciting that you guys are working on adding a weighted 
version of the random forest. I am really looking forward to using it in the 
Spark ML RF and other algorithms. As [~josephkb] mentioned, adding weights to 
data points (samples/instances) has a myriad of applications in data analysis.

I have a question about the way you are thinking of using the weights. Are you 
planning to use the weights both in the bootstrap sampling step and in growing 
the trees? Using them in both steps might make the weights overly “important”.

If you are using them in the tree-growing process, are you doing something like 
what is shown on slide 4 here 
(http://www.stat.cmu.edu/~ryantibs/datamining/lectures/25-boost.pdf)?
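To make the question concrete, here is a hypothetical sketch (not Spark code, just an illustration of the idea on those slides) of what "using the weights in tree growing" could look like: each instance contributes its weight, rather than a count of 1, to the class proportions used in the node impurity.

```python
# Hypothetical weighted Gini impurity: class proportions are computed from
# summed instance weights instead of raw counts. With all weights equal to 1
# this reduces to the ordinary (unweighted) Gini impurity.

def weighted_gini(labels, weights):
    """Gini impurity where each instance contributes its weight, not 1."""
    total = sum(weights)
    if total == 0:
        return 0.0
    class_weight = {}
    for y, w in zip(labels, weights):
        class_weight[y] = class_weight.get(y, 0.0) + w
    return 1.0 - sum((cw / total) ** 2 for cw in class_weight.values())

print(weighted_gini([0, 0, 1, 1], [1, 1, 1, 1]))  # 0.5, same as unweighted
print(weighted_gini([0, 0, 1, 1], [3, 3, 1, 1]))  # 0.375, skewed toward class 0
```

A split criterion would then compare weighted impurities of the child nodes, weighted by the total child weight rather than the child count.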

In the case where you would use the weights in constructing the bootstrap 
samples, as you mentioned, the marginal distribution of the number of times 
each data point is selected in a bootstrap sample is binomial. However, the 
joint distribution of the counts is multinomial. Specifically, if you draw N 
samples with replacement from the original N data points, selecting each with 
probability p_i = 1/N, the joint distribution is Multinomial(N, p_i = 1/N, 
i=1,2,…,N), and this is not the same as drawing independently N times from 
Binomial(N, 1/N). For one thing, you might end up with more or fewer than N 
samples. Regarding the Poisson approximation, I think this might be more 
problematic, since I believe it requires that no single selection probability 
dominates, i.e., each p_i must be small (see here: 
http://www.jstor.org/stable/3314676?seq=1#page_scan_tab_contents). This is a 
theoretical issue, which might not matter in practice. But who knows, it 
might. And after all, it might just be better to get the counts from 
Multinomial(N, p_i = w_i / sum(w_j)).
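The difference is easy to see in a small simulation (hypothetical, not Spark code): the exact weighted bootstrap draws joint counts from Multinomial(N, p_i = w_i / sum(w_j)), which always sum to exactly N, while independent Poisson(N * p_i) draws only sum to N in expectation.

```python
import random

def multinomial_counts(n, probs, rng):
    """Draw n items with replacement according to probs; return the counts.

    The resulting count vector is one Multinomial(n, probs) sample, so its
    entries always sum to exactly n.
    """
    counts = [0] * len(probs)
    for _ in range(n):
        r = rng.random()
        acc = 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                counts[i] += 1
                break
        else:  # guard against floating-point round-off in the cumulative sum
            counts[-1] += 1
    return counts

def poisson_draw(lam, rng):
    """One Poisson(lam) sample via Knuth's algorithm (fine for small lam)."""
    limit, k, p = pow(2.718281828459045, -lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(42)
weights = [3.0, 1.0, 1.0, 1.0]  # example instance weights
n = len(weights)
probs = [w / sum(weights) for w in weights]

m = multinomial_counts(n, probs, rng)
pois = [poisson_draw(n * pi, rng) for pi in probs]
print(sum(m))     # always exactly N = 4
print(sum(pois))  # varies around 4: the bootstrap sample size is random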

Either way, if the Poisson approximation is good enough, it does make more 
sense to use what you suggest at the end, which is to sample from 
Poisson(lambda_i = N * w_i / sum(w_j)). Sampling from Poisson(1) and then 
multiplying by N * w_i / sum(w_j) can worsen the Poisson approximation to the 
binomial, since the variance of the multiplied version is lambda_i^2, not 
lambda_i as it would be for a Poisson random variable.
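The variance mismatch can be checked directly: if X ~ Poisson(1) is scaled by c = lambda_i, the result c*X has mean c but variance c^2 * Var(X) = c^2, whereas Poisson(c) itself has variance c. A quick stdlib-only simulation (hypothetical, for illustration):

```python
import random

def poisson_draw(lam, rng):
    """One Poisson(lam) sample via Knuth's algorithm (fine for small lam)."""
    limit, k, p = pow(2.718281828459045, -lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def var(xs):
    """Plain (biased) sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(7)
c = 3.0  # an example lambda_i = N * w_i / sum(w_j)
trials = 200_000

scaled = [c * poisson_draw(1.0, rng) for _ in range(trials)]  # c * Poisson(1)
direct = [poisson_draw(c, rng) for _ in range(trials)]        # Poisson(c)

print(round(var(scaled)))  # ~ 9, i.e. c^2: variance is inflated by a factor c
print(round(var(direct)))  # ~ 3, i.e. c: matches a genuine Poisson rv
```

Both sequences have mean c, but only the direct Poisson(lambda_i) draws have the variance the binomial approximation calls for.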


> Add sample weights to Random Forest
> -----------------------------------
>
>                 Key: SPARK-9478
>                 URL: https://issues.apache.org/jira/browse/SPARK-9478
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.1
>            Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org