[ 
https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187050#comment-13187050
 ] 

Wang Yue commented on MAHOUT-945:
---------------------------------

Hi,
I still doubt that it is correct to skip the division by n because of 
rounding error. 

Please correct me if I am wrong.

Below is a counterexample.

Generate 110 numbers in two groups.
The first group has 10 numbers:
> t1=rnorm(10,1,1)
> var(t1)
[1] 1.667472

The second group has 100 numbers:
> t2=rnorm(100,1,0.5)
> var(t2)
[1] 0.1928758

The overall variance of all 110 numbers:
> t=c(t1,t2)
> var(t)
[1] 0.2339202

The split above is the first way of splitting the 110 numbers.

First group (t1) with 10 numbers: variance 1.66; your calculated variance is 16.6
Second group (t2) with 100 numbers: variance 0.19; your calculated variance is 19
Overall: variance 0.233; your calculated variance is 25.63

Variance reduction by your calculation = 25.63-(16.6+19) = ~-10
Real variance reduction = 0.233-(1.66+0.19) = -1.62

Second split:
> tt2=t2[1:10]
> var(tt2)
[1] 0.1673325
> tt1=c(t2[11:100],t1)
> var(tt1)
[1] 0.3757684

First group (tt2) with 10 numbers: variance 0.167; your calculated variance is 1.67
Second group (tt1) with 100 numbers: variance 0.375; your calculated variance is 37.5
Overall: variance 0.233; your calculated variance is 25.63

Variance reduction by your calculation = 25.63-(1.67+37.5) = ~-13
Real variance reduction = 0.233-(0.167+0.375) = ~-0.31

Your program will choose the first split, while by the real variance the 
better split may be the second one.
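
The disagreement described above can be reproduced on a small hand-made data 
set. The following Java sketch is only an illustration: the class name 
SplitCriteria, its methods and the ten data values are assumptions made here, 
not Mahout code and not the R data above. It compares the reduction measured 
with sums of squared deviations (variance times group size) against the 
reduction measured with plain per-group variances.

```java
final class SplitCriteria {

  // Sum of squared deviations from the mean, computed as ss - s*s/n.
  static double sumSquaredDev(double[] x) {
    double s = 0;
    double ss = 0;
    for (double v : x) {
      s += v;
      ss += v * v;
    }
    return x.length > 0 ? ss - (s * s) / x.length : 0;
  }

  // Population variance: the sum of squared deviations divided by n.
  static double variance(double[] x) {
    return x.length > 0 ? sumSquaredDev(x) / x.length : 0;
  }

  // Joins all groups of a split back into one array.
  static double[] concat(double[][] groups) {
    int n = 0;
    for (double[] g : groups) {
      n += g.length;
    }
    double[] all = new double[n];
    int k = 0;
    for (double[] g : groups) {
      for (double v : g) {
        all[k++] = v;
      }
    }
    return all;
  }

  // Overall sum of squared deviations minus the per-group sums.
  static double reductionBySumSq(double[][] groups) {
    double within = 0;
    for (double[] g : groups) {
      within += sumSquaredDev(g);
    }
    return sumSquaredDev(concat(groups)) - within;
  }

  // Overall variance minus the unweighted sum of per-group variances.
  static double reductionByVariance(double[][] groups) {
    double within = 0;
    for (double[] g : groups) {
      within += variance(g);
    }
    return variance(concat(groups)) - within;
  }

  public static void main(String[] args) {
    // Hypothetical data: the same ten values split in two different ways.
    double[][] split1 = {{-2, 2}, {2, 2, 2, 2, 2, 2, 2, 2}};
    double[][] split2 = {{2, 2}, {-2, 2, 2, 2, 2, 2, 2, 2}};
    System.out.println("sum-of-squares reduction: split1="
        + reductionBySumSq(split1) + " split2=" + reductionBySumSq(split2));
    System.out.println("plain-variance reduction: split1="
        + reductionByVariance(split1) + " split2=" + reductionByVariance(split2));
  }
}
```

With these values the sum-of-squares reduction is about 6.4 for split 1 and 
about 0.4 for split 2, while the plain-variance reduction is about -2.56 and 
-0.31, so the two criteria pick different splits. Which criterion is 
appropriate for the regression tree is the question under discussion here.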


                
> The variance calculation of Random forest regression tree
> ---------------------------------------------------------
>
>                 Key: MAHOUT-945
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-945
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.6
>            Reporter: Wang Yue
>              Labels: Regressionsplit.java
>         Attachments: MAHOUT-945.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, Mukai
>   Thanks for your efforts in extending the RF to regression. However, I have a 
> doubt about your implementation in Regressionsplit.java. The 
> variance method is
> "
>  private static double variance(double[] s, double[] ss, double[] dataSize) {
>     double var = 0;
>     for (int i = 0; i < s.length; i++) {
>       if (dataSize[i] > 0) {
>         var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
>       }
>     }
>     return var;
>   }
> "
> whereas I think the variance should be something like 
> var += ss[i]/dataSize[i] - ((s[i] * s[i]) / (dataSize[i]*dataSize[i]));
> Please correct me if I am wrong. Thanks
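
To make the difference between the two formulas in the issue description 
concrete, here is a small Java sketch that places the quoted method next to 
the proposed variant. The class name VarianceFormulas, the method names and 
the sample group {1, 2, 3, 6} are hypothetical and chosen only for 
illustration.

```java
final class VarianceFormulas {

  // Formula quoted in the issue: per-group ss - s^2/n, summed over groups.
  static double quoted(double[] s, double[] ss, double[] dataSize) {
    double var = 0;
    for (int i = 0; i < s.length; i++) {
      if (dataSize[i] > 0) {
        var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
      }
    }
    return var;
  }

  // Variant proposed in the issue: per-group ss/n - (s/n)^2, summed over groups.
  static double proposed(double[] s, double[] ss, double[] dataSize) {
    double var = 0;
    for (int i = 0; i < s.length; i++) {
      if (dataSize[i] > 0) {
        var += ss[i] / dataSize[i]
            - ((s[i] * s[i]) / (dataSize[i] * dataSize[i]));
      }
    }
    return var;
  }

  public static void main(String[] args) {
    // Hypothetical single group {1, 2, 3, 6}: sum s = 12, sum of squares ss = 50.
    double[] s = {12};
    double[] ss = {50};
    double[] dataSize = {4};
    System.out.println(quoted(s, ss, dataSize));   // 50 - 144/4 = 14.0
    System.out.println(proposed(s, ss, dataSize)); // 50/4 - 144/16 = 3.5
  }
}
```

For a single group the quoted formula returns n times the population 
variance (here 14 = 4 x 3.5). R's var() divides by n - 1 rather than n, so 
multiplying its output by the group size only approximates the quoted 
quantity.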


        
