[ 
https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187920#comment-13187920
 ] 

Wang Yue commented on MAHOUT-945:
---------------------------------

Hi,
   Partly yes. My point is that when we calculate the variance, we should 
divide by the size "n"; otherwise the result is wrong.
   When I asked why you did not divide by n, your answer was that dividing 
would cause rounding error. 
   I do not think that is a sound reason. :(

   If you stick with the total values, as in your original code, then the 
calculation for the first split is 25.63-(1.67+37.5) = -13.54
   How do I get this formula?
   Recall that your original code does not divide by the size n (here 110), so 
your code obtains a total variance of 25.63, a first-group variance of 1.67, 
and a second-group variance of 37.5, while the real variances are 
25.63/110 = 0.233, 1.67/10 = 0.167, and 37.5/100 = 0.375.
   Hope this clarifies.
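
   To make the arithmetic above easy to check, here is a minimal sketch in 
Java. It only reproduces the numbers quoted in this comment (totals 25.63, 
1.67, 37.5 and sizes 110, 10, 100). The class and method names are 
hypothetical and are not part of Mahout.

```java
public class SplitArithmetic {
    // Difference of un-normalized totals: parent total minus the sum of the
    // two child totals, as in 25.63 - (1.67 + 37.5).
    static double totalDifference(double parent, double left, double right) {
        return parent - (left + right);
    }

    // Per-sample value: a total divided by the number of samples it covers.
    static double perSample(double total, double n) {
        return total / n;
    }

    public static void main(String[] args) {
        // Numbers taken from the comment; group sizes are 110 = 10 + 100.
        System.out.printf("difference of totals: %.2f%n",
            totalDifference(25.63, 1.67, 37.5));
        System.out.printf("parent per sample:    %.3f%n", perSample(25.63, 110));
        System.out.printf("group 1 per sample:   %.3f%n", perSample(1.67, 10));
        System.out.printf("group 2 per sample:   %.3f%n", perSample(37.5, 100));
    }
}
```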
   
                
> The variance calculation of Random forest regression tree
> ---------------------------------------------------------
>
>                 Key: MAHOUT-945
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-945
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.6
>            Reporter: Wang Yue
>              Labels: Regressionsplit.java
>         Attachments: MAHOUT-945.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, Mukai
>   Thanks for your efforts in extending the RF to regression. However, I have a 
> doubt about your implementation in Regressionsplit.java. The 
> variance method is
> "
>  private static double variance(double[] s, double[] ss, double[] dataSize) {
>     double var = 0;
>     for (int i = 0; i < s.length; i++) {
>       if (dataSize[i] > 0) {
>         var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
>       }
>     }
>     return var;
>   }
> "
> I would expect the variance to be computed as something like 
> var += ss[i]/dataSize[i] - ((s[i] * s[i]) / (dataSize[i]*dataSize[i]));
> Please correct me if I am wrong. Thanks
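
The two formulas in the description above can be compared side by side. The 
following is a rough sketch, not the Mahout code itself: the first method 
mirrors the quoted variance method, the second applies the per-group division 
by dataSize[i] proposed in the description. The class name and both method 
names are made up for illustration.

```java
public class VarianceSketch {
    // Form used in the quoted method: for each group, ss - s^2 / n.
    // This is the sum of squared deviations from the group mean.
    static double sumOfSquaredDeviations(double[] s, double[] ss, double[] dataSize) {
        double var = 0;
        for (int i = 0; i < s.length; i++) {
            if (dataSize[i] > 0) {
                var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
            }
        }
        return var;
    }

    // Form proposed in the description: for each group, ss / n - (s / n)^2.
    // This is the population variance of the group.
    static double normalizedVariance(double[] s, double[] ss, double[] dataSize) {
        double var = 0;
        for (int i = 0; i < s.length; i++) {
            if (dataSize[i] > 0) {
                var += ss[i] / dataSize[i]
                    - ((s[i] * s[i]) / (dataSize[i] * dataSize[i]));
            }
        }
        return var;
    }

    public static void main(String[] args) {
        // One group holding the values 1, 2, 3: s = 6, ss = 1 + 4 + 9 = 14, n = 3.
        double[] s = {6};
        double[] ss = {14};
        double[] n = {3};
        System.out.println(sumOfSquaredDeviations(s, ss, n));
        System.out.println(normalizedVariance(s, ss, n));
    }
}
```

For a single group the two results differ exactly by the factor dataSize[i], 
so the choice only matters when groups of different sizes are summed or 
compared.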

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
