[ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187920#comment-13187920 ]
Wang Yue commented on MAHOUT-945:
---------------------------------

Hi,

Partly yes. My point is that when we calculate the variance, we should divide by the size n; otherwise the result is wrong. When I asked why you did not divide by n, your answer was that it would cause rounding error. I do not find that very convincing. :(

If you stick with the total values, as in your original code, then the calculation for the first split is

    25.63 - (1.67 + 37.5) = -13.54

How do I get this formula? Recall that your original code does not divide by the size n (here n = 110), so it obtains a total variance of 25.63, a first-group variance of 1.67, and a second-group variance of 37.5, while the real variances are 25.63/110 = 0.233, 1.67/10 = 0.167, and 37.5/100 = 0.375.

Hope this clarifies.

> The variance calculation of Random forest regression tree
> ---------------------------------------------------------
>
>                 Key: MAHOUT-945
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-945
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.6
>            Reporter: Wang Yue
>              Labels: Regressionsplit.java
>         Attachments: MAHOUT-945.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, Mukai
> Thanks for your efforts in extending the RF to regression. However, I have a
> doubt about your implementation in Regressionsplit.java. The variance method is
> "
> private static double variance(double[] s, double[] ss, double[] dataSize) {
>   double var = 0;
>   for (int i = 0; i < s.length; i++) {
>     if (dataSize[i] > 0) {
>       var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
>     }
>   }
>   return var;
> }
> "
> while the variance, as I understand it, should be something like
>   var += ss[i]/dataSize[i] - ((s[i] * s[i]) / (dataSize[i]*dataSize[i]));
> Please correct me if I am wrong. Thanks
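The two formulas discussed in this issue can be compared side by side with a small Java sketch. This is an illustration only, not the Mahout code: the class name VarianceSketch, both method names, and the sample values {1, 2, 3, 4} are hypothetical. It assumes s is the sum of the values, ss the sum of their squares, and n the number of values, as in the quoted variance method.

```java
final class VarianceSketch {

    // Per-group term as written in the quoted variance method:
    // the sum of squared deviations from the mean, NOT divided by n.
    static double sumOfSquaredDeviations(double s, double ss, double n) {
        return ss - (s * s) / n;
    }

    // Per-group term as proposed in the issue description:
    // the population variance, i.e. the quantity above divided by n.
    static double populationVariance(double s, double ss, double n) {
        return ss / n - (s * s) / (n * n);
    }

    public static void main(String[] args) {
        double[] values = {1, 2, 3, 4}; // hypothetical sample data
        double s = 0;
        double ss = 0;
        for (double v : values) {
            s += v;       // running sum
            ss += v * v;  // running sum of squares
        }
        double n = values.length;

        System.out.println(sumOfSquaredDeviations(s, ss, n)); // prints 5.0
        System.out.println(populationVariance(s, ss, n));     // prints 1.25
    }
}
```

For this sample, the first value is exactly n times the second (5.0 = 4 * 1.25), which is the factor of n the comment above refers to.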