[ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187050#comment-13187050 ]
Wang Yue commented on MAHOUT-945:
---------------------------------

Hi,
I still doubt the correctness of not dividing by n because of rounding error. Please correct me if I am wrong.

Below is a counterexample. To illustrate, I generate 110 numbers in two groups.

First group, 10 numbers:

  > t1=rnorm(10,1,1)
  > var(t1)
  [1] 1.667472

Second group, 100 numbers:

  > t2=rnorm(100,1,0.5)
  > var(t2)
  [1] 0.1928758

The overall variance of the 110 numbers:

  > t=c(t1,t2)
  > var(t)
  [1] 0.2339202

The split above is the first way of splitting the 110 numbers.

  group of 10:   variance 1.66,  your calculated variance is 16.6
  group of 100:  variance 0.19,  your calculated variance is 19
  overall:       variance 0.233, your calculated variance is 25.63

  your calculated variance reduction = 25.63 - (16.6 + 19) = about -10
  real variance reduction            = 0.233 - (1.66 + 0.19) = -1.62

Second split:

  > tt2=t2[1:10]
  > var(tt2)
  [1] 0.1673325
  > tt1=c(t2[11:100],t1)
  > var(tt1)
  [1] 0.3757684

  group of 10:   variance 0.167, your calculated variance is 1.67
  group of 100:  variance 0.375, your calculated variance is 37.5
  overall:       variance 0.233, your calculated variance is 25.63

  your calculated variance reduction = 25.63 - (1.67 + 37.5) = about -13
  real variance reduction            = 0.233 - (0.167 + 0.375) = about -0.3

Your program will choose the first split, while the correct choice may be the second split.

> The variance calculation of Random forest regression tree
> ---------------------------------------------------------
>
>                 Key: MAHOUT-945
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-945
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.6
>            Reporter: Wang Yue
>              Labels: Regressionsplit.java
>         Attachments: MAHOUT-945.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, Mukai
> Thanks for your efforts in expanding the RF to regression. However, I have a
> doubt about your implementation regarding Regressionsplit.java.
> The variance method
> "
>   private static double variance(double[] s, double[] ss, double[] dataSize) {
>     double var = 0;
>     for (int i = 0; i < s.length; i++) {
>       if (dataSize[i] > 0) {
>         var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
>       }
>     }
>     return var;
>   }
> "
> While the variance in my mind should be something like
>   var += ss[i]/dataSize[i] - ((s[i] * s[i]) / (dataSize[i]*dataSize[i]));
> Please help correct me if I am wrong. Thanks
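
For reference, the two expressions in the quoted description differ only by a per-group factor of dataSize[i]: ss[i] - s[i]^2 / dataSize[i] is the sum of squared deviations from the group mean, and dividing it by dataSize[i] gives the population variance ss[i]/dataSize[i] - (s[i]/dataSize[i])^2. The sketch below puts the two side by side. The class name, method names, and sample data are hypothetical and chosen only for illustration; this is not the Mahout code itself.

```java
// Hypothetical illustration; names and data are not taken from Mahout.
final class VarianceComparison {

  // Expression from the quoted variance method: for each group, the sum of
  // squared deviations from the group mean, ss - s^2 / n, added over groups.
  static double sumOfSquaredDeviations(double[] s, double[] ss, double[] dataSize) {
    double total = 0;
    for (int i = 0; i < s.length; i++) {
      if (dataSize[i] > 0) {
        total += ss[i] - (s[i] * s[i]) / dataSize[i];
      }
    }
    return total;
  }

  // Expression proposed in the issue: for each group, the population
  // variance, ss / n - (s / n)^2, added over groups.
  static double sumOfVariances(double[] s, double[] ss, double[] dataSize) {
    double total = 0;
    for (int i = 0; i < s.length; i++) {
      if (dataSize[i] > 0) {
        total += ss[i] / dataSize[i] - (s[i] * s[i]) / (dataSize[i] * dataSize[i]);
      }
    }
    return total;
  }

  public static void main(String[] args) {
    // Made-up groups {1, 3} and {2, 2, 2, 6}:
    // s holds the per-group sums, ss the per-group sums of squares.
    double[] s = {4, 12};
    double[] ss = {10, 48};
    double[] n = {2, 4};

    System.out.println(sumOfSquaredDeviations(s, ss, n)); // prints 14.0 (2 + 12)
    System.out.println(sumOfVariances(s, ss, n));         // prints 4.0  (1 + 3)
  }
}
```

Because the factor is the group size, the two quantities weight groups differently whenever the groups are of unequal size, which is the situation the counterexample above is built around.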