[ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ikumasa Mukai updated MAHOUT-945:
---------------------------------
    Attachment: MAHOUT-945.patch

Hi,

I have made a new patch that addresses Wang-san's point. Thank you, Wang-san.

In this patch I use FullRunningAverageAndStdDev instead of my own code to calculate the variances. For performance, the patch also modifies FullRunningAverageAndStdDev. I would appreciate it if you could check whether that modification is acceptable.

Regards,

> The variance calculation of Random forest regression tree
> ---------------------------------------------------------
>
>                 Key: MAHOUT-945
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-945
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.6
>            Reporter: Wang Yue
>              Labels: Regressionsplit.java
>         Attachments: MAHOUT-945.patch, MAHOUT-945.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, Mukai
> Thanks for your efforts in expanding the RF to regression. However, I have a
> doubt about your implementation in Regressionsplit.java. The
> variance method is
> "
> private static double variance(double[] s, double[] ss, double[] dataSize) {
>   double var = 0;
>   for (int i = 0; i < s.length; i++) {
>     if (dataSize[i] > 0) {
>       var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
>     }
>   }
>   return var;
> }
> "
> while the variance in my mind should be something like
> var += ss[i]/dataSize[i] - ((s[i] * s[i]) / (dataSize[i]*dataSize[i]));
> Please correct me if I am wrong. Thanks
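
For reference, here is a minimal sketch of the two formulas discussed above, written side by side so the difference is easy to see. The class and method names (VarianceSketch, sumOfSquaredDeviations, sumOfVariances) are hypothetical and are not taken from Mahout or from the attached patch. As in the quoted method, s[i] is assumed to hold the sum of the values in partition i, ss[i] the sum of their squares, and dataSize[i] the number of values.

```java
final class VarianceSketch {

  private VarianceSketch() { }

  /**
   * Form used in the quoted method: for each partition, ss - s^2 / n.
   * This equals the sum of squared deviations from the partition mean,
   * i.e. n times the population variance.
   */
  public static double sumOfSquaredDeviations(double[] s, double[] ss, double[] dataSize) {
    double total = 0.0;
    for (int i = 0; i < s.length; i++) {
      if (dataSize[i] > 0) {
        total += ss[i] - (s[i] * s[i]) / dataSize[i];
      }
    }
    return total;
  }

  /**
   * Form proposed by the reporter: for each partition, ss / n - (s / n)^2.
   * This is the population variance of the partition, with no weighting
   * by partition size.
   */
  public static double sumOfVariances(double[] s, double[] ss, double[] dataSize) {
    double total = 0.0;
    for (int i = 0; i < s.length; i++) {
      if (dataSize[i] > 0) {
        double mean = s[i] / dataSize[i];
        total += ss[i] / dataSize[i] - mean * mean;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    // Two illustrative partitions: {1, 2, 3} and {10}.
    double[] s = {6.0, 10.0};      // sums
    double[] ss = {14.0, 100.0};   // sums of squares
    double[] n = {3.0, 1.0};       // counts
    System.out.println(sumOfSquaredDeviations(s, ss, n));
    System.out.println(sumOfVariances(s, ss, n));
  }
}
```

The two results differ only by the per-partition factor dataSize[i]: the first form weights each partition's variance by its size, the second does not. Weighting by size is a common choice for regression-tree split criteria, but whether it is what this code intends is for the patch review to decide. Both forms can also lose precision when the mean is large relative to the spread, which is one reason a running (incremental) computation may be preferable.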