Hi Ted-san,

I made a patch using Welford's method, as you advised, instead of the weighted incremental algorithm.
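For reference, here is a minimal sketch of a Welford-style update in Java. It is only an illustration and not the code in the patch: the class name WelfordSketch and its method names are made up for this example, and returning 0 when no value has been added is an assumption. The variance divides the accumulated sum of squared deviations by n, which is the point raised in the thread quoted below.

```java
/** Hypothetical illustration class; not part of the Mahout patch. */
final class WelfordSketch {

  private long n;       // number of values seen so far
  private double mean;  // running mean
  private double m2;    // running sum of squared deviations from the mean

  /** Folds one value into the running statistics in a single pass. */
  void add(double value) {
    n++;
    double delta = value - mean;    // deviation from the old mean
    mean += delta / n;              // update the running mean
    m2 += delta * (value - mean);   // old-mean deviation times new-mean deviation
  }

  double getMean() {
    return mean;
  }

  /** Population variance (divides by n). Returns 0 for empty input (assumption). */
  double getVariance() {
    return n > 0 ? m2 / n : 0.0;
  }

  public static void main(String[] args) {
    WelfordSketch w = new WelfordSketch();
    for (double v : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
      w.add(v);
    }
    // For this data the mean is 5 and the population variance is 4,
    // up to floating-point rounding.
    System.out.println(w.getMean() + " " + w.getVariance());
  }
}
```

Dividing by n gives the population variance. Dividing by n - 1 instead would give the sample variance, so which one is wanted depends on how the split criterion is defined.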
The duplicated code is now being checked so that it can be merged with FullRunningAverageAndStdDev.

Thanks,

2012/1/16 Ted Dunning <[email protected]>:
> Why not just use an OnlineAccumulator? Why duplicate code?
>
> On Sun, Jan 15, 2012 at 11:59 AM, Wang Yue (Commented) (JIRA) <
> [email protected]> wrote:
>
>>
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186485#comment-13186485]
>>
>> Wang Yue commented on MAHOUT-945:
>> ---------------------------------
>>
>> Hi, Ikumaso Mukai,
>> Thanks for your improvement, I realize that you actually implement the
>> new online version of variance calculation according to
>> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance,
>> however, the problem I indicate still exists, that is, the final variance
>> should divide by n(which is sample size.) So, I would suggest to modify the
>> third last line of following code, do you think so?
>>
>> +  /**
>> +   * Calculator for variance calculation
>> +   */
>> +  private static class VarianceCalculator {
>> +
>> +    private int n;
>> +    private double mean;
>> +    private double var;
>> +
>> +    void add(double value) {
>> +      n++;
>> +      double oldMean = mean;
>> +      mean += (value - mean) / n;
>> +      double diff = (value - mean) * (value - oldMean);
>> +      var += diff;
>> +    }
>> +
>> +    double getVariance() {
>> +      return var/n; //// suggested by Wang Yue
>> +    }
>> +  }
>>
>>
>> > The variance calculation of Random forest regression tree
>> > ---------------------------------------------------------
>> >
>> >                 Key: MAHOUT-945
>> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-945
>> >             Project: Mahout
>> >          Issue Type: Improvement
>> >          Components: Classification
>> >    Affects Versions: 0.6
>> >            Reporter: Wang Yue
>> >              Labels: Regressionsplit.java
>> >         Attachments: MAHOUT-945.patch
>> >
>> >   Original Estimate: 48h
>> >  Remaining Estimate: 48h
>> >
>> > Hi, Mukai
>> > Thanks for your efforts in expand the RF to regression. However, I
>> > have a doubt about your implementation regarding to Regressionsplit.java.
>> > The variance method
>> > "
>> > private static double variance(double[] s, double[] ss, double[] dataSize) {
>> >   double var = 0;
>> >   for (int i = 0; i < s.length; i++) {
>> >     if (dataSize[i] > 0) {
>> >       var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
>> >     }
>> >   }
>> >   return var;
>> > }
>> > "
>> > While the variance in my mind should be something like
>> > var += ss[i]/dataSize[i] - ((s[i] * s[i]) / (dataSize[i]*dataSize[i]));
>> > Please help correct me if I am wrong. Thanks
>>
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators:
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>
>>

--
- - - - - - -
IKumasa Mukai at Recruit Co.,Ltd.
