Why not just use an OnlineAccumulator?  Why duplicate code?

On Sun, Jan 15, 2012 at 11:59 AM, Wang Yue (Commented) (JIRA) <
[email protected]> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186485#comment-13186485]
>
> Wang Yue commented on MAHOUT-945:
> ---------------------------------
>
> Hi, Ikumaso Mukai,
>  Thanks for your improvement. I see that you have implemented the
> online version of the variance calculation according to
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance.
> However, the problem I pointed out still exists: the final variance
> should be divided by n (the sample size). So I would suggest modifying the
> third-to-last line of the following code. What do you think?
>
> +  /**
> +   * Calculator for variance calculation
> +   */
> +  private static class VarianceCalculator {
> +
> +    private int n;
> +    private double mean;
> +    private double var;
> +
> +    void add(double value) {
> +      n++;
> +      double oldMean = mean;
> +      mean += (value - mean) / n;
> +      double diff = (value - mean) * (value - oldMean);
> +      var += diff;
> +    }
> +
> +    double getVariance() {
> +      return var / n;   // suggested by Wang Yue
> +    }
> +  }
>
> > The variance calculation of Random forest regression tree
> > ---------------------------------------------------------
> >
> >                 Key: MAHOUT-945
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-945
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Classification
> >    Affects Versions: 0.6
> >            Reporter: Wang Yue
> >              Labels: Regressionsplit.java
> >         Attachments: MAHOUT-945.patch
> >
> >   Original Estimate: 48h
> >  Remaining Estimate: 48h
> >
> > Hi, Mukai
> >   Thanks for your efforts in extending the RF to regression. However, I
> > have a doubt about your implementation of Regressionsplit.java.
> > The variance method
> > "
> >  private static double variance(double[] s, double[] ss, double[] dataSize) {
> >     double var = 0;
> >     for (int i = 0; i < s.length; i++) {
> >       if (dataSize[i] > 0) {
> >         var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
> >       }
> >     }
> >     return var;
> >   }
> > "
> > whereas I think the variance should be something like
> > var += ss[i]/dataSize[i] - ((s[i] * s[i]) / (dataSize[i]*dataSize[i]));
> > Please correct me if I am wrong. Thanks.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>
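
For reference, the two computations discussed in the quoted thread can be
compared side by side. The following is only a minimal sketch, not Mahout
code: the class and method names (VarianceSketch, RunningVariance,
populationVariance) are made up for illustration. It shows the online update
from the quoted patch with the suggested division by n, next to the
sum / sum-of-squares form from the issue description. Note that
ss - (s * s) / n, as in the original variance() method, is the sum of squared
deviations, i.e. n times the population variance.

```java
// Illustrative sketch only; none of these names come from Mahout.
class VarianceSketch {

  /** Online (Welford-style) accumulator, as in the quoted patch. */
  static final class RunningVariance {
    private long n;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    void add(double value) {
      n++;
      double oldMean = mean;
      mean += (value - oldMean) / n;
      m2 += (value - mean) * (value - oldMean);
    }

    /** Population variance: squared deviations divided by n (0 if empty). */
    double populationVariance() {
      return n > 0 ? m2 / n : 0.0;
    }
  }

  /** Same quantity from a running sum s and sum of squares ss. */
  static double populationVariance(double s, double ss, long n) {
    if (n <= 0) {
      return 0.0;
    }
    double mean = s / n;
    return ss / n - mean * mean;
  }

  public static void main(String[] args) {
    double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
    RunningVariance rv = new RunningVariance();
    double s = 0;
    double ss = 0;
    for (double x : data) {
      rv.add(x);
      s += x;
      ss += x * x;
    }
    // Both values should agree up to floating-point rounding.
    System.out.println("online:       " + rv.populationVariance());
    System.out.println("sum/sum-sq:   " + populationVariance(s, ss, data.length));
    // The undivided form from the original variance() method.
    System.out.println("ss - s*s/n:   " + (ss - (s * s) / data.length));
  }
}
```

The sum / sum-of-squares form can lose precision when the mean is large
relative to the spread, which is the usual argument for the online update.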
