Hi Ted-san.

I made a patch using Welford's method, as you advised, instead of the
weighted incremental algorithm.

I am now checking whether the duplicated code can be merged with
FullRunningAverageAndStdDev.
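
For reference, the idea is roughly as follows. This is only a minimal
sketch and not the patch itself: the class name WelfordSketch and its
method names are made up for illustration, and returning 0.0 for an
empty accumulator is my own assumption.

```java
/**
 * Hypothetical helper (not the class from the patch): running mean and
 * variance computed in one pass with Welford's method.
 */
final class WelfordSketch {

  private long n;      // number of values seen so far
  private double mean; // running mean
  private double m2;   // running sum of squared deviations from the mean

  void add(double value) {
    n++;
    double delta = value - mean;   // deviation from the old mean
    mean += delta / n;             // update the mean
    m2 += delta * (value - mean);  // old deviation times new deviation
  }

  long getCount() {
    return n;
  }

  double getMean() {
    return mean;
  }

  /** Equals sum(x^2) - sum(x)^2 / n, up to rounding. */
  double getSumOfSquaredDeviations() {
    return m2;
  }

  /** Population variance: the sum of squared deviations divided by n. */
  double getPopulationVariance() {
    // Assumption: an empty accumulator reports 0.0 rather than NaN.
    return n > 0 ? m2 / n : 0.0;
  }
}
```

Here getSumOfSquaredDeviations() corresponds to the quantity the original
variance() method accumulated (ss - s * s / dataSize), and
getPopulationVariance() is that quantity divided by n, as Wang Yue
suggested.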

Thanks,

2012/1/16 Ted Dunning <[email protected]>:
> Why not just use an OnlineAccumulator?  Why duplicate code?
>
> On Sun, Jan 15, 2012 at 11:59 AM, Wang Yue (Commented) (JIRA) <
> [email protected]> wrote:
>
>>
>>    [
>> https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186485#comment-13186485]
>>
>> Wang Yue commented on MAHOUT-945:
>> ---------------------------------
>>
>> Hi, Ikumaso Mukai,
>>  Thanks for your improvement. I see that you implemented the new online
>> version of the variance calculation according to
>> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance.
>> However, the problem I pointed out still exists: the final variance
>> should be divided by n (the sample size). So I would suggest modifying the
>> third-to-last line of the following code. What do you think?
>>
>> +  /**
>> +   * Calculator for variance calculation
>> +   */
>> +  private static class VarianceCalculator {
>> +
>> +    private int n;
>> +    private double mean;
>> +    private double var;
>> +
>> +    void add(double value) {
>> +      n++;
>> +      double oldMean = mean;
>> +      mean += (value - mean) / n;
>> +      double diff = (value - mean) * (value - oldMean);
>> +      var += diff;
>> +    }
>> +
>> +    double getVariance() {
>> +      return var / n;   // suggested by Wang Yue
>> +    }
>> +  }
>>
>> > The variance calculation of Random forest regression tree
>> > ---------------------------------------------------------
>> >
>> >                 Key: MAHOUT-945
>> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-945
>> >             Project: Mahout
>> >          Issue Type: Improvement
>> >          Components: Classification
>> >    Affects Versions: 0.6
>> >            Reporter: Wang Yue
>> >              Labels: Regressionsplit.java
>> >         Attachments: MAHOUT-945.patch
>> >
>> >   Original Estimate: 48h
>> >  Remaining Estimate: 48h
>> >
>> > Hi, Mukai
>> >   Thanks for your efforts in extending the RF to regression. However, I
>> > have a doubt about your implementation of Regressionsplit.java.
>> > The variance method
>> > "
>> >  private static double variance(double[] s, double[] ss, double[] dataSize) {
>> >     double var = 0;
>> >     for (int i = 0; i < s.length; i++) {
>> >       if (dataSize[i] > 0) {
>> >         var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
>> >       }
>> >     }
>> >     return var;
>> >   }
>> > "
>> > In my understanding, the variance should be something like
>> > var += ss[i]/dataSize[i] - ((s[i] * s[i]) / (dataSize[i]*dataSize[i]));
>> > Please correct me if I am wrong. Thanks
>>
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators:
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>
>>



-- 
- - - - - - -
IKumasa Mukai at Recruit Co.,Ltd.
