ani5rudh commented on code in PR #52: URL: https://github.com/apache/commons-statistics/pull/52#discussion_r1299203658
##########
commons-statistics-descriptive/src/main/java/org/apache/commons/statistics/descriptive/Variance.java:
##########
@@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.statistics.descriptive;
+
+/**
+ * Computes the variance of a set of values. By default, the
+ * "sample variance" is computed. The definitional formula for sample
+ * variance is:
+ * <p>
+ * sum((x_i - mean)^2) / (n - 1)
+ * <p>This formula does not have good numerical properties, so this
+ * implementation does not use it to compute the statistic.
+ * <ul>
+ * <li> The {@link #accept(double)} method computes the variance using
+ * updating formulae based on West's algorithm, as described in
+ * <a href="http://doi.acm.org/10.1145/359146.359152"> Chan, T. F. and
+ * J. G. Lewis 1979, <i>Communications of the ACM</i>,
+ * vol. 22 no. 9, pp. 526-531.</a></li>
+ *
+ * <li> The {@link #of(double...)} method leverages the fact that it has the
+ * full array of values in memory to execute a two-pass algorithm.
+ * Specifically, this method uses the "corrected two-pass algorithm" from
+ * Chan, Golub, Levesque, <i>Algorithms for Computing the Sample Variance</i>,
+ * American Statistician, vol. 37, no. 3 (1983) pp. 242-247.</li></ul>
+ *
+ * Note that adding values using {@code accept} and then executing {@code getAsDouble} will
+ * sometimes give a different, less accurate, result than executing
+ * {@code of} with the full array of values. The former approach
+ * should only be used when the full array of values is not available.
+ *
+ * <p>
+ * Returns <code>Double.NaN</code> if no data values have been added and
+ * returns <code>0</code> if there is just one finite value in the data set.
+ * Note that <code>Double.NaN</code> may also be returned if the input includes
+ * <code>Double.NaN</code> and / or infinite values.
+ *
+ * <p>This class is designed to work with (though does not require)
+ * {@linkplain java.util.stream streams}.
+ *
+ * <p><strong>Note that this implementation is not synchronized.</strong> If
+ * multiple threads access an instance of this class concurrently, and at least
+ * one of the threads invokes the <code>increment()</code> or
+ * <code>clear()</code> method, it must be synchronized externally.
+ *
+ * <p>However, it is safe to use <code>accept()</code> and <code>combine()</code>
+ * as <code>accumulator</code> and <code>combiner</code> functions of
+ * {@link java.util.stream.Collector Collector} on a parallel stream,
+ * because the parallel implementation of {@link java.util.stream.Stream#collect Stream.collect()}
+ * provides the necessary partitioning, isolation, and merging of results for
+ * safe and efficient parallel execution.
+ *
+ * @since 1.1
+ */
+public abstract class Variance implements DoubleStatistic, DoubleStatisticAccumulator<Variance> {
+
+    /**
+     * Create a Variance instance.
+     */
+    Variance() {
+        // No-op
+    }
+
+    /**
+     * Creates a {@code Variance} implementation which does not store the input value(s) it consumes.
+     *
+     * <p>The result is <code>NaN</code> if:
+     * <ul>
+     * <li>no values have been added,</li>
+     * <li>any of the values is <code>NaN</code>, or</li>
+     * <li>an infinite value of either sign is encountered</li>
+     * </ul>
+     *
+     * @return {@code Variance} implementation.
+     */
+    public static Variance create() {
+        return new StorelessSampleVariance();
+    }
+
+    /**
+     * Returns a {@code Variance} instance that has the variance of all input values, or <code>NaN</code>
+     * if:
+     * <ul>
+     * <li>the input array is empty,</li>
+     * <li>any of the values is <code>NaN</code>, or</li>
+     * <li>an infinite value of either sign is encountered</li>
+     * </ul>
+     *
+     * <p>Note: {@code Variance} computed using {@link Variance#accept Variance.accept()} may be different
+     * from this variance.
+     *
+     * <p>See {@link Variance} for details on the computing algorithm.
+     *
+     * @param values Values.
+     * @return {@code Variance} instance.
+     */
+    public static Variance of(double... values) {
+        final double mean = Mean.of(values).getAsDouble();
+        double accum = 0.0;
+        double dev;
+        double accum2 = 0.0;
+        double squaredDevSum;
+        for (final double value : values) {
+            dev = value - mean;
+            accum += dev * dev;
+            accum2 += dev;
+        }
+        final double accum2Squared = accum2 * accum2;
+        final long n = values.length;
+        // The sum of squared deviations is accum - (accum2 * accum2 / n).
+        // To prevent squaredDevSum from spuriously attaining a NaN value
+        // when accum2Squared (which implies accum is also infinite) is infinite, assign it
+        // an infinite value which is its intended value.
+        if (accum2Squared == Double.POSITIVE_INFINITY) {
+            squaredDevSum = Double.POSITIVE_INFINITY;
+        } else {
+            squaredDevSum = accum - (accum2 * accum2 / n);
+        }
+        return StorelessSampleVariance.create(squaredDevSum, mean, n, accum2 + (mean * n));

Review Comment:
West's algorithm is used in the `accept(double d)` method for updating the value of the statistic. The two-pass algorithm used in the `of(double...
values)` method is:

![image](https://github.com/apache/commons-statistics/assets/129569933/f1c9eb96-2926-4a5d-a61b-c14a5718c78a)

This formula is from the technical report [Algorithms for computing the sample variance: analysis and recommendations](https://www.cs.yale.edu/publications/techreports/tr222.pdf).
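
For readers who cannot view the image, the two approaches can be sketched roughly as below. This is a minimal illustration and not the code under review: the class `VarianceSketch` and its methods `twoPass` and `streaming` are hypothetical names. The expression in `twoPass` follows the code comment in `of` (`accum - (accum2 * accum2 / n)`), and `streaming` uses a Welford/West-style update that may differ in detail from what `StorelessSampleVariance` does. The sketch assumes finite input and does no special handling of NaN or infinite values.

```java
/** Illustrative sketch only; names are hypothetical and not part of the PR. */
final class VarianceSketch {

    private VarianceSketch() {
        // Static helpers only.
    }

    /** Sample variance using the corrected two-pass formula. */
    static double twoPass(double... values) {
        final int n = values.length;
        if (n == 0) {
            return Double.NaN;
        }
        if (n == 1) {
            return 0.0;
        }
        // First pass: the mean.
        double sum = 0.0;
        for (final double v : values) {
            sum += v;
        }
        final double mean = sum / n;
        // Second pass: sum of squared deviations and sum of deviations.
        double sumSqDev = 0.0; // "accum" in the PR
        double sumDev = 0.0;   // "accum2" in the PR
        for (final double v : values) {
            final double dev = v - mean;
            sumSqDev += dev * dev;
            sumDev += dev;
        }
        // The correction term compensates for rounding error in the computed mean.
        return (sumSqDev - sumDev * sumDev / n) / (n - 1);
    }

    /** Sample variance using a one-pass updating formula (Welford/West style). */
    static double streaming(double... values) {
        long n = 0;
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the current mean
        for (final double v : values) {
            n++;
            final double delta = v - mean;
            mean += delta / n;
            m2 += delta * (v - mean);
        }
        if (n == 0) {
            return Double.NaN;
        }
        if (n == 1) {
            return 0.0;
        }
        return m2 / (n - 1);
    }

    public static void main(String[] args) {
        // Values with a large common offset, where rounding effects are more visible.
        final double[] data = {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16};
        System.out.println("two-pass:  " + twoPass(data));
        System.out.println("streaming: " + streaming(data));
    }
}
```

Comparing the two methods on the same array is one way to observe the Javadoc note that `accept` followed by `getAsDouble` can give a slightly different result than `of`.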