Re: [Numpy-discussion] non-standard standard deviation
Anne Archibald wrote:
> 2009/11/29 Dr. Phillip M. Feldman pfeld...@verizon.net:
>> All of the statistical packages that I am currently using and have used in
>> the past (Matlab, Minitab, R, S-plus) calculate standard deviation using
>> the sqrt(1/(n-1)) normalization, which gives a result that is unbiased
>> when sampling from a normally-distributed population. NumPy uses the
>> sqrt(1/n) normalization. I'm currently using the following code to
>> calculate standard deviations, but would much prefer if this could be
>> fixed in NumPy itself:
>
> This issue was the subject of lengthy discussions on the mailing list, the
> upshot of which is that in current versions of scipy, std and var take an
> optional argument ddof, into which you can supply 1 to get the
> normalization you want.
>
> Anne

You are right that I can get the result that I want by setting ddof. Thanks!

I still feel that the default value for ddof should be 1 rather than 0; new
users are unlikely to read the documentation for a command like std, because
it is reasonable to expect standard behavior across all statistical packages.

Phillip

--
View this message in context:
http://old.nabble.com/non-standard-standard-deviation-tp26566808p26753999.html
Sent from the Numpy-discussion mailing list archive at Nabble.com.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
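As an aside for readers of the archive: the ddof argument discussed above works like this (a minimal sketch; the array values are illustrative):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# NumPy's default normalization: divide by n (ddof=0)
s_n = np.std(x)           # sqrt(32 / 8) = 2.0

# Matlab/R/S-plus behavior: divide by n - 1 (ddof=1)
s_n1 = np.std(x, ddof=1)  # sqrt(32 / 7), about 2.138
```

Generally, the divisor used is n - ddof, so other corrections (e.g. for multiple estimated means) can be expressed with the same argument.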
Re: [Numpy-discussion] non-standard standard deviation
On 04-Dec-09 10:54 AM, Bruce Southey wrote:
> On 12/04/2009 06:18 AM, yogesh karpate wrote:
>> [snip]
>
> Hi,
> Basically, all that I see with these arbitrary values is that you are
> relying on the 'central limit theorem'
> (http://en.wikipedia.org/wiki/Central_limit_theorem). Really the issue in
> using these values is how much statistical bias you will tolerate,
> especially in the impact on usage of that estimate, because the usage of
> variance (such as in statistical tests) tends to be more influenced by
> bias than the estimate of variance. (Of course, many features rely on
> asymptotic properties, so bias concerns are less apparent in large sample
> sizes.) Obviously the default reflects the developer's background and
> requirements.
>
> There are multiple valid variance estimators in statistics with different
> denominators, like N (maximum likelihood estimator), N-1 (restricted
> maximum likelihood estimator and certain Bayesian estimators) and Stein's
> (http://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator). So the
> current default behavior is valid and documented. Consequently you cannot
> just have one option or different functions (like certain programs), and
> Numpy's implementation actually allows you to do all these in a single
> function. So I also see no reason to change, even if I have to add the
> ddof=1 argument; after all, 'Explicit is better than implicit' :-).
>
> Bruce

Bruce,

I suggest that the Central Limit Theorem is tied in with the Law of Large
Numbers. When one has a smallish sample size, what gives the best estimate
of the variance?

The Bessel correction provides a rationale, based on expectations:
http://en.wikipedia.org/wiki/Bessel%27s_correction

It is difficult to understand the proof of Stein:
http://en.wikipedia.org/wiki/Proof_of_Stein%27s_example
The symbols used are not clearly stated. He seems interested in a decision
rule for the calculation of the mean of a sample and claims that his
approach is better than the traditional Least Squares approach.

In most cases, the interest is likely to be in the variance, with a view to
establishing a confidence interval. In the widely used Analysis of Variance
(ANOVA), the degrees of freedom are reduced for each mean estimated; see
http://www.mnstate.edu/wasson/ed602lesson13.htm for the example below:

Analysis of Variance Table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio     p
Between Groups             25.20               2               12.60       5.178    < .05
Within Groups              29.20              12                2.43
Total                      54.40              14

There is a sample of 15 observations, which is divided into three groups
depending on the number of hours of therapy. Thus, the Total degrees of
freedom are 15 - 1 = 14, the Between Groups 3 - 1 = 2, and the Residual is
14 - 2 = 12.

Colin W.
Re: [Numpy-discussion] non-standard standard deviation
On Sun, Dec 6, 2009 at 11:01 AM, Colin J. Williams c...@ncf.ca wrote:
> [snip]
>
> When one has a smallish sample size, what gives the best estimate of the
> variance?
> [snip]
>
> In most cases, the interest is likely to be in the variance, with a view
> to establishing a confidence interval.

What's the best estimate? That's the main question.

Estimators differ in their (sample or posterior) distribution, especially
in bias and variance. The Stein estimator dominates OLS in mean squared
error: although it is biased, its variance is smaller than that of OLS, so
its MSE (squared bias plus variance) is also smaller than that of OLS.
Depending on the application there could be many possible loss functions,
including asymmetric ones, e.g. if it is more costly to over- than to
underestimate.
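The bias/variance trade-off described here can be illustrated with a small simulation: for samples from a normal distribution, the biased 1/n variance estimator can have a smaller MSE than the unbiased 1/(n-1) one. (The sample size, seed, and replication count below are arbitrary choices for illustration.)

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps, true_var = 5, 100_000, 1.0

samples = rng.normal(0.0, 1.0, size=(reps, n))
v0 = samples.var(axis=1, ddof=0)  # 1/n, maximum likelihood
v1 = samples.var(axis=1, ddof=1)  # 1/(n-1), unbiased

for name, v in (("ddof=0", v0), ("ddof=1", v1)):
    bias = v.mean() - true_var
    mse = ((v - true_var) ** 2).mean()
    print(f"{name}: bias {bias:+.3f}, MSE {mse:.3f}")
```

For normal samples the theory gives E[v0] = (n-1)/n, i.e. a bias of -0.2 at n=5, yet v0's MSE is smaller than v1's; which estimator is "best" depends on the loss function chosen.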
The following was a good book for this, that I read a long time ago:
Statistical Decision Theory and Bayesian Analysis, by James O. Berger
http://books.google.ca/books?id=oY_x7dE15_AC

> In the widely used Analysis of Variance (ANOVA), the degrees of freedom
> are reduced for each mean estimated; see
> http://www.mnstate.edu/wasson/ed602lesson13.htm for the example below:
> [snip -- ANOVA table]

Statistical tests are the only area where I really pay attention to the
degrees of freedom, since the
Re: [Numpy-discussion] non-standard standard deviation
On Sun, Dec 6, 2009 at 9:21 AM, josef.p...@gmail.com wrote:
> On Sun, Dec 6, 2009 at 11:01 AM, Colin J. Williams c...@ncf.ca wrote:
> [snip]
>
> What's the best estimate? That's the main question. Estimators differ in
> their (sample or posterior) distribution, especially in bias and variance.
> [snip]
>
> The following was a good book for this, that I read a long time ago:
> Statistical Decision Theory and Bayesian Analysis, by James O. Berger
> http://books.google.ca/books?id=oY_x7dE15_AC

At last, an explanation I can understand. Thanks Josef.

Chuck
Re: [Numpy-discussion] non-standard standard deviation
Colin J. Williams wrote:
> When one has a smallish sample size, what gives the best estimate of the
> variance?

What do you mean by best estimate? Unbiased? Smallest standard error?

> In the widely used Analysis of Variance (ANOVA), the degrees of freedom
> are reduced for each mean estimated,

That is for statistical tests, not to compute estimators.
Re: [Numpy-discussion] non-standard standard deviation
On Sun, Dec 6, 2009 at 11:36 AM, Sturla Molden stu...@molden.no wrote:
> Colin J. Williams wrote:
>> When one has a smallish sample size, what gives the best estimate of the
>> variance?
>
> What do you mean by best estimate? Unbiased? Smallest standard error?
> [snip]

Ignoring the estimation method, there is no correct answer unless you
impose various conditions, like the minimum-variance unbiased estimator
(http://en.wikipedia.org/wiki/Minimum_variance_unbiased), where usually N-1
wins. Anyhow, this is way off topic since it is totally in the realm of
math stats.

The law of large numbers (http://en.wikipedia.org/wiki/Law_of_large_numbers)
addresses only the average, not the variance, so it is not directly
applicable.

Bruce
Re: [Numpy-discussion] non-standard standard deviation
On 04-Dec-09 05:21 AM, Pauli Virtanen wrote:
> Fri, 2009-12-04 11:19 +0100, Chris Colbert wrote:
>> Why can't the divisor constant just be made an optional kwarg that
>> defaults to zero?
>
> It already is an optional kwarg that defaults to zero.
>
> Cheers,

I suggested that 1 (one) would be a better default, but Robert Kern told us
that it won't happen.

Colin W.
Re: [Numpy-discussion] non-standard standard deviation
On 04-Dec-09 07:18 AM, yogesh karpate wrote:
> [snip]

Yogesh,

Thanks for the Bessel name; I hadn't come across that before. The Wikipedia
reference for the Bessel correction uses a divisor of n-1:
http://en.wikipedia.org/wiki/Bessel%27s_correction

Perhaps the simplification for larger n comes from the fact that for large
n, 1/n is approximately 1/(n-1).

I would suggest C. E. Weatherburn - Mathematical Statistics, but I doubt
whether it is still widely available.

Colin W.
Re: [Numpy-discussion] non-standard standard deviation
Colin J. Williams wrote:
> I suggested that 1 (one) would be a better default but Robert Kern told us
> that it won't happen.

I don't even see the need for this keyword argument, as you can always
multiply the variance by n/(n-1) to get what you want. Also, normalization
by n gives the ML estimate (yes, it has a bias, but it is better anyway).
It is a common novice mistake to use 1/(n-1) as normalization, probably due
to poor advice in introductory statistics textbooks. It also seems that
frequentists are more scared of this bias boogey monster than Bayesians. It
may actually help beginners to avoid this mistake if numpy's implementation
prompts them to ask why the normalization is 1/n.

If numpy is to change the implementation of std, var, and cov, I suggest
using the two-pass algorithm to reduce rounding error. (I can provide C
code.) This is much more important than changing the normalization to a
bias-free but otherwise inferior value.

Sturla
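The two-pass approach Sturla mentions can be sketched in Python as follows; this is the textbook "corrected two-pass" formula, not necessarily the C code he would have contributed:

```python
from math import sqrt

def two_pass_std(data, ddof=0):
    """Corrected two-pass algorithm: first compute the mean, then sum
    squared deviations, with a correction term for accumulated rounding."""
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)      # second pass
    corr = sum(x - mean for x in data) ** 2 / n  # ~0 in exact arithmetic
    return sqrt((ss - corr) / (n - ddof))

print(two_pass_std([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # 2.0
```

Compared with a naive one-pass sum-of-squares-minus-squared-mean formula, this avoids catastrophic cancellation when the mean is large relative to the spread.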
Re: [Numpy-discussion] non-standard standard deviation
Why can't the divisor constant just be made an optional kwarg that defaults
to zero? It won't break any existing code, and will let everybody that
wants the other behavior have it.

On Thu, Dec 3, 2009 at 1:49 PM, Colin J. Williams c...@ncf.ca wrote:
> Yogesh,
>
> Could you explain the rationale for this choice please?
>
> Colin W.
>
> On 03-Dec-09 00:35 AM, yogesh karpate wrote:
>> The thing is that the normalization by (n-1) is done for the no. of
>> samples >20 or 23 (not sure about this no., but sure about the thing
>> that this no. isn't greater than 25) and below that we use normalization
>> by n.
>>
>> Regards
>> ~ymk
Re: [Numpy-discussion] non-standard standard deviation
Fri, 2009-12-04 11:19 +0100, Chris Colbert wrote:
> Why can't the divisor constant just be made an optional kwarg that
> defaults to zero?

It already is an optional kwarg that defaults to zero.

Cheers,
--
Pauli Virtanen
Re: [Numpy-discussion] non-standard standard deviation
Thu, 03 Dec 2009 11:05:07 +0530, yogesh karpate wrote:
> The thing is that the normalization by (n-1) is done for the no. of
> samples >20 or 23 (not sure about this no., but sure about the thing that
> this no. isn't greater than 25) and below that we use normalization by n.
>
> Regards
> ~ymk

Just to clarify: Numpy (of course) does not change the divisor depending on
`n` -- Yogesh's post probably concerns some code of his own.

--
Pauli Virtanen
Re: [Numpy-discussion] non-standard standard deviation
@ Pauli and @ Colin:

Sorry for the late reply. I was busy with some other assignments.

# As far as normalization by (n) is concerned, it's a common assumption
that the population is normally distributed and the population size is
large enough to fit the normal distribution. But this standard deviation,
when applied to a small population, tends to be too low; therefore it is
called biased.
# The correction known as the Bessel correction is there for the
small-sample-size std. deviation, i.e. normalization by (n-1).
# In Electrical and Electronic Measurements and Instrumentation by A. K.
Sawhney, in the 1st chapter of the book, "Fundamentals of Measurements",
it's shown that for N=16 the std. deviation normalization was (n-1)=15.
# While I was learning statistics in my course, the instructor would advise
taking n=20 for normalization by (n-1).
# Probability and Statistics by the Schaum's Series is good reading.

Regards
~ymk
Re: [Numpy-discussion] non-standard standard deviation
On 12/04/2009 06:18 AM, yogesh karpate wrote:
> [snip]

Hi,

Basically, all that I see with these arbitrary values is that you are
relying on the 'central limit theorem'
(http://en.wikipedia.org/wiki/Central_limit_theorem). Really the issue in
using these values is how much statistical bias you will tolerate,
especially in the impact on usage of that estimate, because the usage of
variance (such as in statistical tests) tends to be more influenced by bias
than the estimate of variance. (Of course, many features rely on asymptotic
properties, so bias concerns are less apparent in large sample sizes.)
Obviously the default reflects the developer's background and requirements.

There are multiple valid variance estimators in statistics with different
denominators, like N (maximum likelihood estimator), N-1 (restricted
maximum likelihood estimator and certain Bayesian estimators) and Stein's
(http://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator). So the
current default behavior is valid and documented. Consequently you cannot
just have one option or different functions (like certain programs), and
Numpy's implementation actually allows you to do all these in a single
function. So I also see no reason to change, even if I have to add the
ddof=1 argument; after all, 'Explicit is better than implicit' :-).

Bruce
Re: [Numpy-discussion] non-standard standard deviation
This is getting OT, as I'm not making any comment on numpy's
implementation, but...

yogesh karpate wrote:
> # As far as normalization by (n) is concerned, it's a common assumption
> that the population is normally distributed and the population size is
> large enough to fit the normal distribution. But this standard deviation,
> when applied to a small population, tends to be too low; therefore it is
> called biased.

OK.

> # The correction known as the Bessel correction is there for the
> small-sample-size std. deviation, i.e. normalization by (n-1).

but why only small size -- the beauty of the approach is that the -1 makes
less and less difference the larger n gets.

> It's shown that for N=16 the std. deviation normalization was (n-1)=15.
> # While I was learning statistics in my course, the instructor would
> advise taking n=20 for normalization by (n-1).

Which introduces a discontinuity -- I never like discontinuities -- why
bother? For large n, it makes no practical difference; for small n you want
the -1 -- why arbitrarily decide what small is?

From an engineering/applied science point of view, I take the view
expressed in the Wikipedia page on Unbiased estimation of standard
deviation: "...the task has little relevance to applications of
statistics..."

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959 voice
7600 Sand Point Way NE  (206) 526-6329 fax
Seattle, WA 98115       (206) 526-6317 main reception

chris.bar...@noaa.gov
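Chris's point that the -1 matters less and less is easy to check numerically: the two normalizations differ by exactly a factor of sqrt(n/(n-1)), which tends to 1 as n grows. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (5, 20, 100, 10_000):
    x = rng.normal(size=n)
    s_n = x.std()          # divide by n
    s_n1 = x.std(ddof=1)   # divide by n - 1
    print(f"n={n:>6}: relative difference {(s_n1 - s_n) / s_n1:.3%}")
```

At n=5 the two results differ by roughly 10%; at n=10000 the difference is on the order of 0.005%.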
Re: [Numpy-discussion] non-standard standard deviation
Yogesh,

Could you explain the rationale for this choice please?

Colin W.

On 03-Dec-09 00:35 AM, yogesh karpate wrote:
> The thing is that the normalization by (n-1) is done for the no. of
> samples >20 or 23 (not sure about this no., but sure about the thing that
> this no. isn't greater than 25) and below that we use normalization by n.
>
> Regards
> ~ymk
Re: [Numpy-discussion] non-standard standard deviation
The thing is that the normalization by (n-1) is done for the no. of samples
>20 or 23 (not sure about this no., but sure about the thing that this no.
isn't greater than 25) and below that we use normalization by n.

Regards
~ymk
Re: [Numpy-discussion] non-standard standard deviation
Colin J. Williams wrote:
> Where the distribution of a variate is not known a priori, then I believe
> that it can be shown that the n-1 divisor provides the best estimate of
> the variance.

Have you ever been shooting with a rifle? What would you rather do:

- Hit 9 or 10, with a bias to the right.
- Hit 7 or better, with no bias.

Do you think it can be shown that the latter option is the better? No?

Sturla Molden
Re: [Numpy-discussion] non-standard standard deviation
2009/11/29 Dr. Phillip M. Feldman pfeld...@verizon.net:
> All of the statistical packages that I am currently using and have used in
> the past (Matlab, Minitab, R, S-plus) calculate standard deviation using
> the sqrt(1/(n-1)) normalization, which gives a result that is unbiased
> when sampling from a normally-distributed population. NumPy uses the
> sqrt(1/n) normalization. I'm currently using the following code to
> calculate standard deviations, but would much prefer if this could be
> fixed in NumPy itself:
>
> def mystd(x=numpy.array([]), axis=None):
>     """This function calculates the standard deviation of the input using
>     the definition of standard deviation that gives an unbiased result for
>     samples from a normally-distributed population."""
>
> --
> View this message in context:
> http://old.nabble.com/non-standard-standard-deviation-tp26566808p26566808.html
> Sent from the Numpy-discussion mailing list archive at Nabble.com.

This issue was the subject of lengthy discussions on the mailing list, the
upshot of which is that in current versions of scipy, std and var take an
optional argument ddof, into which you can supply 1 to get the
normalization you want.

Anne
Re: [Numpy-discussion] non-standard standard deviation
On 29-Nov-09 17:13, Dr. Phillip M. Feldman wrote:
> All of the statistical packages that I am currently using and have used in
> the past (Matlab, Minitab, R, S-plus) calculate standard deviation using
> the sqrt(1/(n-1)) normalization, which gives a result that is unbiased
> when sampling from a normally-distributed population. NumPy uses the
> sqrt(1/n) normalization. I'm currently using the following code to
> calculate standard deviations, but would much prefer if this could be
> fixed in NumPy itself:
>
> def mystd(x=numpy.array([]), axis=None):
>     """This function calculates the standard deviation of the input using
>     the definition of standard deviation that gives an unbiased result for
>     samples from a normally-distributed population."""
>     xd = x - x.mean(axis=axis)
>     return sqrt((xd*xd).sum(axis=axis) / (numpy.size(x, axis=axis) - 1.0))

Anne Archibald has suggested a work-around. Perhaps ddof could be set, by
default, to 1, as other values are rarely required.

Where the distribution of a variate is not known a priori, then I believe
that it can be shown that the n-1 divisor provides the best estimate of the
variance.

Colin W.
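For what it's worth, the mystd function quoted above computes the same value as numpy's built-in ddof option; a quick check (the array values are illustrative):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# mystd's calculation, inlined
xd = x - x.mean()
manual = np.sqrt((xd * xd).sum() / (x.size - 1.0))

print(np.isclose(manual, np.std(x, ddof=1)))  # True
```

So the work-around and np.std(x, ddof=1) are interchangeable, which is presumably why Anne suggested the latter.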
Re: [Numpy-discussion] non-standard standard deviation
On Mon, Nov 30, 2009 at 12:30 AM, Colin J. Williams c...@ncf.ca wrote:
> [snip]
>
> Anne Archibald has suggested a work-around. Perhaps ddof could be set, by
> default, to 1, as other values are rarely required.
>
> Where the distribution of a variate is not known a priori, then I believe
> that it can be shown that the n-1 divisor provides the best estimate of
> the variance.

There have been previous discussions on this (but I can't find them now)
and I believe the current default was chosen deliberately. I think it is
the view of the numpy developers that the n divisor has more desirable
properties in most cases than the traditional n-1 -- see this paper by
Travis Oliphant for details: http://hdl.handle.net/1877/438

Cheers

Robin