On 29-Mar-04 [EMAIL PROTECTED] wrote:
> Hello,
>
> My data is discrete, taking values between around -5 and +5.
> I cannot give bounds for the values, so I consider it
> numerical data rather than categorical data.
>
> The histogram has a 'normal' shape, so I test for normality
> via a chi-squared statistic (calculating the expected values
> by hand).
>
> When I use the sample mean and variance, the normality
> hypothesis has to be rejected. But when I test with the sample
> mean plus a small epsilon, I get very high p-values.
>
> I am not sure if this right shift is a good idea.
> Any suggestions?
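[The procedure the poster describes can be sketched as follows. This is a hypothetical illustration, not code from the thread: the data are simulated (the poster's data are not available), and Python with NumPy/SciPy is used since no R code was posted. Expected counts come from integrating a fitted normal over unit-width bins, and the degrees of freedom are reduced by the two estimated parameters.]

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the poster's data: an underlying normal
# sample recorded only to the nearest integer, giving discrete values
# roughly between -5 and +5.
rng = np.random.default_rng(0)
data = np.round(rng.normal(loc=0.0, scale=1.5, size=2000)).astype(int)

# Estimate parameters from the sample, as the poster did.
mu, sigma = data.mean(), data.std(ddof=1)

# Observed counts for each distinct integer value.
values, observed = np.unique(data, return_counts=True)

# Expected counts "by hand": integrate the fitted normal density over
# each unit-width bin [k - 0.5, k + 0.5).
upper = stats.norm.cdf(values + 0.5, mu, sigma)
lower = stats.norm.cdf(values - 0.5, mu, sigma)
expected = len(data) * (upper - lower)

chi2 = np.sum((observed - expected) ** 2 / expected)
# Degrees of freedom: number of bins, minus 1, minus 2 estimated
# parameters (mu and sigma).
df = len(values) - 1 - 2
p_value = stats.chi2.sf(chi2, df)
print(chi2, df, p_value)
```

[In practice one would also merge bins whose expected counts are very small before computing the statistic; the sketch omits that step.]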
I suspect that what you are seeing corresponds to the following.
Because your data are discrete, you are treating them as "binned"
values of an underlying continuous distribution when you approach the
goodness of fit using a chi-squared measure. At the same time, because
you are using the sample mean and variance to estimate the parameters
of this distribution, you are behaving as though the discrete values
were the exact values of the continuous variable.

To be consistent, if you treat the observed values as "binned", your
estimates of the mean and variance of the underlying normal
distribution should take account of the grouping. There are two main
approaches you could adopt here.

1. Minimum chi-squared: the chi-squared value is the sum of
   (O - E)^2 / E, where each expected count E is calculated as
   n * (integral of the fitted normal density over the bin's range).
   Minimise this with respect to mu and sigma^2.

2. Maximum likelihood: the likelihood is the product over bins of
   P^r, where P is the integral of the density over the bin's range
   and r is the observed count in that bin. Maximise this with
   respect to mu and sigma^2.

Neither approach will return exactly the sample mean and sample
variance as estimates of the mean and variance of the underlying
distribution. Therefore, if your data do in fact correspond very
closely to binned values from a normal distribution, the fit you get
by using the sample mean and sample variance will not be the best
fit, and (if you have enough data) the discrepancy may well be big
enough to give a significantly large chi-squared.

It could be, as you appear to have observed, that simply shifting the
sample mean gives you a fit closer to the one you would get from (1)
or (2) (though I would also have expected the fit to improve if you
slightly reduced sigma^2 as well).

There is a nice paper from quite long ago by Dennis Lindley which
discusses very closely related issues:

Lindley, D.V. (1950). Grouping corrections and maximum likelihood
equations.
Proceedings of the Cambridge Philosophical Society, 46, 106-110.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 167 1972
Date: 30-Mar-04  Time: 00:27:58
------------------------------ XFMail ------------------------------

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
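[The two grouped-estimation approaches described in the reply can be sketched as follows. This is a hedged illustration, not part of the original thread: the data are simulated, Python with NumPy/SciPy stands in for R, and Nelder-Mead is one reasonable choice of optimiser, not the only one.]

```python
import numpy as np
from scipy import stats, optimize

# Hypothetical stand-in for the poster's data: an underlying normal
# sample recorded only to the nearest integer.
rng = np.random.default_rng(1)
data = np.round(rng.normal(loc=0.0, scale=1.5, size=2000)).astype(int)
n = len(data)
values, observed = np.unique(data, return_counts=True)

def bin_probs(mu, sigma):
    # P(recorded value = k) under the binned-normal model: the
    # integral of the normal density over the bin [k - 0.5, k + 0.5).
    return (stats.norm.cdf(values + 0.5, mu, sigma)
            - stats.norm.cdf(values - 0.5, mu, sigma))

def unpack(theta):
    # Parametrise sigma on the log scale so the optimiser cannot
    # wander into sigma <= 0.
    return theta[0], np.exp(theta[1])

# Approach (1), minimum chi-squared: minimise sum (O - E)^2 / E,
# with E = n * (integral over the bin), over mu and sigma.
def chi2_stat(theta):
    mu, sigma = unpack(theta)
    e = n * bin_probs(mu, sigma)
    return np.sum((observed - e) ** 2 / e)

# Approach (2), grouped maximum likelihood: the likelihood is
# prod P^r, so minimise the negative log-likelihood -sum r * log P.
def neg_loglik(theta):
    mu, sigma = unpack(theta)
    return -np.sum(observed * np.log(bin_probs(mu, sigma)))

# Start from the sample estimates that the naive test used.
start = np.array([data.mean(), np.log(data.std(ddof=1))])
fit1 = optimize.minimize(chi2_stat, start, method="Nelder-Mead")
fit2 = optimize.minimize(neg_loglik, start, method="Nelder-Mead")

print("sample mean/sd:     ", data.mean(), data.std(ddof=1))
print("min-chi2 mu/sigma:  ", *unpack(fit1.x))
print("grouped ML mu/sigma:", *unpack(fit2.x))
```

[With unit-width bins the grouped estimate of sigma^2 typically comes out a little below the raw sample variance, in the spirit of Sheppard's grouping correction, which is the territory the cited Lindley (1950) paper covers.]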