On 22/02/14 11:04, Rui Barradas wrote:
Hello,

Not directly answering your question, but if the sample size limit is a
documented restriction of shapiro.test and you want a normality test, why
not use ks.test (see ?ks.test)?

m <- mean(HP_TrinityK25$V2)
s <- sd(HP_TrinityK25$V2)

ks.test(HP_TrinityK25$V2, "pnorm", m, s)

Strictly speaking this is not a valid test. The KS test is used for testing against a *completely specified* distribution. If there are parameters to be estimated, the null distribution is no longer applicable. This may not be a "real" problem if the parameters are *well* estimated, as they would be in this instance (given that the sample size is over-large). I'm not sure about this.

The "Lilliefors" test is theoretically available in this context when mu and sigma are estimated, but according to the Wikipedia article, the Lilliefors distribution is not known analytically and the critical values must be determined by Monte Carlo methods. There is a "LillieTest" function in the "DescTools" package which makes use of some approximations to get p-values.
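
To illustrate the Monte Carlo idea, here is a minimal sketch in base R. It is not the method used by LillieTest; it simply simulates the null distribution of the KS distance when the mean and sd are re-estimated for each sample. The simulated vector "x", the helper "ks_stat" and the number of replicates "B" are assumptions made for illustration; in practice "x" would be HP_TrinityK25$V2.

```r
set.seed(1)
x <- rnorm(2000, mean = 10, sd = 2)  # stand-in data (assumption), replace with your own

# KS distance between a sample and the normal fitted to that same sample
# (hypothetical helper, not part of any package)
ks_stat <- function(v) {
  unname(ks.test(v, "pnorm", mean(v), sd(v))$statistic)
}

d_obs <- ks_stat(x)

# Simulate the null distribution: the statistic is location/scale invariant
# once mean and sd are estimated, so standard normal samples suffice.
B <- 500
d_null <- replicate(B, ks_stat(rnorm(length(x))))

# Monte Carlo p-value (with the usual +1 correction)
p_mc <- (1 + sum(d_null >= d_obs)) / (B + 1)
p_mc
```

The p-value printed by ks.test itself is ignored here; only the statistic is used, and its reference distribution comes from the simulation.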

However I think that a better approach would be to use a chi-squared goodness of fit test whereby you can adjust for estimated parameters simply by reducing the degrees of freedom. I believe that the chi-squared test is somewhat low in power, but with a very large sample this should not be a problem.

The difficulty with the chi-squared test is that the choice of "bins" is somewhat arbitrary. I believe the best approach is to take the bin boundaries to be the quantiles of the normal distribution (with parameters "m" and "s") corresponding to equispaced probabilities on [0,1], with the number of such probabilities being k+1 where k = floor(n/5), n being the sample size. This makes the expected counts all equal to n/k >= 5 so that the chi-squared test is "valid". The degrees of freedom are then k-3 (k - 1 - #estimated parameters).
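
The recipe above could be sketched in base R roughly as follows. This is only a sketch under stated assumptions: the simulated vector "x" stands in for HP_TrinityK25$V2, and the variable names are made up for illustration.

```r
set.seed(42)
x <- rnorm(5000, mean = 10, sd = 2)  # stand-in data (assumption), replace with your own

n <- length(x)
m <- mean(x)
s <- sd(x)

k      <- floor(n / 5)                    # number of bins
probs  <- seq(0, 1, length.out = k + 1)   # k + 1 equispaced probabilities on [0, 1]
breaks <- qnorm(probs, mean = m, sd = s)  # bin boundaries; first is -Inf, last is Inf

# Observed counts per bin; zero-count bins are kept by tabulate()
observed <- tabulate(findInterval(x, breaks), nbins = k)
expected <- n / k                         # equal expected count in every bin, >= 5

stat    <- sum((observed - expected)^2 / expected)
df      <- k - 3                          # k - 1 - 2 estimated parameters
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

c(statistic = stat, df = df, p.value = p_value)
```

Using findInterval() plus tabulate() rather than cut() avoids building thousands of factor labels when k is large.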

One last comment: I believe that it is generally considered that testing for normality is a waste of time and a pseudo-intellectual exercise of academic interest at best.

cheers,

Rolf Turner



Hope this helps,

Rui Barradas

Em 21-02-2014 15:59, Gonzalo Villarino Pizarro escreveu:
Dear R users,
Please help with this maybe basic question. I am trying to check whether my
data are normally distributed, but the file is large and the test does not work.
I keep getting the message: "Error in shapiro.test(x = HP_TrinityK25$V2) :
sample size must be between 3 and 5000"
thanks!

  shapiro.test(x=HP_TrinityK25$V2)
Error in shapiro.test(x = HP_TrinityK25$V2) :
  sample size must be between 3 and 5000

##Note:
HP_TrinityK25= my file
HP_TrinityK25$V2= data in my file


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

