Dear All,

One week ago I posted a question about large n and the normal distribution, and have got several good replies from Isobel Clark, Ned Levine, Ruben Roa Ureta, Thies Dose, Chris Hlavka, Donald Myers and Jeffrey Blume. Jeffrey is perhaps not on the list, but I assume he has no objection if I copy his message to the list.

Generally speaking, when n is very large, e.g., n > 1,000, which is very common in geochemistry nowadays, statistical (goodness-of-fit) tests become too powerful, and the p-values are less informative. Therefore, users need to be very careful in using these tests with a large n. Suggestions to solve this problem include:

(1) Use graphical methods;
(2) Develop methods which are suitable for large n;
(3) Use methods which are not sensitive to n.

Well, the solutions may not be very satisfactory, but I do hope statisticians pay more attention to large n, as they have been paying most of their attention to small samples. More personal discussions are welcome. If you need some data sets to play with, please feel free to get in touch with me.

Please find below the original question and the replies. I would like to express my sincere thanks to all those who replied to me (I hope nobody is missing from the above list).

Cheers,
Chaosheng
--------------------------------------------------------------------------
Dr. Chaosheng Zhang
Lecturer in GIS
Department of Geography
National University of Ireland, Galway
IRELAND
Tel: +353-91-524411 x 2375
Fax: +353-91-525700
E-mail: [EMAIL PROTECTED]
Web 1: www.nuigalway.ie/geography/zhang.html
Web 2: www.nuigalway.ie/geography/gis/index.htm
----------------------------------------------------------------------------
----- Original Message -----
> Dear list,
>
> I'm wondering if anyone out there has the experience of dealing with the
> probability distribution of data sets of a large sample size, e.g.,
> n > 10,000.
> I am studying the probability features of chemical element concentrations
> in a USGS sediment database with about 50,000 samples, and have found that
> it is virtually impossible for any real data set to pass tests for
> normality, as the tests become too powerful with increasing sample size.
> It is widely observed that geochemical data do not follow a normal or even
> a lognormal distribution. However, I feel that the large sample size is
> also making trouble.
>
> I am looking for references on this topic. Any references or comments are
> welcome.
>
> Cheers,
>
> Chaosheng
-----------------------
Chaosheng

Your problem may be 'non-stationarity' rather than the large sample size. If you have so many samples, you are probably sampling more than one 'population'. We have had success in fitting lognormals to mining data sets of up to half a million samples, where these are all within the same geological environment and primary mineralisation. We have also had a lot of success with reasonably large data sets (up to 100,000) in fitting mixtures of two, three or four lognormals (or normals) to characterise different populations. See, for example, the paper given at the Australian Mining Geology conference in 1993 on my page at http://drisobelclark.ontheweb.com/resume/Publications.html

Isobel
http://ecosse.ontheweb.com
------------------
Chaosheng,

Can't you do a Monte Carlo simulation for the distribution? In S-Plus, you can create confidence intervals from an MC simulation with a sample size as large as you have. That is, you draw 50,000 or so points from a normal distribution and calculate the empirical distribution. You then re-run this a number of times (e.g., 1,000) to establish approximate confidence intervals. You can then see what proportion of your data points fall outside the approximate confidence intervals; you would expect no more than 5% or so of the data points to fall outside the intervals if your distribution is normal.
If more than 5% fall outside, then you really don't have a normal distribution. (Since a normal distribution is essentially a random distribution, I would doubt that any real data set would be truly normal - the sampling distribution is another issue.) Anyway, just some thoughts. Hope everything is well with you.

Regards,
Ned
---------------
I presume your null hypothesis is that the data come from the given distribution, as is usual in goodness-of-fit tests. If that is the case, your sample size will almost surely lead to rejection. The well-known logical inconsistencies of the standard hypothesis test based on the p-value are magnified under large n. You have at least these options:

1) Find some authority that says that for large sample sizes the p-value is less informative, e.g. Lindley and Scott. 1984. New Cambridge Elementary Statistical Tables. Cambridge Univ Press; then you can throw away your goodness-of-fit test. But be warned that equally important authorities have said exactly the contrary, that the force of the p-value is stronger for large sample sizes (Peto et al. 1976. British Medical Journal 34:585-612). To make matters even worse, other equally important authorities have said that the sample size doesn't matter (Cornfield 1966, American Statistician 29:18-23).

2) Do a more reasonable analysis than the standard goodness-of-fit test. I suggest you plot the likelihood function under normal and lognormal models and derive the probabilistic features of your data by direct inspection of the function. You can also test for different location or scale parameters using the likelihood ratio (its direct value, not its derived asymptotic distribution in the sample space) for any two well-defined hypotheses.

Ruben
--------------
Dear Chaosheng,

This will not answer your question directly, but I hope that it will be helpful anyway:

1.)
Independence of values

I am not quite sure whether tests for normality (chi-square, Shapiro-Wilk, Kolmogorov-Smirnov) require independence of the samples, but I have a strong feeling that they do. Most likely your data samples are not statistically independent of each other, because if they were, you could save your time on the spatial analysis and work with the global mean or a transformed random number generator as local estimator instead. So in general this kind of test might not be appropriate. In addition, if your data set contains clustered data, the clustering will certainly lead to biased results, and any results from statistical tests would be quite doubtful.

2.) Rank transform

I would try to do a spatial analysis on the rank transform of your variables, provided you can deal with the ties in the data set. For such a large number of samples, this will probably provide a robust approach. In addition, a multigaussian approach has been discussed widely, and could be a useful alternative.

Happy evaluations,
Thies
---------------------
Chaosheng -

Other approaches to your problem are:
- Randomly select a few smaller samples and apply the goodness-of-fit test.
- Test fit to normal and lognormal distributions with probability plots.
-- Chris
-------------------
A couple of observations about your question/problem:

1. Almost any statistical test will have an underlying assumption of random sampling (or perhaps a modification of random sampling, such as stratified sampling). It is very unlikely that the data will have been generated in that way (random sampling in this context refers to sampling from the "distribution", not to sampling from a region or space). Generally speaking, random site selection for sampling is not the same thing as random sampling from the distribution. It is highly unlikely that you can really use statistical tests with your data because the underlying assumptions are not satisfied.
They may give useful information to look at, but don't take them as really hard evidence.

2. As a further point, the sampling in this case is obviously "without replacement", i.e., you can't generate two samples from the (exact) same location. For smaller sample sizes the difference between "with replacement" and "without replacement" is probably negligible, but not for larger sample sizes. You may be seeing this. Suppose that the "population" size is M (M very large). Random sampling WITH replacement means that each possible value will be chosen with probability 1/M; for a sample of size n, the probability is then this raised to the power n. If the sampling is WITHOUT replacement, then each sample of size n has a probability of 1/[M!/(n!(M-n)!)]. For M = 1000 and n = 5 the numerical difference between these two probabilities is very small, but if n > 50 (as an example) then the difference is significant.

3. Finally, what is the "support" of the samples? Generally speaking, the probability distribution changes as the support changes. (In the Geography literature this is referred to as the "Modifiable Areal Unit Problem".) I don't remember having seen this discussed, but you might want to look at the literature pertaining to Pierre Gy's work on sampling (in fact there is to be, or was, a conference somewhere in Scandinavia recently on his work).

Donald Myers
http://www.u.arizona.edu/~donaldm
-------------------
Chaosheng,

Probably the best approach is to take a different tack and try estimating an important quantity rather than testing to see if the normal distribution fits your data. With such a large sample size almost any goodness-of-fit test will reject. Also, as long as the distributions are symmetric, you can assume normality without losing too much (even if the test rejects normality).
I'm not sure the articles will help you in this matter, because they are more concerned with demonstrating that two equal p-values do not represent the same amount of evidence unless the sample sizes are equal. Which sample has the stronger evidence is still debatable (as you'll see). You might try an altogether different approach: look at the likelihood function. I have attached a tutorial that explains how to do this.

Good luck.
Jeffrey
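[Editor's note] The recurring point in the replies, that goodness-of-fit tests become overwhelmingly powerful as n grows, is easy to demonstrate numerically. The sketch below is my own illustration in Python with NumPy/SciPy (not code from any respondent; the mixture distribution, the 5% contamination fraction and the seed are arbitrary assumptions). It applies a one-sample Kolmogorov-Smirnov normality test to the same mildly non-normal data at n = 100 and n = 50,000:

```python
# Sketch: the same mild departure from normality, tested at two sample sizes.
# A 95% N(0,1) / 5% N(0,3) mixture stands in for "real" slightly heavy-tailed
# data; all specific choices here are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def ks_pvalue(n, contamination=0.05):
    """Draw n points, (1 - contamination) from N(0,1) and the rest from
    N(0,3), standardise, and KS-test against the standard normal."""
    k = int(n * contamination)
    x = np.concatenate([rng.normal(0.0, 1.0, n - k),
                        rng.normal(0.0, 3.0, k)])
    x = (x - x.mean()) / x.std()  # fit mean and sd from the data themselves
    # Note: estimating the parameters from the data makes these p-values only
    # approximate (the Lilliefors issue), but the qualitative point holds.
    return stats.kstest(x, 'norm').pvalue

print(ks_pvalue(100))      # small n: the departure usually goes undetected
print(ks_pvalue(50_000))   # large n: the same departure is decisively rejected
```

At n = 50,000 even this practically unimportant departure drives the p-value to essentially zero, so a rejection says more about the sample size than about the data - which is exactly why the graphical and likelihood-based alternatives suggested above are attractive at geochemical sample sizes.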