Dear All,

One week ago I posted a question about large n and the normal distribution, and have received several good replies from Isobel Clark, Ned Levine, Ruben Roa Ureta, Thies Dose, Chris Hlavka, Donald Myers and Jeffrey Blume. Jeffrey is perhaps not on the list, but I assume he has no objections if I copy his message to the list.

Generally speaking, when n is too large, e.g., n>1,000, which is very common in geochemistry nowadays, statistical (goodness-of-fit) tests become too powerful, and the p-values are less informative. Therefore, users need to be very careful when using these tests with a large n. Suggestions to solve this problem include: (1) To use graphical methods; (2) To develop methods which are suitable for large n; (3) To use methods which are not sensitive to n.
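To make the point about large n concrete, here is a minimal sketch (an arbitrary choice on my part; it assumes Python with numpy and scipy available): data that are only slightly non-normal pass a normality test at n = 100 but are firmly rejected at n = 100,000, purely because of the sample size.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

for n in (100, 1_000, 10_000, 100_000):
    # Student's t with 20 degrees of freedom: very close to normal,
    # but not exactly normal (slightly heavier tails).
    x = rng.standard_t(df=20, size=n)
    stat, p = stats.normaltest(x)   # D'Agostino-Pearson test of normality
    print(f"n = {n:>7,}: p-value = {p:.4g}")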

Well, the solutions may not be very satisfactory, but I do hope statisticians pay more attention to large n, as they have been paying too much attention to small samples. More personal discussions are welcome. If you need some data sets to play with, please feel free to get in touch with me.

Please find below the original question and the replies. I would like to express my sincere thanks to all those who replied to me (I hope nobody is missing from the above list).

Cheers,

Chaosheng
--------------------------------------------------------------------------
Dr. Chaosheng Zhang
Lecturer in GIS
Department of Geography
National University of Ireland, Galway
IRELAND
Tel: +353-91-524411 x 2375
Fax: +353-91-525700
E-mail: [EMAIL PROTECTED]
Web 1: www.nuigalway.ie/geography/zhang.html
Web 2: www.nuigalway.ie/geography/gis/index.htm
----------------------------------------------------------------------------

----- Original Message -----

> Dear list,
>
> I'm wondering if anyone out there has experience dealing with the
> probability distribution of data sets of a large sample size, e.g.,
> n>10,000. I am studying the probability distribution of chemical element
> concentrations in a USGS sediment database with a sample size of around
> 50,000, and have found that it is virtually impossible for any real data set
> to pass tests for normality, as the tests become too powerful with the
> increase of sample size. It is widely observed that geochemical data do not
> follow a normal or even a lognormal distribution. However, I feel that the
> large sample size is also causing trouble.
>
> I am looking for references on this topic. Any references or comments are
> welcome.
>
> Cheers,
>
> Chaosheng

-----------------------
Chaosheng

Your problem may be 'non-stationarity' rather than the
large sample size. If you have so many samples, you
are probably sampling more than one 'population'.

We have had success in fitting lognormals to mining
data sets of up to half a million samples, where these are all
within the same geological environment and primary
mineralisation.

We have also had a lot of success with reasonably large
data sets (up to 100,000) in fitting mixtures of
two, three or four lognormals (or Normals) to
characterise different populations. See, for example,
the paper given at the Australian Mining Geology
conference in 1993 on my page at
http://drisobelclark.ontheweb.com/resume/Publications.html

Isobel
http://ecosse.ontheweb.com
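For illustration, a minimal sketch of the mixture-of-lognormals idea, assuming scikit-learn is available (this is only an illustration, not the method of the cited paper): fit a Gaussian mixture to the log-transformed concentrations, so each component corresponds to a lognormal population.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_lognormal_mixture(values, n_components=2, seed=0):
    """Fit a mixture of lognormals to strictly positive data."""
    logs = np.log(values[values > 0]).reshape(-1, 1)
    gm = GaussianMixture(n_components=n_components, random_state=seed)
    gm.fit(logs)
    # Each component is a lognormal with these log-scale parameters.
    return gm.weights_, gm.means_.ravel(), np.sqrt(gm.covariances_.ravel())

# Synthetic example: two lognormal "populations" mixed together.
rng = np.random.default_rng(1)
data = np.concatenate([rng.lognormal(0.0, 0.5, 30_000),
                       rng.lognormal(2.0, 0.3, 20_000)])
weights, mus, sigmas = fit_lognormal_mixture(data, n_components=2)
print(weights, mus, sigmas)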

------------------
Chaosheng,

        Can't you do a Monte Carlo simulation for the distribution?  In S-Plus, you can create confidence intervals from a MC simulation with a sample size as large as you have.  That is, you draw 50,000 or so points from a normal distribution and calculate its distribution.  You then re-run this a number of times (e.g., 1000) to establish approximate confidence intervals.  You can then see what proportion of your data points fall outside the approximate confidence intervals; you would expect no more than 5% or so of the data points to fall outside the intervals if your distribution is normal.  If more than 5% fall outside, then you really don't have a normal distribution (since a normal distribution is essentially a random distribution, I would doubt that any real data set would be truly normal - the sampling distribution is another issue).

        Anyway, just some thoughts.  Hope everything is well with you.

Regards,

Ned
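A minimal sketch of this Monte Carlo envelope idea, assuming numpy and using a smaller n than 50,000 so it runs quickly: it builds pointwise envelopes for the sorted, standardized values and reports the proportion falling outside.

import numpy as np

def mc_normal_envelope(x, n_sims=1000, alpha=0.05, seed=0):
    """Proportion of sorted, standardized data outside a Monte Carlo
    envelope built from repeated draws of the same size from N(0, 1)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    z = np.sort((x - x.mean()) / x.std(ddof=1))
    sims = np.sort(rng.standard_normal((n_sims, n)), axis=1)
    lower = np.percentile(sims, 100 * alpha / 2, axis=0)
    upper = np.percentile(sims, 100 * (1 - alpha / 2), axis=0)
    return np.mean((z < lower) | (z > upper))

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=2.0, size=5_000)   # 5,000 instead of 50,000 for speed
print("proportion outside envelope:", mc_normal_envelope(x))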
---------------
I presume your null hypothesis is that the data come from the given
distribution, as is usual in goodness-of-fit tests. If such is the case,
your sample size will almost surely lead to rejection. The well-known
logical inconsistencies of the standard hypothesis test based on the
p-value are magnified under large n.
You have these options at least:
1) Find some authority that says that for large sample sizes the p-value
is less informative; e.g. Lindley and Scott. 1984. New Cambridge
Elementary Statistical Tables. Cambridge Univ Press; and then you can
throw away your goodness-of-fit test. But be warned that equally important
authorities have said exactly the contrary, that the force of the
p-value is stronger for large sample sizes (Peto et al. 1976. British
Medical Journal 34:585-612). To make matters even worse, other
equally important authorities have said that the sample size doesn't
matter (Cornfield 1966, American Statistician 29:18-23).
2) Do a more reasonable analysis than the standard goodness-of-fit test.
I suggest you plot the likelihood function under the normal and lognormal
models and derive the probabilistic features of your data by direct
inspection of the function. You can also test for different location or
scale parameters using the likelihood ratio (its direct value, not its
derived asymptotic distribution in the sample space) for any two
well-defined hypotheses.
Ruben
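A minimal sketch of the direct likelihood comparison in option 2, assuming numpy/scipy: it computes the maximized log-likelihood under normal and lognormal models and their log likelihood ratio (plotting the likelihood over a grid of parameter values would follow the same pattern).

import numpy as np
from scipy import stats

def compare_normal_lognormal(x):
    """Maximized log-likelihoods under normal and lognormal models,
    and the log of their likelihood ratio."""
    x = x[x > 0]                               # the lognormal needs positive values
    mu, sd = x.mean(), x.std(ddof=0)           # ML estimates, normal model
    ll_norm = stats.norm.logpdf(x, loc=mu, scale=sd).sum()
    lmu, lsd = np.log(x).mean(), np.log(x).std(ddof=0)   # ML estimates, lognormal model
    ll_lognorm = stats.lognorm.logpdf(x, s=lsd, scale=np.exp(lmu)).sum()
    return ll_norm, ll_lognorm, ll_lognorm - ll_norm

rng = np.random.default_rng(3)
x = rng.lognormal(mean=1.0, sigma=0.8, size=50_000)
print(compare_normal_lognormal(x))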

--------------
Dear Chaosheng,

This will not answer your question directly, but I hope it will be
helpful anyway:

1.) Independence of values
I am not quite sure whether tests for normality (chi-square, Shapiro-Wilk,
Kolmogorov-Smirnov) require independence of the samples, but I have a strong
feeling that they do. Most likely your data samples are not statistically
independent of each other, because if they were, you could save
your time on the spatial analysis and work with the global mean or a
transformed random number generator as a local estimator instead. So in
general this kind of test might not be appropriate.
In addition, in the case of clustered data in your data set, the clustering will
certainly lead to biased results, and any results from statistical tests
would be quite doubtful.

2.) Rank transform
I would try to do a spatial analysis on the rank transform of your
variables, provided you can deal with the ties in the data set. For
such a large number of samples, this will probably provide a robust approach.
In addition, a multigaussian approach has been discussed widely, and could
be a useful alternative.

Happy evaluations,
Thies
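A minimal sketch of the rank and normal-score transform mentioned in point 2.), assuming scipy is available (the back-transform and the spatial analysis itself are left out).

import numpy as np
from scipy import stats

def normal_score_transform(x):
    """Rank transform (ties averaged) followed by a normal-score transform,
    as used in a multigaussian approach."""
    ranks = stats.rankdata(x, method="average")       # handles ties
    u = (ranks - 0.5) / len(x)                        # empirical quantiles in (0, 1)
    return ranks, stats.norm.ppf(u)                   # normal scores

rng = np.random.default_rng(4)
x = rng.lognormal(1.0, 0.7, 10_000)
ranks, scores = normal_score_transform(x)
print(scores.mean(), scores.std())                    # roughly 0 and 1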

---------------------
Chaosheng - Other approaches to your problem are:
- Randomly select a few smaller samples and apply the goodness-of-fit test.
- Test fit to normal and lognormal distributions with probability plots.
-- Chris
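A minimal sketch of both suggestions, assuming numpy/scipy and a synthetic stand-in for the real data set.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.lognormal(mean=1.0, sigma=0.6, size=50_000)   # stand-in for the real data

# Suggestion 1: goodness-of-fit tests on a few random subsamples of modest size.
for i in range(3):
    sub = rng.choice(data, size=500, replace=False)
    w_raw, p_raw = stats.shapiro(sub)           # test of normality
    w_log, p_log = stats.shapiro(np.log(sub))   # test of lognormality via the log
    print(f"subsample {i}: p(normal) = {p_raw:.3g}, p(lognormal) = {p_log:.3g}")

# Suggestion 2: probability (Q-Q) plots; probplot returns the plotting positions,
# which can be drawn with any plotting library.
(osm, osr), _ = stats.probplot(data, dist="norm")
(osm_log, osr_log), _ = stats.probplot(np.log(data), dist="norm")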

-------------------
A couple of observations about your question/problem:

1. Almost any statistical test will have an underlying assumption of
random sampling (or perhaps a modification of random sampling such as
stratified sampling). It is very unlikely that the data will have been generated
in that way (random sampling in this context refers to sampling from the
"distribution" and not to sampling from a region or space). Generally
speaking, random site selection for sampling is not the same thing as
random sampling from the distribution. It is highly unlikely that you
can really use statistical tests with your data because the underlying
assumptions are not satisfied. The results may be useful information to look at,
but don't take them as really hard evidence.

2. As a further point, the sampling in this case is obviously "without
replacement", i.e., you can't generate two samples from the (exact)
same location. For smaller sample sizes the difference between "with
replacement" and "without replacement" is probably negligible, but not
for larger sample sizes. You may be seeing this.

            Suppose that the "population" size is M (M very large).
Random sampling WITH replacement means that each possible value will be
chosen with probability 1/M; for a sample of size n, the probability is
this raised to the power n. If the sampling is WITHOUT replacement,
then each sample of size n has a probability of 1/[M!/(n!(M-n)!)].
For M = 1000 and n = 5 the numerical difference between these two
probabilities is very, very small, but if n > 50 (as an example) then
the difference is significant (a small numerical check is given below).

3. Finally, what is the "support" of the samples? Generally speaking
the probability distribution changes as the support changes. (In the
Geography literature this is referred to as the "Modifiable Areal Unit
Problem".)

I don't remember having seen this discussed, but you might want to look
at the literature pertaining to Pierre Gy's work on sampling (in fact
there is to be, or recently was, a conference somewhere in Scandinavia
on his work).

Donald Myers
http://www.u.arizona.edu/~donaldm
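One way to put numbers on point 2 (a sketch comparing the probability of one particular ordered sample with and without replacement, a slight variant of the formula quoted above; only the Python standard library is assumed):

import math

def prob_ratio(M, n):
    """Probability of one particular ordered sample of size n, with replacement
    (1 / M**n), divided by the same probability without replacement
    (1 / (M*(M-1)*...*(M-n+1))).  A value near 1 means the two sampling
    schemes are practically indistinguishable."""
    falling = math.prod(M - i for i in range(n))   # M*(M-1)*...*(M-n+1)
    return falling / M**n

for n in (5, 50, 100):
    print(f"M = 1000, n = {n:>3}: ratio = {prob_ratio(1000, n):.3f}")
# Roughly 0.99 for n = 5, 0.29 for n = 50 and 0.006 for n = 100:
# the with/without replacement distinction grows quickly with n.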
-------------------
Chaosheng,

Probably the best approach is to take a different tack and try estimating an important quantity rather than testing to see if the normal distribution fits your data. With such a large sample size almost any goodness-of-fit test will reject.

Also, as long as the distributions are symmetric, you can assume normality without losing too much (even if the test rejects normality). I'm not sure the articles will help you in this matter, because they are more concerned with demonstrating that two equal p-values do not represent the same amount of evidence unless the sample sizes are equal. Which sample provides the stronger evidence is still debatable (as you'll see).

You might try an altogether different approach: look at the likelihood function. I have attached a tutorial that explains how to do this.

Good Luck.
Jeffrey
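A quick sketch of checking the symmetry point above, assuming numpy/scipy: compare the mean with the median and look at the sample skewness, on both the raw and the log scale.

import numpy as np
from scipy import stats

def symmetry_summary(x):
    """Mean vs. median and sample skewness; skewness near zero and a small
    mean-median gap (relative to the spread) suggest rough symmetry."""
    return {
        "mean": float(np.mean(x)),
        "median": float(np.median(x)),
        "skewness": float(stats.skew(x)),
    }

rng = np.random.default_rng(6)
x = rng.lognormal(1.0, 0.8, 50_000)
print("raw:", symmetry_summary(x))
print("log:", symmetry_summary(np.log(x)))   # roughly symmetric after the log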
