Re: When Can We Really Use CLT & Student t

2001-11-28 Thread Jerry Dallal

Ronny Richardson wrote:
> 
> As I understand it, the Central Limit Theorem (CLT) guarantees that the
> distribution of sample means is normally distributed regardless of the
> distribution of the underlying data as long as the sample size is large
> enough and the population standard deviation is known.
> 
> It seems to me that most statistics books I see overoptimistically invoke
> the CLT not when n is over 30 and the population standard deviation is
> known but anytime n is over 30. This seems inappropriate to me, or am I
> overlooking something?
> 
> When the population standard deviation is not known (which is almost all the
> time) it seems to me that the Student t (t) distribution is more
> appropriate. However, t requires that the underlying data be normal, or at
> least not too non-normal. My expectation is that most data sets are not
> nearly "normal enough" to make using t appropriate.
> 
> So, if we do not know the population standard deviation and we cannot
> assume a normal population, what should we be doing, as opposed to just
> using the CLT as most business statistics books do?

I address some of this in my note at
http://www.tufts.edu/~gdallal/meandist.htm .
As for using s in place of sigma, a result known as Slutsky's
Theorem says it's "okay".  All of this hinges on what you mean by
"approximate" and "normal enough".  It varies from field to field
and with what is being measured.


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: When Can We Really Use CLT & Student t

2001-11-28 Thread Jerry Dallal

> "Kaplon, Howard" wrote:

> 
> What many authors do, I believe, is employ the Law of Large
> Numbers, and say that for n sufficiently large, the probability
> approaches 0 that | sigma - s | is different from 0.  That is
> sigma and s may be interchanged with "minimal" probability of any
> change.  And so the ratio  [(x-bar - mu) / s] may be interchanged
> with [(x-bar - mu) / sigma] = Z.  Thus through dual approximations
> [(x-bar - mu) / s] has an approximate Normal(0,1) distribution.

s converges to sigma in probability, so s/sigma converges to 1.
Slutsky's theorem then says that sqrt(n)Z/(s/sigma) has the same
asymptotic distribution as sqrt(n)Z, with Z = (x-bar - mu)/sigma as
in the quoted text.
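
The Slutsky step can be written out in one line. This is only a sketch: it assumes i.i.d. observations with finite variance, and T_n is a label introduced here for the studentized mean, not a symbol used in the thread.

```latex
\[
T_n \;=\; \frac{\sqrt{n}\,(\bar{x}-\mu)}{s}
    \;=\; \underbrace{\frac{\sqrt{n}\,(\bar{x}-\mu)}{\sigma}}_{\xrightarrow{d}\; N(0,1)\ \text{(CLT)}}
    \cdot
    \underbrace{\frac{\sigma}{s}}_{\xrightarrow{p}\; 1\ \text{(LLN)}}
    \;\xrightarrow{d}\; N(0,1).
\]
```

The first factor is handled by the CLT, the second by the Law of Large Numbers, and Slutsky's theorem is what licenses multiplying the two limits.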





Re: When Can We Really Use CLT & Student t

2001-11-23 Thread Herman Rubin

In article <[EMAIL PROTECTED]>,
Kaplon, Howard <[EMAIL PROTECTED]> wrote:
>It has been a long time; so if I am wrong, please fan the flames gently.

>The derivation of the t distribution is from the ratio of a Normal(0,1)
>over the square root of a ChiSquare divided by its degrees of freedom.

>   t = [(x-bar - mu) / sigma] / sqrt{[(n-1) S-squared / sigma-squared] / (n-1)}

>which simplifies to  t = [(x-bar - mu) / S]

>The CLT allows for the numerator to be approximately Normal(0,1)
>regardless of the distribution of X, but does NOT allow for the
>denominator to be approximately ChiSquare.  This is the rub in using the
>t distribution when the original distribution of X is UNknown.

>What many authors do, I believe, is employ the Law of Large Numbers, and
>say that for n sufficiently large, the probability approaches 0 that |
>sigma - s | is different from 0.  That is sigma and s may be
>interchanged with "minimal" probability of any change.  And so the ratio
>[(x-bar - mu) / s] may be interchanged with [(x-bar - mu) / sigma] = Z.
>Thus through dual approximations [(x-bar - mu) / s] has an approximate
>Normal(0,1) distribution.

Unless the precise significance level is of great importance,
and this should never be the case, it does not matter too much.
The chance that the sample standard deviation is substantially
too small, which can cause erroneous rejection, is not that
much of a problem.  On the other hand, if the sample standard
deviation is substantially too large, this can make it very
hard to reject, which is the more likely problem.
-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558





Re: When Can We Really Use CLT & Student t

2001-11-23 Thread Herman Rubin

In article <[EMAIL PROTECTED]>,
Ronny Richardson <[EMAIL PROTECTED]> wrote:
>As I understand it, the Central Limit Theorem (CLT) guarantees that the
>distribution of sample means is normally distributed regardless of the
>distribution of the underlying data as long as the sample size is large
>enough and the population standard deviation is known.

Wrong!  It states, in part, that the distribution of the
sample mean gets closer (in the sense of convergence in
distribution) to the normal distribution as the sample
size increases.  The distribution is never normal unless
the original distribution is normal.

>It seems to me that most statistics books I see overoptimistically invoke
>the CLT not when n is over 30 and the population standard deviation is
>known but anytime n is over 30. This seems inappropriate to me, or am I
>overlooking something?

You are not overlooking anything.  The rate of convergence
is usually not the greatest.

>When the population standard deviation is not known (which is almost all the
>time) it seems to me that the Student t (t) distribution is more
>appropriate. However, t requires that the underlying data be normal, or at
>least not too non-normal. My expectation is that most data sets are not
>nearly "normal enough" to make using t appropriate.

The approximation of the distribution of the t statistic
to the t distribution is similar to that of the mean to
the normal.  Symmetric (two-sided) tests hold their level
somewhat better for most distributions, and the t-test is
usually used two-sided.

>So, if we do not know the population standard deviation and we cannot
>assume a normal population, what should we be doing, as opposed to just
>using the CLT as most business statistics books do?

1.  Specify your real problem; what are the consequences
of taking wrong actions, and how important are they?

2.  If the results are not too sensitive to the cut 
point, and your problem is similar to those for which
the t-statistic or the normal distribution is usually
used, go ahead and use it.

3.  If not, consult a good mathematical statistician.
It is even possible that new recipes need to be used.


-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558





Re: When Can We Really Use CLT & Student t

2001-11-21 Thread Rich Ulrich

On 21 Nov 2001 10:18:01 -0800, [EMAIL PROTECTED] (Ronny
Richardson) wrote:

> As I understand it, the Central Limit Theorem (CLT) guarantees that the
> distribution of sample means is normally distributed regardless of the
> distribution of the underlying data as long as the sample size is large
> enough and the population standard deviation is known.
> 
> It seems to me that most statistics books I see overoptimistically invoke
> the CLT not when n is over 30 and the population standard deviation is
> known but anytime n is over 30. This seems inappropriate to me, or am I
> overlooking something?
[ snip, rest ]

It seems to me that you have doubts which *might* be justifiable.

Do you have a professor who is prone to glib generalizations?
Do you have a lousy text?

I do wonder if your textbooks actually say what you accuse them of, 
or if you are guilty of hasty overgeneralization.  I have scanned 
textbooks in search of  errors like those, but I hardly ever find any.
Gross mis-statements tend to be in  "handbooks"  and in 
(unfortunate) interpretative  articles by non-statisticians.

(Can you cite "chapter and verse"?)

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: When Can We Really Use CLT & Student t

2001-11-21 Thread Jay Warner

Ronny Richardson wrote:

> As I understand it, the Central Limit Theorem (CLT) guarantees that the
> distribution of sample means is normally distributed regardless of the
> distribution of the underlying data as long as the sample size is large
> enough and the population standard deviation is known.

The distribution of a linear contrast taken from a single population _tends_
toward a Normal distribution, regardless of the distribution of the underlying
population.

linear contrast = any given sum or difference of individual measurements.
Could be an average, could be an average minus another average, etc.

_tends_  = as the n of the contrast increases, the distribution of the
contrast approaches a Normal distribution.  It does not mathematically equal
said Normal.  It does not magically shift from non-Normal to Normal at n=30,
or n= anything.  If one could measure the 'difference' between the observed
distribution of the contrast and a true Normal, then the difference would
asymptotically approach zero as n increases.

In practice, if the original population has a saw tooth shape, an average of
12 observations will be within one line width (about 1 printer's point) of a
Normal on a 1/6 page size chart.

Also in practice, you need to ask how close to Normal a distribution you need,
before invoking traditional tests, CIs, and conclusions.  Since "nothing is
random," likewise, "nothing is Normal."  Ask for absolute rigor, and you
will wind up tossing the whole bath tub, as well as baby & bath water.  If the
objective is to help someone make a decision, for example, you may have to
help them understand the impact of their (or your!) choice of
cost/precision/rigor tradeoff.

I didn't say anything about sample size: asymptotic approach, etc.  How close
is close enough is your problem :)

I didn't say what the underlying population distribution was.  Mostly it
doesn't matter, as long as the population has a finite stdev.  A Cauchy dist.
(the one which has no mean or stdev) is the exception: averages of Cauchy
data do not approach a Normal.
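
The Cauchy case can be checked numerically. A minimal sketch in plain Python, assuming a standard Cauchy population; the helper names cauchy and iqr_of_means are illustrative, and the seed and replication count are arbitrary. For a Cauchy population the mean of n observations has the same distribution as a single observation, so the interquartile range of the sample means should stay near 2 instead of shrinking like 1/sqrt(n).

```python
import math
import random
import statistics

def cauchy(rng):
    """One standard Cauchy draw via the inverse-CDF method."""
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr_of_means(n, reps, rng):
    """Interquartile range of `reps` sample means, each from n Cauchy draws."""
    means = [statistics.fmean(cauchy(rng) for _ in range(n)) for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

rng = random.Random(12)  # arbitrary seed, only for reproducibility
for n in (1, 10, 100, 1000):
    print(f"n={n:5d}  IQR of sample means = {iqr_of_means(n, 2000, rng):.2f}")
```

Swapping cauchy for a finite-variance draw such as rng.gauss(0.0, 1.0) shows the contrast: the IQR then shrinks steadily as n grows.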

If you want honest mathematical proof, you'll have to go to the book.  Sorry
'bout that.

>
>
> It seems to me that most statistics books I see overoptimistically invoke
> the CLT not when n is over 30 and the population standard deviation is
> known but anytime n is over 30. This seems inappropriate to me, or am I
> overlooking something?
>
> When the population standard deviation is not known (which is almost all the
> time) it seems to me that the Student t (t) distribution is more
> appropriate.

True.  If the stdev is estimated from _internal_ sources (the given data),
then Student's t is applicable.

> However, t requires that the underlying data be normal, or at
> least not too non-normal. My expectation is that most data sets are not
> nearly "normal enough" to make using t appropriate.

Depends, as they say.

Actually, the t test is making conclusions about averages & groups.  Thus, it
uses the CLT to report differences in the _average_ of a sample.  Thus, CLT
covers the possible sins of original distribution.  Somewhat.  Thus, the t
dist. is particularly rugged & even 'forgiving.'  Esp. compared with a one way
AoV.  So I'm told.

If your prior knowledge (Rev. Bayes shows up again!) indicates you should not
expect a Normal dist., then perhaps you shouldn't be surprised if you find
something else.  Product life times under long tests, extreme depths of snow,
lots of things are not 'Normal.'  Numbers of arrivals per unit time at a ticket
counter, too.  That's why we check these things, true?

> So, if we do not know the population standard deviation and we cannot
> assume a normal population, what should we be doing, as opposed to just
> using the CLT as most business statistics books do?

Let's clarify a possible confusion point.  The 'business statistics books'
often say that you should use a z test when sigma is known, or when n > 30.
Then use a t test for n <= 30.

The t dist. approaches the z dist. asymptotically, not discontinuously (in a
step function).  At n=30, the two-sided 5% critical t value (29 df, about
2.045) is roughly 4% larger than the z critical value (1.960); at the one-sided
5% level the gap is about 3%.  If you don't mind an error of that size in your
critical value, then what the hey.
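
The size of that gap is easy to compute for any n. A rough sketch in plain Python that integrates the t density numerically; the helper names t_pdf, t_cdf and t_critical are made up for this illustration (they are not library functions), and the two-sided 5% level with df = n - 1 is an assumption.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t with `df` degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(p, df):
    """Quantile of t by bisection; assumes 0.5 < p < 1."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value for z
for n in (10, 30, 100):
    t = t_critical(0.975, n - 1)
    print(f"n={n:4d}  t={t:.3f}  z={z:.3f}  excess={100 * (t / z - 1):.1f}%")
```

The excess shrinks smoothly with n, which is the point: nothing special happens at 30.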

More accurate is to say:

When sigma is known from _external_ data, taken from outside the sample set,
then use z.  Ex: when the QA people say they already measured, and found sigma
= 345.

When sigma is estimated from _internal_ data, i.e., from the sample data set,
then use t.  Ex., when nobody has ever looked at average sales by district and
month before, there is no way to know what sigma is before we collect the
data.

In practice, most of the time the second situation applies.

The practice of substituting an estimated sigma for a known sigma when n > 30
seems to stem from the days when Excel didn't exist, before VisiCalc even.
Whether it is necessary today is an issue I'm working on, with the business
students I encounter.

> Ronny Richardson

Cheers,
Jay
--
Jay Warner
Principal Scientist
Warner Consulting, Inc.
 

Re: When Can We Really Use CLT & Student t

2001-11-21 Thread Gus Gassmann

Ronny Richardson wrote:

> As I understand it, the Central Limit Theorem (CLT) guarantees that the
> distribution of sample means is normally distributed regardless of the
> distribution of the underlying data as long as the sample size is large
> enough and the population standard deviation is known.

Not quite. The CLT states that the sample mean is _approximately_
normal if the sample size is large enough. This will be true regardless
of whether you know the population standard deviation or not.

> It seems to me that most statistics books I see overoptimistically invoke
> the CLT not when n is over 30 and the population standard deviation is
> known but anytime n is over 30. This seems inappropriate to me, or am I
> overlooking something?

It is indeed a common shortcut used in many introductory texts to imply
that magic happens whenever n > 30. Again, knowing the standard deviation
has nothing to do with it.

> When the population standard deviation is not known (which is almost all the
> time) it seems to me that the Student t (t) distribution is more
> appropriate. However, t requires that the underlying data be normal, or at
> least not too non-normal. My expectation is that most data sets are not
> nearly "normal enough" to make using t appropriate.

Not really. I suspect that you are muddling two very different concepts:
Applying the Central Limit Theorem and approximating the t-distribution
by a normal distribution. The latter is defensible whenever n > 30 (or
when you have 30 or more degrees of freedom), since most of the time
the underlying distribution is not normal, so you approximate anyway.
The former requires separate justification (which is often not done in
introductory texts). It is unfortunate that the same rule of thumb (n > 30)
is used for both concepts, and that explanations are often not given.

> So, if we do not know the population standard deviation and we cannot
> assume a normal population, what should we be doing, as opposed to just
> using the CLT as most business statistics books do?

That's a difficult question to answer. First, how non-normal is your
underlying population?  Standard tests of hypotheses on the mean are quite
robust with regard to normality. On the other hand, is the sample mean
really a useful measure in your context? Nonparametric methods _may_ be
called for, but it will depend on the situation.
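
One way to get a feel for "quite robust" is to simulate how often a nominal 95% t interval actually covers the true mean. The sketch below is an illustration under stated assumptions, not a general recipe: it fixes n = 30, uses the tabled critical value 2.045 for 29 df, and takes an exponential population as a stand-in for skewed data. The function name coverage and the seed are arbitrary choices.

```python
import math
import random

T_CRIT_29DF = 2.045  # tabled two-sided 5% critical value of t with 29 df

def coverage(draw, true_mean, reps=10_000, seed=1):
    """Fraction of nominal 95% t intervals (n = 30) that cover true_mean."""
    n = 30  # fixed, because T_CRIT_29DF is only valid for n - 1 = 29 df
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [draw(rng) for _ in range(n)]
        mean = sum(xs) / n
        s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
        half_width = T_CRIT_29DF * s / math.sqrt(n)
        hits += abs(mean - true_mean) <= half_width
    return hits / reps

normal_cov = coverage(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)
expo_cov = coverage(lambda rng: rng.expovariate(1.0), true_mean=1.0)
print(f"normal population:      {normal_cov:.3f}")
print(f"exponential population: {expo_cov:.3f}")
```

For the normal population the coverage should sit close to 0.95; for the skewed population it tends to fall somewhat short. Whether that shortfall matters is the field-specific judgment the thread keeps returning to.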

Hope that helps







Re: When Can We Really Use CLT & Student t

2001-11-21 Thread Vadim and Oxana Marmer

On 21 Nov 2001, Ronny Richardson wrote:

> As I understand it, the Central Limit Theorem (CLT) guarantees that the
> distribution of sample means is normally distributed regardless of the
> distribution of the underlying data as long as the sample size is large
> enough and the population standard deviation is known.

CLT does not guarantee anything. It's just an approximation that sometimes
works and sometimes does not work. The underlying distribution does
actually matter, or, more correctly, the data has to satisfy some
regularity conditions for CLT to apply. Population standard deviation does
not need to be known.


>
> It seems to me that most statistics books I see overoptimistically invoke
> the CLT not when n is over 30 and the population standard deviation is
> known but anytime n is over 30. This seems inappropriate to me, or am I
> overlooking something?

Sometimes CLT is a good approximation for small data sets too, and
sometimes it's not good even if n is very large. It all depends on the model,
the data and so on. Often it's your only choice to use asymptotic argument
and CLT.


>
> When the population standard deviation is not known (which is almost all the
> time) it seems to me that the Student t (t) distribution is more
> appropriate.

Not at all. Again, you do not need to know the standard deviation to apply
the CLT. You can replace unknown parameters by their consistent estimators.


I do not know which textbooks you are referring to, but I suggest you
try something more advanced like "Estimation and Inference in
Econometrics" by Davidson and MacKinnon or "Econometric Theory" by
Davidson.






RE: When Can We Really Use CLT & Student t

2001-11-21 Thread Kaplon, Howard

It has been a long time; so if I am wrong, please fan the flames gently.


The derivation of the t distribution is from the ratio of a Normal(0,1)
over the square root of a ChiSquare divided by its degrees of freedom.

    t = [(x-bar - mu) / sigma] / sqrt{[(n-1) S-squared / sigma-squared] / (n-1)}

which simplifies to  t = [(x-bar - mu) / S]

The CLT allows for the numerator to be approximately Normal(0,1)
regardless of the distribution of X, but does NOT allow for the
denominator to be approximately ChiSquare.  This is the rub in using the
t distribution when the original distribution of X is UNknown.

What many authors do, I believe, is employ the Law of Large Numbers, and
say that for n sufficiently large, the probability approaches 0 that
| sigma - s | is different from 0.  That is, sigma and s may be
interchanged with "minimal" probability of any change.  And so the ratio
[(x-bar - mu) / s] may be interchanged with [(x-bar - mu) / sigma] = Z.
Thus through dual approximations [(x-bar - mu) / s] has an approximate
Normal(0,1) distribution.
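
For reference, the same derivation in standard notation. This sketch assumes a normal population (so that the chi-square and independence claims hold exactly) and writes the sqrt(n) scaling out explicitly; Z and V are labels introduced here for illustration.

```latex
\[
Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \sim N(0,1),
\qquad
V = \frac{(n-1)S^{2}}{\sigma^{2}} \sim \chi^{2}_{n-1},
\qquad Z \text{ and } V \text{ independent},
\]
\[
t \;=\; \frac{Z}{\sqrt{V/(n-1)}}
  \;=\; \frac{(\bar{x}-\mu)\big/(\sigma/\sqrt{n})}{S/\sigma}
  \;=\; \frac{\bar{x}-\mu}{S/\sqrt{n}}
  \;\sim\; t_{n-1}.
\]
```

When the population is not normal, the first claim holds only approximately (by the CLT) and the second and third can fail, which is the difficulty the message describes.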

Howard Kaplon



-----Original Message-----
From: Ronny Richardson [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 21, 2001 12:50 PM
To: [EMAIL PROTECTED]
Subject: When Can We Really Use CLT & Student t

As I understand it, the Central Limit Theorem (CLT) guarantees that the
distribution of sample means is normally distributed regardless of the
distribution of the underlying data as long as the sample size is large
enough and the population standard deviation is known.

It seems to me that most statistics books I see overoptimistically invoke
the CLT not when n is over 30 and the population standard deviation is
known but anytime n is over 30. This seems inappropriate to me, or am I
overlooking something?

When the population standard deviation is not known (which is almost all the
time) it seems to me that the Student t (t) distribution is more
appropriate. However, t requires that the underlying data be normal, or at
least not too non-normal. My expectation is that most data sets are not
nearly "normal enough" to make using t appropriate.

So, if we do not know the population standard deviation and we cannot
assume a normal population, what should we be doing, as opposed to just
using the CLT as most business statistics books do?

Ronny Richardson





Re: When Can We Really Use CLT & Student t

2001-11-21 Thread Dennis Roberts

At 12:49 PM 11/21/01 -0500, Ronny Richardson wrote:
>As I understand it, the Central Limit Theorem (CLT) guarantees that the
>distribution of sample means is normally distributed regardless of the
>distribution of the underlying data as long as the sample size is large
>enough and the population standard deviation is known.

nope ... clt says nothing of the kind
it says that regardless of the shape of the target population ... as n 
increases, the shape of the sampling distribution of means is better and 
better APPROXIMATED by the normal distribution

that is, even if the target population is quite different from normal ... 
if we take decent sized samples ... we can say and not be TOO wrong that 
the sampling distribution of means looks something like a normal ...

here is a quick simulation taking samples of n=50 (based on 1 samples) 
from a chi square distribution with 1 df

[dotplot of the sample means, column C51; horizontal axis from 0.30 to 1.80]

even though the chi square distribution is radically + skewed, the sampling 
distribution looks pretty darn close to a normal distribution ... but it 
never will be exactly one ...


it does NOT say that it will GET to and BECOME a normal distribution

if the population is not normal ... the sampling distribution will not be 
normal regardless of n ... but, it could be that your EYES could not tell 
the difference
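
A simulation along these lines can be reproduced in a few lines of plain Python. This is a sketch, not the original session: the number of replications (10,000 here), the seed, and the helper names sample_means and skewness are assumptions. It uses the fact that a chi-square variable with 1 df is the square of a standard normal.

```python
import random
import statistics

def sample_means(n, reps, rng):
    """Means of `reps` samples of size n from a chi-square(1) population."""
    # A chi-square variable with 1 df is the square of a standard normal.
    return [
        statistics.fmean(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
        for _ in range(reps)
    ]

def skewness(xs):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / sd) ** 3 for x in xs)

rng = random.Random(2001)  # arbitrary seed, only for reproducibility
means = sample_means(n=50, reps=10_000, rng=rng)

print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means:  ", round(statistics.pstdev(means), 3))
print("skewness:            ", round(skewness(means), 3))
```

Theory puts the mean at 1, the sd at sqrt(2/50) = 0.2, and the skewness at sqrt(8)/sqrt(50) = 0.4. The leftover skewness is the numerical trace of "never exactly normal": it shrinks like 1/sqrt(n) but does not reach zero.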


>It seems to me that most statistics books I see overoptimistically invoke
>the CLT not when n is over 30 and the population standard deviation is
>known but anytime n is over 30. This seems inappropriate to me, or am I
>overlooking something?

you are mixing two metaphors ...

if we know the sd of the population ... then we know the real sampling 
error ... ie, standard error of the mean ... if we do NOT know the 
population sd, and substitute our estimate of that from the sample, then we 
are only estimating the standard error of the mean

thus ... knowing or not knowing the population sd helps us to know or only 
to estimate the real standard error ... but this is unconnected with shape 
of sampling distribution

shape of sampling distribution is partly a function of shape of population 
AND random sample size ...


>When the population standard deviation is not known (which is almost all the
>time) it seems to me that the Student t (t) distribution is more
>appropriate. However, t requires that the underlying data be normal, or at
>least not too non-normal. My expectation is that most data sets are not
>nearly "normal enough" to make using t appropriate.
>
>So, if we do not know the population standard deviation and we cannot
>assume a normal population, what should we be doing, as opposed to just
>using the CLT as most business statistics books do?
>
>Ronny Richardson

_
dennis roberts, educational psychology, penn state university
208 cedar, AC 8148632401, mailto:[EMAIL PROTECTED]
http://roberts.ed.psu.edu/users/droberts/drober~1.htm


