Re: When Can We Really Use CLT & Student t
Ronny Richardson wrote:
>
> As I understand it, the Central Limit Theorem (CLT) guarantees that the
> distribution of sample means is normally distributed regardless of the
> distribution of the underlying data as long as the sample size is large
> enough and the population standard deviation is known.
>
> It seems to me that most statistics books I see over-optimistically invoke
> the CLT not when n is over 30 and the population standard deviation is
> known but anytime n is over 30. This seems inappropriate to me or am I
> overlooking something?
>
> When the population standard deviation is not known (which is almost all the
> time) it seems to me that the Student t (t) distribution is more
> appropriate. However, t requires that the underlying data be normal, or at
> least not too non-normal. My expectation is that most data sets are not
> nearly "normal enough" to make using t appropriate.
>
> So, if we do not know the population standard deviation and we cannot
> assume a normal population, what should we be doing, as opposed to just
> using the CLT as most business statistics books do?

I address some of this in my note at http://www.tufts.edu/~gdallal/meandist.htm .

As for using s in place of sigma, a result known as Slutsky's Theorem says it's "okay".

All of this hinges on what you mean by "approximate" and "normal enough". It varies from field to field and with what is being measured.

=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
http://jse.stat.ncsu.edu/
=
Re: When Can We Really Use CLT & Student t
"Kaplon, Howard" wrote:
>
> What many authors do, I believe, is employ the Law of Large
> Numbers, and say that for n sufficiently large, the probability
> approaches 0 that | sigma - s | is different from 0. That is
> sigma and s may be interchanged with "minimal" probability of any
> change. And so the ratio [(x-bar - mu) / s] may be interchanged
> with [(x-bar - mu) / sigma] = Z. Thus through dual approximations
> [(x-bar - mu) / s] has an approximate Normal(0,1) distribution.

s converges to sigma in probability, so s/sigma converges to 1 in probability. Slutsky's theorem then says that sqrt(n)Z/(s/sigma) will have the same asymptotic distribution as sqrt(n)Z.
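The Slutsky argument can be checked numerically. What follows is only a minimal sketch and not part of the original posts: it assumes Exponential(1) data (so mu = 1 and sigma = 1), n = 200 and 2000 replications, all arbitrary illustrative choices, and the function names are made up for this example. It compares how often sqrt(n)(x-bar - mu)/sigma and sqrt(n)(x-bar - mu)/s land within +/-1.96.

```python
import math
import random
import statistics


def studentized_vs_standardized(n=200, reps=2000, seed=0):
    """Simulate sqrt(n)(xbar - mu)/sigma and sqrt(n)(xbar - mu)/s.

    Assumption for illustration: the data are Exponential(1),
    a skewed distribution with mu = 1 and sigma = 1.
    """
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0
    z_known, z_est = [], []
    for _ in range(reps):
        x = [rng.expovariate(1.0) for _ in range(n)]
        xbar = statistics.fmean(x)
        # Sample standard deviation with the usual n - 1 divisor.
        s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
        z_known.append(math.sqrt(n) * (xbar - mu) / sigma)
        z_est.append(math.sqrt(n) * (xbar - mu) / s)
    return z_known, z_est


def frac_within(values, cut=1.96):
    """Fraction of values with absolute value at most `cut`."""
    return sum(abs(v) <= cut for v in values) / len(values)


z_known, z_est = studentized_vs_standardized()
print("known sigma:", frac_within(z_known))
print("estimated s:", frac_within(z_est))
```

If the approximation is working, both fractions should sit near 0.95 and near each other. Rerunning with a much smaller n is a quick way to see where "large enough" starts to matter for a given population.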
Re: When Can We Really Use CLT & Student t
In article <[EMAIL PROTECTED]>, Kaplon, Howard <[EMAIL PROTECTED]> wrote:

>It has been a long time; so if I am wrong, please fan the flames gently.

>The derivation of the t distribution is from the ratio of a Normal(0,1)
>over the square root of a ChiSquare divided by its degrees of freedom.
>    t = [(x-bar - mu) / sigma] / sqrt{[(n-1) S-squared / sigma-squared] / (n-1)}
>which simplifies to t = [(x-bar - mu) / S]

>The CLT allows for the numerator to be approximately Normal(0,1)
>regardless of the distribution of X, but does NOT allow for the
>denominator to be approximately ChiSquare. This is the rub in using the
>t distribution when the original distribution of X is UNknown.

>What many authors do, I believe, is employ the Law of Large Numbers, and
>say that for n sufficiently large, the probability approaches 0 that
>| sigma - s | is different from 0. That is sigma and s may be
>interchanged with "minimal" probability of any change. And so the ratio
>[(x-bar - mu) / s] may be interchanged with [(x-bar - mu) / sigma] = Z.

>Thus through dual approximations [(x-bar - mu) / s] has an approximate
>Normal(0,1) distribution.

Unless the precise significance level is of great importance, and this should never be the case, it does not matter too much. The chance that the sample standard deviation is substantially too small, which can cause erroneous rejection, is not that much of a problem. On the other hand, if the sample standard deviation is substantially too large, this can make it very hard to reject, which is the more likely problem.

--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054 FAX: (765)494-0558
Re: When Can We Really Use CLT & Student t
In article <[EMAIL PROTECTED]>, Ronny Richardson <[EMAIL PROTECTED]> wrote:

>As I understand it, the Central Limit Theorem (CLT) guarantees that the
>distribution of sample means is normally distributed regardless of the
>distribution of the underlying data as long as the sample size is large
>enough and the population standard deviation is known.

Wrong! It states, in part, that the distribution of the sample mean gets closer (in the sense of convergence in distribution) to the normal distribution as the sample size increases. The distribution is never normal unless the original distribution is normal.

>It seems to me that most statistics books I see over-optimistically invoke
>the CLT not when n is over 30 and the population standard deviation is
>known but anytime n is over 30. This seems inappropriate to me or am I
>overlooking something?

You are not overlooking anything. The rate of convergence is usually not the greatest.

>When the population standard deviation is not known (which is almost all the
>time) it seems to me that the Student t (t) distribution is more
>appropriate. However, t requires that the underlying data be normal, or at
>least not too non-normal. My expectation is that most data sets are not
>nearly "normal enough" to make using t appropriate.

The approximation of the distribution of the t statistic to the t distribution is similar to that of the mean to the normal. Symmetric tests are somewhat better for most, and the t-test is usually used two-sided.

>So, if we do not know the population standard deviation and we cannot
>assume a normal population, what should we be doing, as opposed to just
>using the CLT as most business statistics books do?

1. Specify your real problem; what are the consequences of taking wrong actions, and how important are they?

2. If the results are not too sensitive to the cut point, and your problem is similar to those for which the t-statistic or the normal distribution is usually used, go ahead and use it.

3. If not, consult a good mathematical statistician. It is even possible that new recipes need to be used.

--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054 FAX: (765)494-0558
Re: When Can We Really Use CLT & Student t
On 21 Nov 2001 10:18:01 -0800, [EMAIL PROTECTED] (Ronny Richardson) wrote:

> As I understand it, the Central Limit Theorem (CLT) guarantees that the
> distribution of sample means is normally distributed regardless of the
> distribution of the underlying data as long as the sample size is large
> enough and the population standard deviation is known.
>
> It seems to me that most statistics books I see over-optimistically invoke
> the CLT not when n is over 30 and the population standard deviation is
> known but anytime n is over 30. This seems inappropriate to me or am I
> overlooking something?

[ snip, rest ]

It seems to me that you have doubts which *might* be justifiable. Do you have a professor who is prone to glib generalizations? Do you have a lousy text?

I do wonder if your textbooks actually say what you accuse them of, or if you are guilty of hasty overgeneralization. I have scanned textbooks in search of errors like those, but I hardly ever find any. Gross mis-statements tend to be in "handbooks" and in (unfortunate) interpretative articles by non-statisticians.

(Can you cite "chapter and verse"?)

--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
Re: When Can We Really Use CLT & Student t
Ronny Richardson wrote:

> As I understand it, the Central Limit Theorem (CLT) guarantees that the
> distribution of sample means is normally distributed regardless of the
> distribution of the underlying data as long as the sample size is large
> enough and the population standard deviation is known.

The distribution of a linear contrast taken from a single population _tends_ toward a Normal distribution, regardless of the distribution of the underlying population.

linear contrast = any given sum or difference of individual measurements. Could be an average, could be an average minus another average, etc.

_tends_ = as the n of the contrast increases, the distribution of the contrast approaches a Normal distribution. It does not mathematically equal said Normal. It does not magically shift from non-Normal to Normal at n=30, or n=anything. If one could measure the 'difference' between the observed distribution of the contrast and a true Normal, then the difference would asymptotically approach zero as n increases.

In practice, if the original population has a saw tooth shape, an average of 12 observations will be within one line width (about 1 printer's point) of a Normal on a 1/6 page size chart.

Also in practice, you need to ask how close to Normal a distribution you need, before invoking traditional tests, CI's, and conclusions. Since "nothing is random," likewise, "nothing is Normal." You ask for absolute rigor, and you will wind up tossing the whole bath tub, as well as baby & bath water. If the objective is to help someone make a decision, for example, you may have to help them understand the impact of their (or your!) choice of cost/precision/rigor tradeoff.

I didn't say anything about a required sample size; the approach is asymptotic. How close is close enough is your problem :)

I didn't say what the underlying population distribution was. Doesn't matter. Whether this is true for a Cauchy dist. (the one which has no stdev - I'm not sure of the name) I can't say. If you want honest mathematical proof, you'll have to go to the book. Sorry 'bout that.

> It seems to me that most statistics books I see over-optimistically invoke
> the CLT not when n is over 30 and the population standard deviation is
> known but anytime n is over 30. This seems inappropriate to me or am I
> overlooking something?
>
> When the population standard deviation is not known (which is almost all the
> time) it seems to me that the Student t (t) distribution is more
> appropriate.

True. If the stdev is estimated from _internal_ sources (the given data), then Student's t is applicable.

> However, t requires that the underlying data be normal, or at
> least not too non-normal. My expectation is that most data sets are not
> nearly "normal enough" to make using t appropriate.

Depends, as they say. Actually, the t test is making conclusions about averages & groups. Thus, it uses the CLT to report differences in the _average_ of a sample. Thus, CLT covers the possible sins of original distribution. Somewhat. Thus, the t dist. is particularly rugged & even 'forgiving.' Esp. compared with a one way AoV. So I'm told.

If your prior knowledge (Rev. Bayes shows up again!) indicates you should not expect a Normal dist., then perhaps you shouldn't be surprised if you find something else. Product life times under long tests, extreme depths of snow, lots of things are not 'Normal.' Numbers of arrivals per unit time at a ticket counter, too. That's why we check these things, true?

> So, if we do not know the population standard deviation and we cannot
> assume a normal population, what should we be doing, as opposed to just
> using the CLT as most business statistics books do?

Let's clarify a possible confusion point. The 'business statistics books' often say that you should use a z test when sigma is known, or when n > 30. Then use a t test for n <= 30.

The t dist. approaches the z dist. asymptotically, not discontinuously (in a step function). At n=30, the critical t value is about 3% different from the z critical value. If you don't mind a 3% error in your critical value, then what the hey.

More accurate is to say:

When sigma is known from _external_ data, taken from outside the sample set, then use z. Ex: when the QA people say they already measured, and found sigma = 345.

When sigma is estimated from _internal_ data, i.e., from the sample data set, then use t. Ex: when nobody has ever looked at average sales by district and month before, there is no way to know what sigma is before we collect the data.

In practice, most of the time the second situation applies. The practice of substituting an estimated sigma for a known sigma when n > 30 seems to stem from the days when Excel didn't exist, before VisiCalc even. Whether it is necessary today is an issue I'm working on, with the business students I encounter.

> Ronny Richardson

Cheers,
Jay
--
Jay Warner
Principal Scientist
Warner Consulting, Inc.
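The size of the gap between t and z critical values can be tabulated directly. The sketch below is an illustration only: it assumes a one-sample setting with df = n - 1, the helper names (t_pdf, t_cdf, t_crit, gap_percent) are invented for this example, and the t quantile is found by Simpson's rule plus bisection as a stand-in for a statistics library routine.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=1000):
    """CDF for x >= 0, integrating the density on [0, x] by Simpson's rule."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_crit(p, df):
    """Upper-tail critical value: the x with t_cdf(x) = p, for p > 0.5."""
    lo, hi = 0.0, 50.0
    for _ in range(40):  # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def gap_percent(p, n):
    """Percent by which the t critical value (df = n - 1) exceeds z."""
    z = NormalDist().inv_cdf(p)
    return 100 * (t_crit(p, n - 1) - z) / z


for n in (5, 10, 30, 100):
    print(n, round(gap_percent(0.95, n), 1), round(gap_percent(0.975, n), 1))
```

For n = 30 (29 degrees of freedom) the two-sided 5% critical value is about 2.045 against 1.960 for z, a gap of a little over 4%, while the one-sided 5% values are about 1.699 and 1.645, a gap of roughly 3%. The gap shrinks smoothly as n grows, with no jump at n = 30.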
Re: When Can We Really Use CLT & Student t
Ronny Richardson wrote:

> As I understand it, the Central Limit Theorem (CLT) guarantees that the
> distribution of sample means is normally distributed regardless of the
> distribution of the underlying data as long as the sample size is large
> enough and the population standard deviation is known.

Not quite. The CLT states that the sample mean is _approximately_ normal if the sample size is large enough. This will be true regardless of whether you know the population standard deviation or not.

> It seems to me that most statistics books I see over-optimistically invoke
> the CLT not when n is over 30 and the population standard deviation is
> known but anytime n is over 30. This seems inappropriate to me or am I
> overlooking something?

It is indeed a common shortcut used in many introductory texts to imply that magic happens whenever n > 30. Again, knowing the standard deviation has nothing to do with it.

> When the population standard deviation is not known (which is almost all the
> time) it seems to me that the Student t (t) distribution is more
> appropriate. However, t requires that the underlying data be normal, or at
> least not too non-normal. My expectation is that most data sets are not
> nearly "normal enough" to make using t appropriate.

Not really. I suspect that you are muddling two very different concepts: applying the Central Limit Theorem and approximating the t-distribution by a normal distribution. The latter is defensible whenever n > 30 (or when you have 30 or more degrees of freedom), since most of the time the underlying distribution is not normal, so you approximate anyway. The former requires separate justification (which is often not done in introductory texts). It is unfortunate that the same rule of thumb (n > 30) is used for both concepts, and that explanations are often not given.

> So, if we do not know the population standard deviation and we cannot
> assume a normal population, what should we be doing, as opposed to just
> using the CLT as most business statistics books do?

That's a difficult question to answer. First, how non-normal is your underlying population? Standard tests of hypotheses on the mean are quite robust with regard to normality. On the other hand, is the sample mean really a useful measure in your context? Nonparametric methods _may_ be called for, but it will depend on the situation.

Hope that helps.
Re: When Can We Really Use CLT & Student t
On 21 Nov 2001, Ronny Richardson wrote:

> As I understand it, the Central Limit Theorem (CLT) guarantees that the
> distribution of sample means is normally distributed regardless of the
> distribution of the underlying data as long as the sample size is large
> enough and the population standard deviation is known.

CLT does not guarantee anything. It's just an approximation that sometimes works and sometimes does not work. The underlying distribution does actually matter, or, more correctly, the data has to satisfy some regularity conditions for CLT to apply. Population standard deviation does not need to be known.

> It seems to me that most statistics books I see over-optimistically invoke
> the CLT not when n is over 30 and the population standard deviation is
> known but anytime n is over 30. This seems inappropriate to me or am I
> overlooking something?

Sometimes CLT is a good approximation for small data sets too, and sometimes it's not good even if n is very large. It all depends on the model, the data and so on. Often your only choice is to use an asymptotic argument and CLT.

> When the population standard deviation is not known (which is almost all the
> time) it seems to me that the Student t (t) distribution is more
> appropriate.

Not at all. Again, you do not need to know the standard deviation to apply CLT. You can replace unknown parameters by their consistent estimators.

I do not know which textbooks you are referring to, but I suggest you try something more advanced like "Estimation and Inference in Econometrics" by Davidson and MacKinnon or "Econometric Theory" by Davidson.
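One standard case where the regularity conditions fail is the Cauchy distribution, which has no finite mean or variance; the average of n independent standard Cauchy draws is again standard Cauchy, so it never tightens up. The sketch below is only an illustration with arbitrary choices (n = 10 and n = 400, 1000 replications, Exponential(1) as the well-behaved comparison), and its function names are made up for this example.

```python
import math
import random
import statistics


def draw_cauchy(rng):
    """One standard Cauchy draw via the inverse-CDF transform."""
    return math.tan(math.pi * (rng.random() - 0.5))


def draw_expo(rng):
    """One Exponential(1) draw (finite mean and variance)."""
    return rng.expovariate(1.0)


def iqr_of_sample_means(draw, n, reps=1000, seed=1):
    """Interquartile range of `reps` simulated sample means of size n."""
    rng = random.Random(seed)
    means = [statistics.fmean([draw(rng) for _ in range(n)])
             for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


for n in (10, 400):
    print(n,
          "exponential:", round(iqr_of_sample_means(draw_expo, n), 3),
          "cauchy:", round(iqr_of_sample_means(draw_cauchy, n), 3))
```

The spread of the exponential sample means should shrink roughly like 1/sqrt(n), while the spread of the Cauchy sample means should stay near 2 (the interquartile range of a single standard Cauchy draw) however large n gets.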
RE: When Can We Really Use CLT & Student t
It has been a long time; so if I am wrong, please fan the flames gently.

The derivation of the t distribution is from the ratio of a Normal(0,1) over the square root of a ChiSquare divided by its degrees of freedom.

    t = [(x-bar - mu) / sigma] / sqrt{[(n-1) S-squared / sigma-squared] / (n-1)}

which simplifies to

    t = [(x-bar - mu) / S]

The CLT allows for the numerator to be approximately Normal(0,1) regardless of the distribution of X, but does NOT allow for the denominator to be approximately ChiSquare. This is the rub in using the t distribution when the original distribution of X is UNknown.

What many authors do, I believe, is employ the Law of Large Numbers, and say that for n sufficiently large, the probability approaches 0 that | sigma - s | is different from 0. That is sigma and s may be interchanged with "minimal" probability of any change. And so the ratio [(x-bar - mu) / s] may be interchanged with [(x-bar - mu) / sigma] = Z.

Thus through dual approximations [(x-bar - mu) / s] has an approximate Normal(0,1) distribution.

Howard Kaplon

-----Original Message-----
From: Ronny Richardson [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 21, 2001 12:50 PM
To: [EMAIL PROTECTED]
Subject: When Can We Really Use CLT & Student t

As I understand it, the Central Limit Theorem (CLT) guarantees that the distribution of sample means is normally distributed regardless of the distribution of the underlying data as long as the sample size is large enough and the population standard deviation is known.

It seems to me that most statistics books I see over-optimistically invoke the CLT not when n is over 30 and the population standard deviation is known but anytime n is over 30. This seems inappropriate to me or am I overlooking something?

When the population standard deviation is not known (which is almost all the time) it seems to me that the Student t (t) distribution is more appropriate. However, t requires that the underlying data be normal, or at least not too non-normal. My expectation is that most data sets are not nearly "normal enough" to make using t appropriate.

So, if we do not know the population standard deviation and we cannot assume a normal population, what should we be doing, as opposed to just using the CLT as most business statistics books do?

Ronny Richardson
Re: When Can We Really Use CLT & Student t
At 12:49 PM 11/21/01 -0500, Ronny Richardson wrote:

>As I understand it, the Central Limit Theorem (CLT) guarantees that the
>distribution of sample means is normally distributed regardless of the
>distribution of the underlying data as long as the sample size is large
>enough and the population standard deviation is known.

nope ... clt says nothing of the kind

it says that regardless of the shape of the target population ... as n increases, the shape of the sampling distribution of means is better and better APPROXIMATED by the normal distribution

that is, even if the target population is quite different from normal ... if we take decent sized samples ... we can say and not be TOO wrong that the sampling distribution of means looks something like a normal ...

here is a quick simulation taking samples of n=50 (based on 1 samples) from a chi square distribution with 1 df

[character dotplot of the simulated sample means (C51); horizontal axis labeled 0.30, 0.60, 0.90, 1.20, 1.50, 1.80]

even though the chi square distribution is radically positively skewed, the sampling distribution looks pretty darn close to a normal distribution ... but it never will be exactly one ...

it does NOT say that it will GET to and BECOME a normal distribution

if the population is not normal ... the sampling distribution will not be normal regardless of n ... but, it could be that your EYES could not tell the difference

>It seems to me that most statistics books I see over-optimistically invoke
>the CLT not when n is over 30 and the population standard deviation is
>known but anytime n is over 30. This seems inappropriate to me or am I
>overlooking something?

you are mixing two metaphors ...

if we know the sd of the population ... then we know the real sampling error ... ie, standard error of the mean ...

if we do NOT know the population sd, and substitute our estimate of that from the sample, then we are only estimating the standard error of the mean

thus ... knowing or not knowing the population sd helps us to know or only to estimate the real standard error ... but this is unconnected with shape of sampling distribution

shape of sampling distribution is partly a function of shape of population AND random sample size ...

>When the population standard deviation is not known (which is almost all the
>time) it seems to me that the Student t (t) distribution is more
>appropriate. However, t requires that the underlying data be normal, or at
>least not too non-normal. My expectation is that most data sets are not
>nearly "normal enough" to make using t appropriate.
>
>So, if we do not know the population standard deviation and we cannot
>assume a normal population, what should we be doing, as opposed to just
>using the CLT as most business statistics books do?
>
>Ronny Richardson

dennis roberts, educational psychology, penn state university
208 cedar, AC 8148632401, mailto:[EMAIL PROTECTED]
http://roberts.ed.psu.edu/users/droberts/drober~1.htm
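A simulation of this kind is easy to redo. The following is a rough Python sketch rather than the original run: the number of replications (5000), the seed, the bin layout and the function names are all assumptions made for illustration. It draws chi square (1 df) values as squared standard normals, averages them in samples of n=50, and prints summary numbers plus a crude text histogram of the sample means.

```python
import random
import statistics


def chi2_1_sample_means(n=50, reps=5000, seed=2):
    """Means of `reps` samples of size n from chi square with 1 df.

    A chi square (1 df) value is the square of a standard normal draw.
    """
    rng = random.Random(seed)
    return [statistics.fmean([rng.gauss(0.0, 1.0) ** 2 for _ in range(n)])
            for _ in range(reps)]


def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


means = chi2_1_sample_means()
print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means:  ", round(statistics.pstdev(means), 3))
print("skewness:            ", round(skewness(means), 3))

# Crude text histogram over bins of width 0.1 from 0.3 to 1.8.
for k in range(3, 18):
    lo = k / 10
    count = sum(lo <= m < lo + 0.1 for m in means)
    print(f"{lo:4.1f} {'*' * (count // 25)}")
```

In theory the sample means have mean 1, standard deviation sqrt(2/50) = 0.2 and skewness sqrt(8/50) = 0.4, compared with a skewness of about 2.83 for the chi square (1 df) population itself. So the histogram should look roughly bell shaped while still leaning a little to the right, which matches the point that the sampling distribution is close to normal without ever becoming one.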