During years of passionate practice and round-the-clock chaotic learning in the field of applied statistics, I have been desperately longing to learn the fundamentals of mathematical statistics, as well as to start working as a statistician. As the latter recently came true, I simply had to make some notable progress in the former as well. Not to extend this unnecessary introduction any further, let me just state that I simply cannot find adequate words of praise for the role and value of the sci.stat newsgroups in the whole story.
Now, to be even luckier, this week I've been ill and thus found some peace for studying, while at the same time the discussion on CLT and t vs. z popped up. As a consequence, please allow me to ask for critiques of this brief recapitulation of the issue.***

(please view in nonproportional font)

sample size       | distribution(s) | population var | appropriate test
------------------------------------------------------------------------------------
large (say, N>30) | normal          | known          | z (obvious)
large             | not normal      | known          | z (CLT takes care of numerator)
small             | not normal      | known          | still z, right??
large             | normal          | estimated      | t (note 1 below)
small             | normal          | estimated      | t (the case of Student)
small             | not normal      | estimated      | mostly t (note 2 below)

Note 1: z before the computer era, and also OK due to Slutsky's theorem

Note 2: t-test is very robust (BTW, is Boneau, 1960, Psychological Bulletin vol. 57, referenced and summarised in Quinn McNemar, Psychological Statistics, 4th ed. 1969, with the nice introduction "Boneau, with the indispensable help of an electronic computer, ...", still an adequate reference?), whereby:
- skewness, even extreme, is not a big problem
- two-tailed testing increases robustness
- unequal variances are a serious problem when the N's are unequal and the smaller sample has the larger variance

Now, what to do if t is inadequate? - This is a whole complex issue in itself, so just a few thoughts:
- in case of extreme skewness, Mann-Whitney is not a good alternative (assumes symmetric distrib.), right?
- so Kolmogorov-Smirnov? But where to find truly continuous variables, especially in social sciences? Plus not very powerful with small N, right?
- so exact permutation test, right? (Permutation Test with General Scores in StatXact - the manual says in this special case it's called Pitman's test)
- solution for the problematic unequal variances case: take a random subsample of the larger sample, of the size of the smaller sample?? Or do kinda bootstrap - do it, say, 1000 times and take the average obtained p??? - Figured these two out by myself, so surely they are utterly wrong. So transformation (in real-life cases usually log, or the Box-Cox, which I am yet to understand)?

A big thanks for any comment,

Gaj Vidmar
University of Ljubljana, Faculty of Medicine
Institute of biomedical informatics

*** I try to be fully aware of how fundamentally wrong it is to seek and view statistics as a collection of recipes, no matter how diverse and advanced they may be; but the fact remains that masses of people still encounter and/or are taught statistics in precisely this manner, preferably with the collection being very limited, extremely outdated and mainly faulty. And I speak from personal experience in its most extreme form here, but in spite of having graduated in psychology, I dare at the same time to oppose most strongly any authority whatsoever and wherever who claims that this is mostly due to, or mostly the case in, the social sciences! - But fortunately, if there is any real benefit of the Internet to humanity, the wealth of statistics-related resources ...

Anyhow, putting aside nonproductive debates, let me just do my best to make the living case that the probability that the aforementioned approach and circumstances do not always lead to their replication and proliferation is not zero. - Or, at least, since we all know that even events with zero probability can happen, ... :) - Yes, to exaggerate just a little, you can start by mastering hand-computed point-biserial correlation, point&click transformations in SPSS after regression diagnostics the next year, automate logistic regression with interaction terms the following year, then speak about somebody named Tufte to your new girlfriend all night long, and so forth and so forth, and wouldya believe it, last month came the derivation of the distribution of the minimum of n samples taken from an exponential distribution!
And I'll be damned if in 2002 some S-Plus or R simulations as part of serious research in statistics don't happen! And yes, I can foresee and promise that - health and means permitting - by retirement (i.e., after a few decades) even this person doomed to subnormality by the psych degree will learn enough mathematics to become a Bayesian :)
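
To make the exact permutation test mentioned among the alternatives to t a little more concrete, here is a minimal sketch in Python. It assumes two small independent samples and uses the absolute difference of means on the raw scores as the test statistic. The function name and the example data are hypothetical, and it is not meant to reproduce what StatXact does internally. It enumerates every way of splitting the pooled data into groups of the original sizes, so it is only practical for small N.

```python
from itertools import combinations


def exact_permutation_test(x, y):
    """Two-sided exact permutation p-value for a difference in means.

    Hypothetical helper for illustration: enumerates all splits of the
    pooled data into groups of sizes len(x) and len(y).
    """
    pooled = list(x) + list(y)
    n_x = len(x)
    n_y = len(pooled) - n_x
    total = sum(pooled)

    # Observed absolute difference between the two sample means.
    observed = abs(sum(x) / n_x - sum(y) / n_y)

    extreme = 0   # splits at least as extreme as the observed one
    n_splits = 0  # total number of splits enumerated
    for idx in combinations(range(len(pooled)), n_x):
        sum_x = sum(pooled[i] for i in idx)
        diff = abs(sum_x / n_x - (total - sum_x) / n_y)
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1
        n_splits += 1

    return extreme / n_splits


# Made-up data: only 2 of the 20 possible splits are as extreme as the
# observed one, so the p-value is 2/20.
print(exact_permutation_test([1, 2, 3], [4, 5, 6]))  # 0.1
```

With two samples of size 3 the smallest attainable two-sided p-value is 0.1, which shows how coarse the permutation distribution is at very small N.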