During years of passionate practice and round-the-clock chaotic
learning in the field of applied statistics, I have been desperately longing
to learn the fundamentals of mathematical statistics, as well as to start
working as a statistician. As the latter recently came true, I simply had to
make some notable progress in the former as well. Not to extend this
unnecessary introduction any further, let me just state that I simply cannot
find adequate words of praise for the role and value of the sci.stat
newsgroups in the whole story.

Now, to be even more lucky, this week I've been ill and thus found some
peace for studying, while at the same time the discussion on CLT and t vs. z
popped up. As a consequence, please allow me to ask for critiques of this
brief recapitulation of the issue.***

(please view in nonproportional font)

sample size       | distribution(s) | population var | appropriate test
--------------------------------------------------------------------------------------
large (say, N>30) | normal          | known          | z (obvious)
large             | not normal      | known          | z (CLT takes care of numerator)
small             | not normal      | known          | still z, right??
large             | normal          | estimated      | t (note 1 below)
small             | normal          | estimated      | t (the case of Student)
small             | not normal      | estimated      | mostly t (note 2 below)

Note 1: z was used before the computer era and is also OK for large N due to
Slutsky's theorem
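
To make Note 1 concrete, here is a minimal simulation sketch (not part of the original post). It assumes normal data with a true null hypothesis, estimates the variance from the sample, and then compares the statistic with the z critical value instead of the t one. The sample sizes 5 and 100, the 4000 replications, the seed, and the function name `rejection_rate` are illustrative choices only.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def rejection_rate(n, reps=4000, alpha=0.05, seed=1):
    """Share of true-null samples rejected when (xbar - 0) / (s / sqrt(n))
    is compared with the two-sided normal (z) critical value."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # roughly 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Variance is estimated from the sample, so this is really a t statistic.
        stat = mean(sample) / (stdev(sample) / math.sqrt(n))
        if abs(stat) > z_crit:
            rejections += 1
    return rejections / reps


small_n_rate = rejection_rate(5)    # z cutoff is too lenient for small N
large_n_rate = rejection_rate(100)  # z and t nearly coincide for large N
print(small_n_rate, large_n_rate)
```

With small N the rejection rate should come out clearly above the nominal 5%, while with large N it should sit close to it, which is the practical content of "z is OK for large N".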

Note 2: t-test is very robust (BTW, is Boneau, 1960, Psychological Bulletin
vol. 57, referenced and summarised in Quinn McNemar, Psychological
Statistics, 4th ed. 1969, with the nice introduction "Boneau, with the
indispensable help of an electronic computer, ...", still an adequate
reference?), whereby:
- skewness, even extreme, is not a big problem
- two-tailed testing increases robustness
- unequal variances are a serious problem when the N's are unequal and the
smaller sample has the larger variance
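
The last point of Note 2 can be checked with a small Monte Carlo sketch (an illustration added here, not taken from Boneau or McNemar). It assumes two normal samples with equal means, N's of 10 and 40, and a pooled-variance two-sample t-test at the two-sided 5% level. The critical value 2.0106 for 48 degrees of freedom is hard-coded from standard t tables because the Python standard library has no t quantile function. All names and parameter values are hypothetical.

```python
import math
import random
from statistics import mean, variance

T_CRIT_DF48 = 2.0106  # two-sided 5% critical value of t with 48 df (from tables)


def pooled_t(x, y):
    """Classical two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


def type_one_error(sd_of_small_sample, sd_of_large_sample,
                   n_small=10, n_large=40, reps=4000, seed=2):
    """Rejection rate of the pooled t-test when both true means are equal."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, sd_of_small_sample) for _ in range(n_small)]
        y = [rng.gauss(0.0, sd_of_large_sample) for _ in range(n_large)]
        if abs(pooled_t(x, y)) > T_CRIT_DF48:
            hits += 1
    return hits / reps


equal_var_rate = type_one_error(1.0, 1.0)  # assumptions met
bad_case_rate = type_one_error(3.0, 1.0)   # smaller sample has the larger variance
print(equal_var_rate, bad_case_rate)
```

Under these assumptions the first rate should stay near 5%, while the second should be several times larger, because the pooled estimate is dominated by the large, low-variance sample.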

Now, what to do if t is inadequate? - This is a whole complex issue in
itself, so just a few thoughts:
- in case of extreme skewness, Mann-Whitney is not a good alternative
(assumes symmetric distrib.), right?
- so Kolmogorov-Smirnov? But where to find truly continuous variables,
especially in social sciences? Plus not very powerful with small N, right?
- so exact permutation test, right? (Permutation Test with General Scores in
StatXact - the manual says in this special case it's called Pitman's test)
- solution for the problematic unequal variances case: take a random subsample
of the larger sample of the size of the smaller sample?? Or do kinda
bootstrap - do it, say, 1000 times and take the average of the obtained p's??? - Figured
these two out by myself, so surely they are utterly wrong. So transformation
(in real-life cases usually log, or the Box-Cox, which I am yet to
understand)?
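
For the exact permutation test mentioned in the list above, a minimal sketch on the raw observations could look as follows. It assumes the test statistic is the absolute difference in group means and enumerates every way of splitting the pooled data into groups of the original sizes, so it is only feasible for small samples. This is an illustration of the general idea, not a reproduction of the StatXact procedure. The function name `exact_permutation_p` and the toy data are made up.

```python
from itertools import combinations


def exact_permutation_p(x, y, tol=1e-12):
    """Two-sided exact permutation p-value for the difference in means."""
    pooled = list(x) + list(y)
    n1, n = len(x), len(pooled)
    total = sum(pooled)

    def mean_diff(sum_first):
        # Difference in means given the sum of the first group.
        return sum_first / n1 - (total - sum_first) / (n - n1)

    observed = abs(mean_diff(sum(x)))
    as_extreme = 0
    splits = 0
    for idx in combinations(range(n), n1):
        splits += 1
        s = sum(pooled[i] for i in idx)
        # Small tolerance guards against floating-point ties being missed.
        if abs(mean_diff(s)) >= observed - tol:
            as_extreme += 1
    return as_extreme / splits


p_value = exact_permutation_p([1, 2, 3], [4, 5, 6])
print(p_value)  # 0.1: only 2 of the 20 possible splits are as extreme
```

The toy example also shows the limit of the approach with very small N: with two groups of three, the smallest attainable two-sided p-value is 2/20.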

A big thanks for any comment,

Gaj Vidmar
University of Ljubljana, Faculty of Medicine
Institute of biomedical informatics

*** I try to be fully aware of how fundamentally wrong is the quest and view
of statistics as a collection of recipes, no matter how diverse and advanced
they may be; but the fact remains that masses of people still encounter and/or
are taught statistics precisely in this manner, preferably with the
collection being very limited, extremely outdated and mainly faulty. And I
speak from personal experience in its most extreme form here, but in spite
of having graduated in psychology, I dare at the same time to most strongly
oppose any authority whatsoever and wherever who claims that this is mostly
due to, or the case of, social sciences! - But fortunately, if there is any
real benefit of the Internet to humanity, the wealth of statistics-related
resources ... Anyhow, putting aside nonproductive debates, let me just do my
best to make the living case that the possibility that the aforementioned
approach and circumstances do not always lead to their replication and
proliferation is not zero. - Or, at least, since we all know that even
events with zero probability can happen, ... :) - Yes, to exaggerate just a
little, you can start by mastering hand-computed point-biserial correlation,
point&click transformations in SPSS after regression diagnostics the next
year, automate logistic regression with interaction terms the following
year, then speak about somebody named Tufte to your new girlfriend all night
long, and so forth and so forth, and wouldya believe it, last month came
the derivation of the distribution of the minimum of n samples taken from an
exponential distribution! And I'll be damned if in 2002 some S-Plus or R
simulations as part of serious research in statistics don't happen! And yes,
I can foresee and promise that - health and means permitting - by retirement
(i.e., after a few decades) even this person doomed to subnormality by the
psych degree will learn enough mathematics to become a Bayesian :)





