here is an example from minitab ... in moore and mccabe's book intro. to the
practice of statistics ... 3rd edition ... they have an example of speed of
light measurements ... with newcomb in his lab on the bank of the potomac 
... bouncing light bursts off the base of the washington monument ... and 
then collecting observations (n=66) in nanoseconds (whatever the heck that is)

[NOTE: just think about newcomb back in the mid 1800s doing this ... i
wonder how many times he totally MISSED hitting the monument with his light
beam and it ended up in philly???]

MTB > dotp c10 c11;
SUBC> same.

Dotplot: nanodata, trimmed


                                                         .
                                                         :
                                                       : :.: .
                                                       :.::: :
                                                     . ::::: : :
          .                           .           : .:::::::::.:...
           -----+---------+---------+---------+---------+---------+-nanodata
                                                         .
                                                         :
                                                       : :.: .
                                                       :.::: :
                                                     . ::::: : :
                                                  . .:::::::::.:
           -----+---------+---------+---------+---------+---------+-trimmed
            24765     24780     24795     24810     24825     24840

the original data set of n=66 shows at least one extreme value at the left ...

now, of 66, the 5% rule would lop off about the lower 3 and upper 3 (5% of
66 is 3.3) ... which does little to the top but DOES lop off the extreme
values at the bottom

i sorted the data and did that ... and the bottom distribution above is the 
n=60 one ... with the top and bottom 3 axed
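
in case you want to replicate that lop-off outside minitab, here is a
minimal python sketch of the same step (the names are mine, not minitab's
... and i have not retyped the 66 values, so the calls at the bottom are
left as comments):

import statistics

def trim(values, per_tail):
    # sort low to high, then axe per_tail observations from each end
    s = sorted(values)
    return s[per_tail:len(s) - per_tail]

# nanodata = [...]                    # the 66 newcomb times (not retyped here)
# trimmed = trim(nanodata, 3)         # 5% of 66 is 3.3 ... call it 3 per tail
# print(statistics.mean(nanodata), statistics.stdev(nanodata))
# print(statistics.mean(trimmed), statistics.stdev(trimmed))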

in the desc. stats below, note that the original nanodata has a mean of
24826 ... and a trimmed mean (TrMean) of 24827 ...

of course, the trimmed mean is just THE mean of the trimmed (n=60) data set

perhaps the more important thing is the variation (ie, the sds) ... (or
variances if you like those better) ... trimming cuts the dispersion index
(the sd in this case) dramatically ... the variance, for example, drops to
only about an eighth of the variance in the original set of nanosecond data
(see the quick check below)
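
a quick arithmetic check on that, using the rounded sds minitab prints
below (11 and 4 ... so the exact ratio will shift a bit with the unrounded
values):

sd_full, sd_trim = 11, 4           # rounded sds from the desc output below
print(sd_full**2, sd_trim**2)      # variances ... 121 and 16
print(sd_trim**2 / sd_full**2)     # about 0.13 ... an eighth or so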

MTB > desc c10 c11

Descriptive Statistics: nanodata, trimmed


Variable             N       Mean     Median     TrMean      StDev    SE Mean
nanodata            66      24826      24827      24827         11          1
trimmed             60      24827      24827      24827          4          1

Variable       Minimum    Maximum         Q1         Q3
nanodata         24756      24840      24824      24831
trimmed          24816      24836      24824      24830
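
(side note for the non-minitab folks ... scipy has this canned ... a
sketch, assuming the same 66 values sit in a python list; note that scipy's
proportiontocut is the fraction cut from EACH tail, and its rounding of 5%
of 66 to a whole number of observations may differ slightly from minitab's)

from scipy import stats

def minitab_style_trmean(values):
    # chop 5% from each tail, then average ... comparable to the TrMean
    # column above (rounding conventions may differ slightly)
    return stats.trim_mean(values, 0.05)

# print(minitab_style_trmean(nanodata))   # should land near 24827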

there is no CLEAR-cut rule for dealing with this situation, nor guidelines
telling one whether this should be done or not ... a lot depends on whether
these outlying values are real ... or whether we can explain them away as
aberrations (miscodings, etc.)

so, i don't make a value judgement about whether what i did in this example
is good or not ... i just give you this example (since i just HAPPEN to be
doing this thing in my class at the moment) as an illustration



At 01:44 PM 1/30/02 -0700, Harold W Kerster wrote:
>   Look at minitab's "trimmed mean."  It is a Tukey (I think) invention
>w/5% chopped from each end, leaving the central 90%.  For the high
>variance, high skew, common world, a good approach.
>
>On Tue, 29 Jan 2002, Rich Ulrich wrote:
>
> > On 17 Jan 2002 00:05:02 -0800, [EMAIL PROTECTED] (Hekon) wrote:
> >
> > > I have noticed a practice among some people dealing with enterprise
> > > data to cut the left and right tails off their samples (including
> > > census data) in both dependent and independent variables. The reason
> > > is that outliers tend to be extreme. The effects can be stunning. How
> > > is this practice to be understood statistically - as some form of
> > > truncation? References that deal formally with such a practice?
> >
> > This is called "trimming" - 5% trimming, 25% trimming.
> > The median is what is left when you have done "50% trimming."
> >
> > Trimming by 5% or 10% reportedly works well for your
> > measures of 'central tendency', so long as you *know*
> > that the extremes are not important.
> >
> > I don't know what it is that you refer to as 'enterprise data.'
> >
> > --
> > Rich Ulrich, [EMAIL PROTECTED]
> > http://www.pitt.edu/~wpilib/index.html

_________________________________________________________
dennis roberts, educational psychology, penn state university
208 cedar, AC 8148632401, mailto:[EMAIL PROTECTED]
http://roberts.ed.psu.edu/users/droberts/drober~1.htm


