IMHO, no, you are not being an idiot.  (I'll leave the question of 'again' up
to you :)

The question of what is an outlier, and what is not, comes up every time you
have data collected by undergraduates in a standard lab.  They believe the data
should fit some pre-arranged principle/equation, so points that don't fit well
are 'outliers.'  Then we must decide whether to dump them.

If you dump them because they don't fit the expected model, you wind up with
data that fits your preconceptions.  The mathematical fit is excellent, but it
basically says what you wanted it to say.  It is nearly worthless.

For UG students, I would hit the roof if they tried that.  We have enough
sloppiness in the world (not a little contributed by yours truly) - make sure
the data is sound, anyway.

But what if a point is clearly not 'real'?  (e.g., your 60-foot person.)  If
you can find an _external_ reason why the point landed so far from the others,
you may dump it, _with the explanation_.  If not, you've got to keep it.  Check
out what might have led to such a result.  Maybe the data entry person slipped
and meant 6.0 feet.

_External_ = a reason not related to the rest of the data, or to the vagaries
of the measurement system.

Notice I didn't say anything about dropping points to fit a computationally
desired distribution.  That's an internal justification, in any case.

This approach would get UG students through a sloppy lab with their honor and
rationality intact.

Don't forget, sometimes the oddities lead to new insights.  A small bump on the
pressure trace during vacuum brazing of aluminum turned out to be a good
indicator of process success.

A certain superior aluminum alloy was 'discovered' because the tech ran it
contrary to instructions, and the strength and ductility stood out from the
rest of the ingots.

A chemistry lab instructed the students to take 3 measurements, then average
the 2 closest together, dropping the 'outlier.'  An enterprising grad student
analyzed all the data, and found that in most of the cases, the 3rd point
brought the estimated average closer to the accepted value.  I believe the
expected frequency for this effect was 2/3, and the grad student found that it
happened in 64% of the cases.
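
(For the curious - here's a quick Monte Carlo sketch of that lab rule.  This
is my own check, not the grad student's analysis; I'm assuming three i.i.d.
normal measurements around a known true value, which is my assumption, not
anything stated about the lab.)

import random

def trial(true_value=0.0, sigma=1.0):
    # three measurements, sorted; the closest pair is an adjacent pair
    x = sorted(random.gauss(true_value, sigma) for _ in range(3))
    if x[1] - x[0] <= x[2] - x[1]:
        pair_mean = (x[0] + x[1]) / 2   # drop the 'outlier' x[2]
    else:
        pair_mean = (x[1] + x[2]) / 2   # drop the 'outlier' x[0]
    all_mean = sum(x) / 3
    # did keeping all three points land closer to the accepted value?
    return abs(all_mean - true_value) < abs(pair_mean - true_value)

n = 100_000
wins = sum(trial() for _ in range(n))
print(f"all-three average closer in {wins / n:.1%} of trials")

Run it and see how close you land to the quoted two-thirds.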

Arguably, the vulcanization of rubber was one such case.

As for what, generically, is an outlier?  Don't go there, please.  You can't in
any case say 'this is an outlier' or 'this is not.'  It's a question of
probabilities, a continuum of likelihood.

If you can't examine every data point in your file, then looking at the extreme
ones is better than blindly accepting it all.
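
One minimal way to do that screening (a sketch - the 1.5 * IQR fence and the
toy heights are my illustrative choices, not a rule anyone here prescribes;
the fence flags points to inspect, it doesn't pass a verdict):

import statistics

def flag_extremes(data, k=1.5):
    # quartiles via the standard library; fences at quartile -/+ k * IQR
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

heights_ft = [5.4, 5.9, 6.1, 5.7, 60.0, 5.8, 6.0]  # made-up data
print(flag_extremes(heights_ft))  # -> [60.0]: examine it, don't auto-delete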

Cheers,
jay

Jill Binker wrote:

> At 9:41 AM -0400 8/30/01, Dennis Roberts wrote:
> >all of this is assuming of course, that some extreme value ... by ANY
> >definition ... is "bad" in some way ... that is, worthy of special
> >attention for fear that it got there by some nefarious method
> >
> >i am not sure the flagging of extreme values has any particular value ...
> >certainly, to flag and look at these ... makes no more sense to me than
> >examining all the data points ... to make sure that all seem legitimate ...
> >and accounted for ...
>
> Well, yes. You should always stare at all your data (though I guess a lot
> of people leave this crucial step out). I thought there were two important
> reasons to look for outliers.
>
> An outlier may be a mistake, therefore not real data, therefore should not
> be included (as if you had a bunch of people's heights and one person was
> 60 feet tall -- that's a mistake and shouldn't be included as part of your
> results). Though this idea has come to be translated in many people's minds
> (incorrectly) as "It's an outlier, so delete it." which makes no sense.
> Just because some data is "far away" from the rest doesn't mean it ain't
> data.
>
> Also, some tests only work when there are no outliers, so if you have
> outliers, those tests won't work and you need to do something else. (This,
> I believe, is the real motivation behind "delete them.")
>
> Or am I being an idiot (again)?
> ________________________________
>
> Jill Binker
> Fathom Dynamic Statistics Software
> KCP Technologies, an affiliate of
> Key Curriculum Press
> 1150 65th St
> Emeryville, CA  94608
> 1-800-995-MATH (6284)
> [EMAIL PROTECTED]
> http://www.keypress.com
> http://www.keycollege.com
> __________________________________
>

--
Jay Warner
Principal Scientist
Warner Consulting, Inc.
4444 North Green Bay Road
Racine, WI 53404-1216
USA

Ph: (262) 634-9100
FAX: (262) 681-1133
email: [EMAIL PROTECTED]
web: http://www.a2q.com

The A2Q Method (tm) -- What do you want to improve today?





