Re: cutting tails of samples

2002-01-31 Thread Jon Miller

Harold W Kerster wrote:

   Look at minitab's trimmed mean.  It is a Tukey (I think) invention w/5%
 chopped from each end, leaving the central 90%.  For the high variance,
 high skew, common world, a good approach.

Oh, yeah, especially when setting reserves.  Or calculating premiums.

Those outliers just get in the way.  Throw 'em out.

How come we're not making any money?
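
The arithmetic behind the sarcasm can be sketched in a few lines. The claim amounts below are made up for illustration (not real insurance data), and `trimmed_mean` is a hypothetical helper that drops floor(p*n) values from each end. With right-skewed losses, a premium set at the 5% trimmed mean collects less than the claims actually paid:

```python
# Hypothetical claim amounts for 20 policies: mostly small, one catastrophic.
claims = [100, 120, 130, 150, 150, 160, 180, 200, 200, 220,
          250, 250, 300, 300, 350, 400, 450, 500, 800, 20000]

def trimmed_mean(values, proportion):
    """Mean after dropping floor(proportion * n) values from each end."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

mean_claim = sum(claims) / len(claims)
trimmed_claim = trimmed_mean(claims, 0.05)   # drops one value from each end

# Premium income if every policy is charged the trimmed mean,
# compared with the total of claims that must be paid.
income = trimmed_claim * len(claims)
shortfall = sum(claims) - income

print("mean claim:   ", round(mean_claim, 2))
print("trimmed mean: ", round(trimmed_claim, 2))
print("shortfall:    ", round(shortfall, 2))
```

The one large claim is exactly what the reserve has to cover, so discarding it as an outlier understates the expected loss.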

Jon Miller



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: cutting tails of samples

2002-01-31 Thread kjetil halvorsen

Hola!

But the original poster asked about trimming IN THE MARGINAL
DISTRIBUTION with multivariate data --- and that is totally insane. If
trimming should be used in regression settings, the residuals should be
trimmed - as in Rousseeuw and Leroy's LTS (least trimmed
squares), implemented in at least three or four statistical packages, but
not in Minitab.
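
A rough sketch of what trimming the residuals means for a straight-line fit. This is not the algorithm from Rousseeuw and Leroy or from any package: to stay short it only tries lines through pairs of data points, which is an approximation to the LTS objective. The data and the names `ols_line` and `lts_line` are invented for illustration.

```python
from itertools import combinations

def ols_line(xs, ys):
    """Ordinary least squares (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def lts_line(xs, ys, h):
    """Approximate LTS: among lines through two data points, pick the one
    with the smallest sum of the h smallest squared residuals."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # vertical line, cannot be written as y = a + b*x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        sq_res = sorted((y - (intercept + slope * x)) ** 2
                        for x, y in zip(xs, ys))
        objective = sum(sq_res[:h])
        if best is None or objective < best[0]:
            best = (objective, intercept, slope)
    return best[1], best[2]

# Made-up data: eight points on y = 1 + 2x, two bad points at large x.
xs = list(range(10))
ys = [1 + 2 * x for x in xs]
ys[8], ys[9] = -20, -25

ols_fit = ols_line(xs, ys)
lts_fit = lts_line(xs, ys, h=8)
print("OLS (intercept, slope):", ols_fit)
print("LTS (intercept, slope):", lts_fit)
```

The two bad points drag the least squares slope negative, while the trimmed criterion ignores the two largest residuals and recovers the line through the other eight points.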

Trimming in the marginal distributions will not (necessarily) find
multivariate outliers, and those may be the most important ones.
It may even increase the influence of the multivariate outliers (make a
picture in 2D, it is easy to see), so it might increase, not decrease,
problems.
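
The 2D picture can also be sketched numerically. The data are synthetic, and the helper names `marginal_trim` and `mahalanobis_sq` are hypothetical. One point is ordinary in x and in y taken separately but lies far from the joint pattern. Marginal 5% trimming keeps it and throws away good points at the ends. The classical squared Mahalanobis distance, used here only as a simple multivariate yardstick, picks it out:

```python
# Synthetic data: 20 points close to the line y = x, plus one point that is
# unremarkable in each margin but far from the joint pattern.
points = [(float(i), i + (0.3 if i % 2 else -0.3)) for i in range(1, 21)]
outlier = (5.5, 16.0)
points.append(outlier)

def marginal_trim(pts, proportion):
    """Drop every point in the lowest or highest `proportion` of x or of y."""
    k = int(proportion * len(pts))
    drop = set()
    for axis in (0, 1):
        order = sorted(range(len(pts)), key=lambda i: pts[i][axis])
        drop.update(order[:k])
        drop.update(order[len(order) - k:])
    return [p for i, p in enumerate(pts) if i not in drop]

def mahalanobis_sq(pts):
    """Squared Mahalanobis distance of each point from the sample mean."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in pts) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)
    det = sxx * syy - sxy ** 2
    return [
        (syy * (p[0] - mx) ** 2
         - 2 * sxy * (p[0] - mx) * (p[1] - my)
         + sxx * (p[1] - my) ** 2) / det
        for p in pts
    ]

kept = marginal_trim(points, 0.05)
d2 = mahalanobis_sq(points)
most_extreme = points[d2.index(max(d2))]

print("outlier survives marginal trimming:", outlier in kept)
print("point with largest Mahalanobis distance:", most_extreme)
```

With several outliers the classical distance can itself be distorted, so this is only meant to show why the margins are the wrong place to look.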

Best, 

Kjetil Halvorsen



Re: cutting tails of samples

2002-01-30 Thread Harold W Kerster

  Look at minitab's trimmed mean.  It is a Tukey (I think) invention 
w/5% chopped from each end, leaving the central 90%.  For the high 
variance, high skew, common world, a good approach.

 
 





Re: cutting tails of samples

2002-01-30 Thread Dennis Roberts

here is an example from minitab ... in moore and mccabe's book intro. to 
practice of statistics ... 3rd edition ... they have an example of speed of 
light measurements ... with newcomb in his lab on the bank of the potomac 
... bouncing light bursts off the base of the washington monument ... and 
then collecting observations (n=66) in nanoseconds (whatever the heck that is)

[NOTE: just think about newcomb back in the late 1800s doing this ... i
wonder how many times he totally MISSED hitting the monument with his light
beam and it ended up in philly???]

MTB > dotp c10 c11;
SUBC> same.

Dotplot: nanodata, trimmed

[character dotplots of nanodata and trimmed on a shared axis, tick marks at
24765 24780 24795 24810 24825 24840]

the original data set of n=66 shows at least one extreme value at the left ...

now, of 66, the 5% rule would lop off about the lower 3 and upper 3 ...
which does little to the top but DOES lop off the extreme values at the bottom

i sorted the data and did that ... and the bottom distribution above is the 
n=60 one ... with the top and bottom 3 axed

in the desc. stats below, note that the original nanodata has a mean of 
24826 .. and the trimmed mean of 24827 ...

of course, the trimmed mean is THE mean in the trimmed data (n=60)

perhaps the more important thing is the variation (ie, sds) ... (or
variances if you like those better) ... trimming cuts the index number (sd
in this case) for dispersion dramatically ... the variance for example
would be only about 1/10th of the variance in the original set of
nanosecond data

MTB > desc c10 c11

Descriptive Statistics: nanodata, trimmed

Variable     N    Mean  Median  TrMean  StDev  SE Mean
nanodata    66   24826   24827   24827     11        1
trimmed     60   24827   24827   24827      4        1

Variable   Minimum  Maximum     Q1     Q3
nanodata     24756    24840  24824  24831
trimmed      24816    24836  24824  24830
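
The same sort, drop and describe steps can be sketched outside Minitab. The values below are synthetic stand-ins (NOT Newcomb's measurements), chosen only so that two low outliers sit far below a tight cluster. `trim_ends` is a hypothetical helper, and dropping floor(0.05*n) values per end is an assumption that happens to give 3 for n=66.

```python
import statistics

# Synthetic stand-in data (not the real measurements): 64 values in a tight
# cluster between 96 and 104, plus two low outliers.
data = [100 + (i % 9) - 4 for i in range(64)] + [30, 70]

def trim_ends(values, proportion):
    """Sort, then drop floor(proportion * n) values from each end."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))
    return ordered[k:len(ordered) - k]

trimmed = trim_ends(data, 0.05)   # 66 values: drops 3 from each end

for name, values in (("data", data), ("trimmed", trimmed)):
    print(name, len(values),
          round(statistics.mean(values), 2),
          round(statistics.stdev(values), 2))

# How much of the original variance is left after trimming.
variance_ratio = statistics.variance(trimmed) / statistics.variance(data)
print("variance ratio:", round(variance_ratio, 3))
```

As in the output above, the mean barely moves while the standard deviation drops sharply, because the low outliers contribute most of the original variance.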

there is no CLEAR cut rule for dealing with this situation, nor guidelines
for telling one if this should be done or not ... a lot depends on whether
these outlying values are real ... or if we can explain them away as
aberrations (miscodings, etc.)

so, i don't make a value judgement about whether what i did for an example is
good or not, i just give you this example (since i just HAPPEN to be doing
this thing in my class at the moment) as an illustration



 
 
 



Re: cutting tails of samples

2002-01-29 Thread Rich Ulrich

On 17 Jan 2002 00:05:02 -0800, [EMAIL PROTECTED] (Håkon) wrote:

 I have noticed a practice among some people dealing with enterprise
 data to cut the left and right tails off their samples (including
 census data) in both dependent and independent variables. The reason
 is that outliers tend to be extreme. The effects can be stunning. How
 is this practice to be understood statistically - as some form of
 truncation? References that deal formally with such a practice?

This is called trimming - 5% trimming, 25% trimming.
The median is what is left when you have done 50% trimming.

Trimming by 5% or 10% reportedly works well for your 
measures of 'central tendency', so long as you *know*  
that the extremes are not important.
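
A small sketch of the trimming proportions mentioned above, using a made-up right-skewed sample. The helper `trimmed_mean` is hypothetical. It assumes the convention that trimming never goes past the middle of the sorted sample, so 50% trimming leaves the middle one or two values, which is the median.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping floor(proportion * n) values from each end.

    Assumed convention: never trim past the middle, so proportion = 0.5
    leaves the middle one or two values, i.e. the median.
    """
    ordered = sorted(values)
    n = len(ordered)
    k = min(int(proportion * n), (n - 1) // 2)
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

# Made-up right-skewed sample of 20 values.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9, 15, 60]

for p in (0.0, 0.05, 0.25, 0.5):
    print(p, round(trimmed_mean(sample, p), 3))
print("median", statistics.median(sample))
```

As the proportion grows, the single large value loses its pull and the result moves from the ordinary mean toward the median.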

I don't know what it is that you refer to as 'enterprise data.'

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





cutting tails of samples

2002-01-17 Thread Håkon

I have noticed a practice among some people dealing with enterprise
data to cut the left and right tails off their samples (including
census data) in both dependent and independent variables. The reason
is that outliers tend to be extreme. The effects can be stunning. How
is this practice to be understood statistically - as some form of
truncation? References that deal formally with such a practice?

Regards

Håkon Finne
SINTEF
N-7465 Trondheim

