Re: cutting tails of samples
Harold W Kerster wrote:
> Look at minitab's trimmed mean. It is a Tukey (I think) invention w/5% chopped from each end, leaving the central 90%. For the high variance, high skew, common world, a good approach.

Oh, yeah, especially when setting reserves. Or calculating premiums. Those outliers just get in the way. Throw 'em out. How come we're not making any money?

Jon Miller
Re: cutting tails of samples
Hola! But the original poster asked about trimming IN THE MARGINAL DISTRIBUTIONS with multivariate data --- and that is totally insane. If trimming is to be used in a regression setting, the residuals should be trimmed, as in Rousseeuw and Leroy's LTS (least trimmed squares), implemented in at least three or four statistical packages, but not in Minitab. Trimming in the marginal distributions will not necessarily find multivariate outliers, and those may be the most important ones. It may even increase the influence of the multivariate outliers (make a picture in 2D; it is easy to see), so it might increase, not decrease, problems.

Best,
Kjetil Halvorsen

Dennis Roberts wrote:
> [snip]
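To make the 2-D picture Kjetil mentions concrete, here is a minimal sketch (added here, not part of any post) in Python with simulated data: a point that sits inside the 5th-95th percentile band of each marginal, so 5% marginal trimming leaves it alone, yet it has the largest Mahalanobis distance in the sample because it breaks the correlation structure.

# Minimal sketch (simulated data, not from the thread): a multivariate outlier
# that marginal trimming would keep but a covariance-aware distance flags.
import numpy as np

rng = np.random.default_rng(0)

# strongly correlated bivariate sample
n = 200
x = rng.normal(0.0, 1.0, n)
y = 0.9 * x + rng.normal(0.0, np.sqrt(1 - 0.9**2), n)
data = np.column_stack([x, y])

# one outlier: modest in each coordinate, but on the wrong side of the correlation
outlier = np.array([1.0, -1.0])
data = np.vstack([data, outlier])

# marginal 5% trimming: the outlier lies inside both marginal 5th-95th percentile bands
lo_x, hi_x = np.percentile(data[:, 0], [5, 95])
lo_y, hi_y = np.percentile(data[:, 1], [5, 95])
print("survives marginal trimming:",
      bool(lo_x <= outlier[0] <= hi_x and lo_y <= outlier[1] <= hi_y))   # typically True

# Mahalanobis distance uses the covariance, so the same point stands out
diff = data - data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
print("outlier has largest Mahalanobis distance:", int(np.argmax(d2)) == data.shape[0] - 1)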
Re: cutting tails of samples
Look at minitab's trimmed mean. It is a Tukey (I think) invention w/5% chopped from each end, leaving the central 90%. For the high variance, high skew, common world, a good approach.

On Tue, 29 Jan 2002, Rich Ulrich wrote:
> [snip]
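As a concrete illustration of the trimmed mean Kerster describes, here is a small Python sketch (added, not from the thread) on skewed data; it cuts floor(0.05*n) observations from each tail, the usual convention, though Minitab's TrMean rounding may differ slightly.

# 5%-each-end trimmed mean on skewed data; illustrative only.
# Cutting floor(prop*n) per tail matches scipy's convention; Minitab's TrMean may round differently.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # high variance, high skew

def trimmed_mean(a, prop=0.05):
    """Sort, drop floor(prop*n) observations from each end, average the rest."""
    a = np.sort(np.asarray(a))
    k = int(np.floor(prop * a.size))
    return a[k:a.size - k].mean()

print("ordinary mean  :", x.mean())
print("5% trimmed mean:", trimmed_mean(x, 0.05))
print("scipy trim_mean:", stats.trim_mean(x, proportiontocut=0.05))
print("median         :", np.median(x))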
Re: cutting tails of samples
here is an example from minitab ... in moore and mccabe's book intro. to practice of statistics ... 3rd edition ... they have an example of speed of light measurements ... with newcomb in his lab on the bank of the potomac ... bouncing light bursts off the base of the washington monument ... and then collecting observations (n=66) in nanoseconds (whatever the heck that is)

[NOTE: just think about newcomb back in the mid 1800s doing this ... i wonder how many times he totally MISSED hitting the monument with his light beam and it ended up in philly???]

MTB > dotp c10 c11;
SUBC> same.

Dotplot: nanodata, trimmed
[character dotplots of nanodata (n=66) and trimmed (n=60), on a common axis from 24765 to 24840]

the original data set of n=66 shows at least one extreme value at the left ... now, of 66, the 5% rule would lop off about the lower 3 and upper 3 ... which does little to the top but DOES lop off the extreme values at the bottom ... i sorted the data and did that ... and the bottom distribution above is the n=60 one ... with the top and bottom 3 axed

in the desc. stats below, note that the original nanodata has a mean of 24826 ... and the trimmed mean of 24827 ... of course, the trimmed mean is THE mean in the trimmed data (n=60)

perhaps the more important thing is the variation (ie, sds) ... (or variances if you like those better) ... trimming cuts the index number (sd in this case) for dispersion dramatically ... the variance, for example, would be only about 1/10th of the variance in the original set of nanosecond data

MTB > desc c10 c11

Descriptive Statistics: nanodata, trimmed

Variable     N     Mean   Median   TrMean   StDev   SE Mean
nanodata    66    24826    24827    24827      11         1
trimmed     60    24827    24827    24827       4         1

Variable   Minimum   Maximum       Q1       Q3
nanodata     24756     24840    24824    24831
trimmed      24816     24836    24824    24830

there is no CLEAR cut rule for dealing with this situation, nor guidelines for telling one if this should be done or not ... a lot depends on whether these outlying values are real ... or if we can explain them away as aberrations (miscodings, etc.) ... so, i don't make a value judgment about whether what i did for an example is good or not, i just give you this example (since i just HAPPEN to be doing this thing in my class at the moment) as an illustration

At 01:44 PM 1/30/02 -0700, Harold W Kerster wrote:
> [snip]
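Dennis's exercise is easy to repeat in Python. The sketch below (added, not from the thread) uses a made-up stand-in array rather than the actual 66 Newcomb values from Moore and McCabe; substitute the real column to reproduce his numbers.

# Sort, drop the 3 smallest and 3 largest of the 66 values, compare mean/sd.
# `demo` is made-up stand-in data, NOT the actual Newcomb measurements.
import numpy as np

def trim_k_each_end(a, k):
    """Sort and drop the k smallest and k largest values."""
    a = np.sort(np.asarray(a))
    return a[k:a.size - k]

def summarize(name, v):
    print(f"{name:9s} n={v.size:3d}  mean={v.mean():9.1f}  "
          f"median={np.median(v):9.1f}  sd={v.std(ddof=1):6.2f}")

rng = np.random.default_rng(3)
demo = np.concatenate([rng.normal(24827, 5, 64), [24756, 24780]])   # 66 stand-in values

summarize("nanodata", demo)
summarize("trimmed", trim_k_each_end(demo, 3))   # 3 is roughly 5% of n=66 from each end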
Re: cutting tails of samples
On 17 Jan 2002 00:05:02 -0800, [EMAIL PROTECTED] (Håkon) wrote:
> I have noticed a practice among some people dealing with enterprise data to cut the left and right tails off their samples (including census data) in both dependent and independent variables. The reason is that outliers tend to be extreme. The effects can be stunning. How is this practice to be understood statistically - as some form of truncation? References that deal formally with such a practice?

This is called trimming - 5% trimming, 25% trimming. The median is what is left when you have done 50% trimming. Trimming by 5% or 10% reportedly works well for your measures of 'central tendency', so long as you *know* that the extremes are not important. I don't know what it is that you refer to as 'enterprise data.'

--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
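Rich's remark that the median is the limiting case of trimming can be checked directly. A small sketch (added, not from the thread) with scipy.stats.trim_mean and an arbitrary skewed sample:

# As the proportion cut from each tail approaches 0.5, the trimmed mean approaches the median.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=10.0, size=1001)   # skewed illustrative sample

print("median:", np.median(x))
for p in (0.05, 0.10, 0.25, 0.45, 0.499):
    print(f"trim {p:5.3f} from each tail -> trimmed mean {stats.trim_mean(x, p):.3f}")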
cutting tails of samples
I have noticed a practice among some people dealing with enterprise data to cut the left and right tails off their samples (including census data) in both dependent and independent variables. The reason is that outliers tend to be extreme. The effects can be stunning. How is this practice to be understood statistically - as some form of truncation? References that deal formally with such a practice?

Regards
Håkon Finne
SINTEF
N-7465 Trondheim