Robust regression
I know that robust regression can downweight outliers. Should someone apply robust regression when the data have skewed distributions but do not have outliers? Regression assumptions require normality of the residuals, not normality of the raw scores. So does it help at all to use robust regression in this situation? Any help will be appreciated.

= Instructions for joining and leaving this list, remarks about the problem of INAPPROPRIATE MESSAGES, and archives are available at http://jse.stat.ncsu.edu/ =
Re: Robust regression
On 1 Mar 2002 00:36:01 -0800, [EMAIL PROTECTED] (Alex Yu) wrote:

> Should someone apply robust regression when the data have skewed
> distributions but do not have outliers?

Go ahead and do it if you want. If someone asks (or even if they don't), you can tell them that robust regression gives exactly the same result.

-- Rich Ulrich, [EMAIL PROTECTED] http://www.pitt.edu/~wpilib/index.html
Re: Robust regression
You don't need normality for regression. You may need it for certain optimality properties to hold, but you can apply OLS without normality.

On 1 Mar 2002, Alex Yu wrote:

> Should someone apply robust regression when the data have skewed
> distributions but do not have outliers? Regression assumptions require
> normality of residuals, but not the normality of raw scores.
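The point that OLS itself requires no normality can be seen from the estimator: the closed-form normal equations involve no distributional assumption at all. A minimal pure-Python sketch (my illustration, not from the thread) for a one-predictor model:

```python
# OLS for a one-predictor model via the closed-form normal equations.
# Nothing in this computation assumes normal errors; normality only matters
# for exact small-sample t/F inference and certain efficiency claims.

def ols_fit(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Works regardless of the residual distribution; here, exact linear data.
x = [1, 2, 3, 4, 5]
y = [2.0 + 3.0 * xi for xi in x]
b0, b1 = ols_fit(x, y)
```

The estimate is computable, and unbiased under the usual error assumptions, even with skewed residuals; the distributional assumption enters only when you want exact finite-sample inference.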
Re: Robust regression
If, for example, the normality assumption holds, then by doing robust regression instead of OLS you lose efficiency. So it's not the same result after all. But you can do both, compare, and decide: if robust regression produces results that are not really different from OLS, then stay with OLS.

On Fri, 1 Mar 2002, Rich Ulrich wrote:

> If someone asks (or even if they don't), you can tell
> them that robust regression gives exactly the same result.
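The efficiency loss is easy to demonstrate in the simplest possible setting. The Monte Carlo sketch below is my illustration (the location problem stands in for regression): on clean normal data the median is unbiased like the mean, but noisier — exactly the trade-off described above.

```python
# Monte Carlo: mean vs. median as location estimators on clean normal data.
# Both are centered on the true value (0), but the robust estimator (median)
# has a larger sampling SD -- the "efficiency loss" under normality.
import random
import statistics

random.seed(0)
reps, n = 2000, 25
means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

sd_mean = statistics.stdev(means)
sd_median = statistics.stdev(medians)
# Asymptotically sd_median / sd_mean -> sqrt(pi/2) = 1.25 for normal data.
```

The same logic carries over to regression: a robust slope estimator pays an efficiency premium when the errors really are normal and outlier-free.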
help: weighted robust regression
Hi, Does someone know how to include weights in the S-Plus rdl1.s algorithm (the robust regression algorithm developed by Hubert & Rousseeuw)? Of course, the algorithm already includes a weighting scheme (based on distances of the x points with respect to a robust center of an ellipsoid), but I want, before entering the procedure, to put more weight on some x points and less on others. Does that make sense? If so, how can we do it?

I considered using the lmRobMM function (the algorithm developed by Yohai et al., also available in S-Plus) because it includes a "weights" argument, but my problem includes regressors that are continuous and others that are binary, and I don't know whether the algorithm can handle such categorical variables. Even if it can, the default number of random subsamples drawn (and needed by the algorithm) is 4.6*2^ncol(x); I have 10 continuous variables + 1 categorical with 20 levels (which recoded gives 20 dummy vars), so the total is 30. Of course, I could change this default and set a more "reasonable" number, but the choice would inevitably be so small relative to the default that I seriously doubt the validity of the result anyway. Can someone help?

The exact references for the papers cited above are:
* Hubert, M., and Rousseeuw, P. J., Robust regression with both continuous and binary regressors. http://win-www.uia.ac.be/u/statis/publicat/#j1990
* Yohai, V., Stahel, W. A., and Zamar, R. H. (1991). A procedure for robust estimation and inference in linear regression, in Stahel, W. A. and Weisberg, S. W., Eds., Directions in Robust Statistics and Diagnostics, Part II. Springer-Verlag.

Patrick
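I don't know the rdl1.s internals, but one generic way prior case weights are often combined with a robust fit is to multiply them into the robustness weights inside an iteratively reweighted least squares (IRLS) loop. A pure-Python sketch for a one-predictor Huber-type fit (the function name and setup are mine; this is emphatically not what rdl1.s or lmRobMM does internally):

```python
# Generic sketch: user-supplied prior case weights (case_w) multiplied into
# the data-driven robustness weights of an IRLS loop for the Huber loss.

def weighted_huber_fit(x, y, case_w, k=1.345, iters=50):
    """IRLS for a one-predictor Huber regression with fixed prior weights."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        # Robust scale: (rough) median absolute residual, normalized for
        # consistency at the normal; fall back to 1.0 if residuals are all 0.
        s = sorted(abs(r) for r in resid)[len(resid) // 2] / 0.6745 or 1.0
        # Huber weight: 1 inside +-k*s, downweighted outside; the prior
        # case weight multiplies the robustness weight.
        w = [cw * (1.0 if abs(r) <= k * s else k * s / abs(r))
             for cw, r in zip(case_w, resid)]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        b1 = sxy / sxx
        b0 = my - b1 * mx
    return b0, b1

# Usage: equal prior weights on exact linear data recover the line.
x = list(range(1, 11))
y = [1.0 + 2.0 * xi for xi in x]
b0, b1 = weighted_huber_fit(x, y, [1.0] * len(x))
```

Whether pre-weighting in this multiplicative way is statistically appropriate depends on what the weights represent (precision, sampling design, subjective importance), so it is worth checking against the documented semantics of the "weights" argument in whatever routine you end up using.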
Robust Regression MM vs. LTS
Hi, Can anyone explain on the group, or point me to an appropriate website or book, which discusses robust regression? In particular I'm interested in the differences between MM and LTS robust regression and simple linear regression. What advantages and disadvantages exist for each? I understand that simple linear regression makes more assumptions about the incoming data, but I don't know exactly what needs to be assumed when doing a robust regression.

I have skimmed Yohai, Stahel, and Zamar (1991), as that is referenced in the S-PLUS help on lmRobMM. Unfortunately it turned out to be more of a discussion of the MM robust regression method used by S-PLUS than an introduction to the terms, and there was no comparison with LTS robust or LS regression.

Mike Joner
Robust Regression and Excel for Stats
At 04:00 PM 1/9/02 -0700, Michael Joner wrote:

> Can anyone explain on the group or point me to an appropriate website
> or book which discusses robust regression? Particularly I'm interested
> in the differences between the MM and LTS robust regressions and a
> simple linear regression.

A book:
Rousseeuw, P. J., and Leroy, A. M. (1987), Robust Regression and Outlier Detection, New York: John Wiley

A web site:
http://win-www.uia.ac.be/u/statis/

A web site for LMS (described below):
http://www.wabash.edu/econexcel/LMSOrigin/Home.htm

A [simple, nontechnical] explanation: LS (aka OLS, choose coefficients to min SSR) has a low breakdown point, which means that even one outlier can severely tilt the fitted line. Say you have y = beta0 + beta1*x + epsilon where epsilon ~ N(0, sigma), but the last data point is y + outlierfactor. Say beta0 = 1, beta1 = 5, sigma = 10, and outlierfactor = 100, with the xs running from 1 to 11 by 1. Here's a single sample of dirty data:

 X      Y
 1    -7.8
 2    16.0
 3    13.7
 4     4.5
 5    27.6
 6    23.7
 7    36.3
 8    28.2
 9    47.3
10    42.3
11   149.5

The LS fitted line is: Predicted y = -22.55 + 9.54x
The LMS fitted line is: Predicted y = 8.9 + 2.9x

This shows pretty clearly that a single outlier can throw the LS fit way off. To deal with this, folks (Rousseeuw should get a lot of credit here) have invented robust (or resistant) regression techniques. Basically, this means that a different objective function than min SSR is used to choose the coefficients. The objective function is chosen so as not to be influenced by outliers. Lines fitted with such objective functions are said to be robust (or resistant) to outliers (which LS is clearly not). Consider LMS, least median of squares: instead of min SSR, the coefficients are chosen to minimize the MEDIAN of the squared residuals.
The sample data above show how the LMS fit basically ignores the outlier. It's just like with a list of numbers: 1, 2, 3 has a median and an average of 2. 1, 2, 99? The median is still 2, but the average is very high. When you have outliers, LMS works well because it ignores how far away each data point is.

"Oh, so I'll always use LMS. It's a silver bullet!" Not so fast. The LMS.xls workbook shows a Monte Carlo simulation comparing LS and LMS on clean data. Here are the results:

CLEAN DATA (NO OUTLIERS)
            LS b1       LMS b1      Population Parameters
Average     5.008        4.997      beta1 = 5
SD          0.9503       1.8759     OutlierFactor = 100
Max         9.175       13.415
Min         1.447       -3.692

LS beats LMS because both sampling distributions are centered on the parameter value, but the LS histogram is much more spiked. LMS ignores distance away in favor of "above/below the middle observation," which is good when there are outliers, but it also means LMS is throwing away a lot of information -- which is bad if the data are clean.

But if you compare these two estimators on dirty data, LMS wins, because LS's bias is killing it. Look at the table below: not once in 10,000 samples did LS ever get a sample slope as low as the true value of 5.

DIRTY DATA (Y11 is an outlier)
            LS b1       LMS b1      Population Parameters
Average     9.539        5.009      beta1 = 5
SD          0.9600       2.0780     OutlierFactor = 100
Max         12.894      18.021
Min         6.071       -3.736

"So, when do I use LS and when LMS or LTS or any other robust regression? I mean, 'LS if clean; LMS if dirty' is a dumb rule because how will I know if the data are clean or dirty?" Excellent point. There's no algorithmic way to decide. You need to know the process that generated the data. You need judgment. But at least now you know a little more about robust regression than you did before you started reading this.
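The LS/LMS comparison on the dirty sample can be reproduced approximately. Exact LMS is a hard combinatorial problem, so the sketch below uses a common cheap approximation (my choice, not what the LMS.xls workbook does): candidate lines through every pair of points, keeping the one with the smallest median squared residual.

```python
# LS vs. (approximate) LMS on the "dirty" sample from the post.
from itertools import combinations
from statistics import median

X = list(range(1, 12))
Y = [-7.8, 16.0, 13.7, 4.5, 27.6, 23.7, 36.3, 28.2, 47.3, 42.3, 149.5]

def ls_fit(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    return my - b1 * mx, b1

def med_sq_resid(b0, b1):
    return median((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(X, Y))

# Candidate lines through every pair of points; keep the one whose
# MEDIAN squared residual is smallest.
best = None
for (x1, y1), (x2, y2) in combinations(zip(X, Y), 2):
    if x1 == x2:
        continue
    b1 = (y2 - y1) / (x2 - x1)
    b0 = y1 - b1 * x1
    m = med_sq_resid(b0, b1)
    if best is None or m < best[0]:
        best = (m, b0, b1)

ls_b0, ls_b1 = ls_fit(X, Y)       # the outlier drags this slope up toward 9.5
lms_m, lms_b0, lms_b1 = best      # largely ignores the point (11, 149.5)
```

The pair search will not match the workbook's LMS coefficients exactly, but it shows the same qualitative story: the approximate LMS line has a much flatter slope and a far smaller median squared residual than the LS line.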
:-) As for Excel for teaching statistics and quantitative analysis, I would like to respond to Ken K, who said:

> Whenever I see statistics training using Excel it immediately makes
> me suspect that people who don't understand/use statistics are
> making the software decision.

Mr. K, you are painting with an awfully broad brush. I think there are very good reasons for using Excel to teach introductory statistics. I team teach stats and econometrics with my colleague, Frank Howland. We place heavy emphasis on concrete examples and Monte Carlo simulation with Excel, and I think we deliver very good courses. I do not use Excel's RAND() or VB's Rnd when doing Monte Carlos. I am aware of many deficiencies in Excel, and I grant there are mistakes I am not aware of, but consider a short list of Excel's benefits: 1) Student familiarity
Re: Robust Regression and Excel for Stats
Hi Humberto, You have given an excellent simplified account of the usefulness of robust regression and followed it with wholehearted support for using Excel in statistics. It looks like you belong to a delta (in the mathematical sense) group. Anyway, what do you think about the credibility of Excel simulation (using VBA) results in a research environment? Will you go for it? Cheers. Siddeek

Humberto Barreto wrote:

> A [simple, nontechnical] explanation: LS (aka OLS, choose coefficients
> to min SSR) has a low breakdown point, which means that even one
> outlier can severely tilt the fitted line. [snip]
Re: Robust Regression and Excel for Stats
Thanks for all the information. Do you know anything about the other variations of robust regression? Does it make a big difference if I use an MM regression, or LTS, or LMS? Mike

On 11 Jan 2002 11:07:59 -0800 [EMAIL PROTECTED] (Humberto Barreto) wrote:

> A book:
> Rousseeuw, P. J., and Leroy, A. M. (1987), Robust Regression and
> Outlier Detection, New York: John Wiley
>
> A web site:
> http://win-www.uia.ac.be/u/statis/
>
> A web site for LMS (described below):
> http://www.wabash.edu/econexcel/LMSOrigin/Home.htm
> [snip]
RE: Robust Regression and Excel for Stats
>= Original Message From Michael Joner <[EMAIL PROTECTED]> =
> Does it make a big difference if I use
> an MM regression, or LTS, or LMS?

Good question. I answered your first post from a basic, introductory level; I was trying to convey the idea of robust regression. I used LMS as my example of a robust estimator for two reasons: (1) it is reasonably easy to understand, and (2) I had a ready-made example in Excel which I wanted to use as evidence that Excel is not completely worthless for teaching stats. I felt I was on firm ground, but you are now moving into deeper intellectual waters and I am treading water just like you. I will give you my opinion, based on what I know right now, but I am not nearly as sure of myself as I was before.

In my attempt to answer you, I ran across the work of Doug Martin and Andreas Ruckstuhl. I am cc'ing them on this post in the hope that they can correct any mistakes here and explain, in clear language, what MM in S-Plus is doing.

First, I think it's pretty clear that LMS is dominated by LTS or MM because of the large SE of the LMS estimator. I found an excellent post to the S-Plus list from Doug Martin: http://www.math.yorku.ca/Who/Faculty/Monette/S-news/0032.html I recommend that you read this carefully. He makes it clear that LTS and MM are attempts to improve the efficiency of the robust estimator without compromising its robustness to outliers.

As for which form of robust regression to run, I do not believe there is a clear answer. You can intuitively see that this is going to be an exercise in trading off efficiency for robustness, and an optimal estimator is going to be a function of the data or the particular problem at hand. I am not an S-Plus user, but it looks like S-Plus will give you LTS and MM pretty easily.
The S-Plus 2000 Release Notes, which are available many places on the web, e.g., http://www.uni-koeln.de/themen/Statistik/s/v51/readme_win.txt say the following:

Robust LTS regression (ltsreg): By default, ltsreg now uses 10% trimming. Previously it used 50% trimming. This change was made in response to user feedback that the default trimming of 50% was too extreme in most cases.

Robust MM regression (lmRobMM): The Robust MM Regression dialog now has a default Resampling Method of "Auto", which uses the sample size and number of variables to determine which resampling method to use. The command line function lmRobMM() is unchanged.

I couldn't find a clear explanation of what exactly MM is doing. I fear you're going to have to read the paper that started this:

Yohai, V., Stahel, W. A., and Zamar, R. H. (1991). A procedure for robust estimation and inference in linear regression, in Stahel, W. A. and Weisberg, S. W., Eds., Directions in Robust Statistics and Diagnostics, Part II. Springer-Verlag.

It looks like this might also be a good source:

Marazzi, A. (1993). Algorithms, Routines, and S Functions for Robust Statistics. Wadsworth & Brooks/Cole, Pacific Grove, CA.

After you figure out exactly what MM and LTS are doing, I would suggest trying all of them: LS, LMS, LTS, and MM. Robust regression estimates are the result of complicated (read "lots of room for mistakes") algorithms. You need to be wary. I would also recommend that you think carefully about the process that generated the data. Why are you worried about outliers?

I am sorry that this is not a clean, clear answer. Perhaps others can offer better, more grounded advice. Burble burble . . . :-))

Humberto Barreto
[EMAIL PROTECTED]
(765) 361-6315

= Instructions for joining and leaving this list, remarks about the problem of INAPPROPRIATE MESSAGES, and archives are available at http://jse.stat.ncsu.edu/ =
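The difference between the LMS and LTS objectives mentioned in the thread can be made concrete with a toy search. The sketch below is illustrative only (it is not the FAST-LTS algorithm behind ltsreg, and the pair-based candidate search is a crude approximation): over the same candidate lines, LMS minimizes the median squared residual, while LTS minimizes the sum of the h smallest squared residuals.

```python
# LMS vs. LTS objectives over the same candidate lines, on the dirty
# sample used earlier in the thread.  Pair-based candidates only.
from itertools import combinations

X = list(range(1, 12))
Y = [-7.8, 16.0, 13.7, 4.5, 27.6, 23.7, 36.3, 28.2, 47.3, 42.3, 149.5]
h = 8  # LTS keeps the 8 best-fitting of 11 points (trims the worst 3)

def sorted_sq_resids(b0, b1):
    return sorted((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(X, Y))

def search(objective):
    """Minimize objective(sorted squared residuals) over pairwise lines."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(X, Y), 2):
        if x1 == x2:
            continue
        b1 = (y2 - y1) / (x2 - x1)
        b0 = y1 - b1 * x1
        val = objective(sorted_sq_resids(b0, b1))
        if best is None or val < best[0]:
            best = (val, b0, b1)
    return best

lms = search(lambda r: r[len(r) // 2])   # median squared residual
lts = search(lambda r: sum(r[:h]))       # trimmed sum of squares
```

Because the LTS objective sums several residuals instead of looking at a single order statistic, it uses more of the data than LMS, which is the intuition behind its better efficiency; both still shrug off the outlier that wrecks the LS slope (about 9.5 on this sample).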