Hi Humberto,
You have given an excellent, simplified account of the usefulness of robust regression and followed it with wholehearted support for using Excel in statistics. It looks like you belong to a delta (in the mathematical sense) group. Anyway, what do you think about the credibility of Excel simulation (using VBA) results in a research environment? Will you go for it? Cheers. Siddeek
 
 

Humberto Barreto wrote:

 At 04:00 PM 1/9/02 -0700, Michael Joner wrote:
Hi,

Can anyone explain on the group or point me to an appropriate website or
book which discusses robust regression?  Particularly I'm interested in
the differences between the MM and LTS robust regressions and a simple
linear regression.  What advantages and disadvantages exist for each?  I
understand that the simple linear regression has more assumptions on the
incoming data, but I don't know exactly what needs to be assumed when
doing a robust regression.

A book:
Rousseeuw, P.J., and Leroy, A.M. (1987), Robust Regression and Outlier Detection, New York: John Wiley

A web site:
http://win-www.uia.ac.be/u/statis/

A web site for LMS (described below):
http://www.wabash.edu/econexcel/LMSOrigin/Home.htm

A [simple, nontechnical] explanation:
LS (aka OLS; choose the coefficients to minimize the sum of squared residuals, SSR) has a low "breakdown point," which means that even a single outlier can severely tilt the fitted line.

Say you have y = beta0 + beta1*x + epsilon, where epsilon ~ N(0, sigma), but the last data point is contaminated: its y value has outlierfactor added to it.  Say beta0 = 1, beta1 = 5, sigma = 10, and outlierfactor = 100, with the xs running from 1 to 11 by 1.
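(In case it helps to see the setup concretely, here is a minimal sketch of that data-generating process in Python -- my own translation, since the original example lives in an Excel workbook; the seed is arbitrary.)

    import numpy as np

    # Minimal sketch of the DGP described above (a Python translation
    # of the Excel setup; the seed is chosen arbitrarily).
    rng = np.random.default_rng(0)

    beta0, beta1, sigma, outlier_factor = 1, 5, 10, 100
    x = np.arange(1, 12)                        # xs from 1 to 11 by 1
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=x.size)
    y[-1] += outlier_factor                     # contaminate the last point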

Here's a single sample of dirty data:

     X       Y
     1     -7.8
     2     16.0
     3     13.7
     4      4.5
     5     27.6
     6     23.7
     7     36.3
     8     28.2
     9     47.3
    10     42.3
    11    149.5

The LS fitted line is Predicted y = -22.55 + 9.54x
The LMS fitted line is Predicted y = 8.9 + 2.9x

This shows pretty clearly that a single outlier can throw the LS fit way off.

To deal with this, folks (Rousseeuw deserves a lot of credit here) have invented robust (or resistant) regression techniques.  Basically, this means that an objective function other than min SSR is used to choose the coefficients, one chosen so that it is not unduly influenced by outliers.  Lines fitted with such objective functions are said to be robust (or resistant) to outliers, which LS clearly is not.  Consider LMS, least median of squares: instead of minimizing the SSR, the coefficients are chosen to minimize the MEDIAN of the squared residuals.
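(Exact LMS is a nasty combinatorial optimization problem; serious implementations use algorithms like Rousseeuw's PROGRESS.  Just to make the objective function concrete, here is a rough Python sketch using the simple line-through-every-pair-of-points approximation -- my illustration, not how LMS.xls actually computes it.)

    import numpy as np
    from itertools import combinations

    def lms_fit(x, y):
        # Approximate least median of squares: try the line through
        # every pair of points and keep the one whose squared
        # residuals have the smallest median.  An elemental-set
        # approximation, not an exact LMS solver.
        best, best_med = (0.0, 0.0), np.inf
        for i, j in combinations(range(len(x)), 2):
            if x[i] == x[j]:
                continue                        # vertical line; skip
            b1 = (y[j] - y[i]) / (x[j] - x[i])
            b0 = y[i] - b1 * x[i]
            med = np.median((y - b0 - b1 * x) ** 2)
            if med < best_med:
                best, best_med = (b0, b1), med
        return best                             # (intercept, slope)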

The sample data above show how the LMS fit basically ignores the outlier.  It's just like with a list of numbers: 1, 2, 3 has a median of 2 and an average of 2.  Change it to 1, 2, 99 and the median is still 2, but the average jumps to 34.  When you have outliers, LMS works well because it ignores how far away each data point is.

"Oh, so I'll always use the LMS.  It's a silver bullet!"

Not so fast.  The LMS.xls workbook contains a Monte Carlo simulation comparing LS and LMS on clean data.  Here are the results:

CLEAN DATA (NO OUTLIERS)

              LS b1      LMS b1     Population Parameters
    Average    5.008      4.997     beta1         =   5
    SD         0.9503     1.8759    OutlierFactor = 100
    Max        9.175     13.415
    Min        1.447     -3.692

LS beats LMS because both sampling distributions are centered on the parameter value, but the LS histogram is much more spiked (smaller SD).  LMS ignores distance in favor of "above/below the middle observation," which is good when there are outliers, but it throws away a lot of information, which is bad when the data are clean.

But if you compare these two estimators on dirty data, LMS wins because LS's bias is killing it.  Look at the table below: not once in 10,000 samples did LS get a sample slope near 5 (its minimum was 6.071).

DIRTY DATA (Y11 is an outlier)

              LS b1      LMS b1     Population Parameters
    Average    9.539      5.009     beta1         =   5
    SD         0.9600     2.0780    OutlierFactor = 100
    Max       12.894     18.021
    Min        6.071     -3.736
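(For readers without Excel handy, a Monte Carlo along these lines could be run as sketched below, reusing the lms_fit function from the earlier sketch.  The 10,000 replications mirror the experiment above, but the seed is mine, so the numbers will not match the tables digit for digit.)

    import numpy as np

    def monte_carlo(reps=10_000, dirty=True, seed=1):
        # Draw `reps` samples from the DGP, fit LS and (approximate)
        # LMS to each, and summarize the sampling distributions of
        # the two slope estimates.
        rng = np.random.default_rng(seed)
        x = np.arange(1, 12)
        ls_b1, lms_b1 = np.empty(reps), np.empty(reps)
        for r in range(reps):
            y = 1 + 5 * x + rng.normal(0, 10, size=x.size)
            if dirty:
                y[-1] += 100                    # outlier in the last point
            ls_b1[r] = np.polyfit(x, y, 1)[0]   # slope of the LS fit
            lms_b1[r] = lms_fit(x, y)[1]        # slope of the LMS fit
        for name, b in (("LS b1", ls_b1), ("LMS b1", lms_b1)):
            print(f"{name}: avg {b.mean():.3f}  SD {b.std():.4f}  "
                  f"max {b.max():.3f}  min {b.min():.3f}")

    monte_carlo(dirty=True)   # dirty-data experiment; dirty=False for clean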
 

"So, when do I use LS and when LMS or LTS or any other robust regression?  I mean, "LS if clean; LMS if dirty" is a dumb rule because how will I know if the data are clean or dirty?"

Excellent point. There's no algorithmic way to decide. You need to know the process that generated the data.  You need judgment.  But at least now you know a little more about robust regression than you did before you started reading this. :-)
 
 

As for using Excel to teach statistics and quantitative analysis, I would like to respond to Ken K., who said:

> Whenever I see statistics training using Excel it immediately makes
> me suspect that people who don't understand/use statistics are
> making the software decision.

Mr. K, you are painting with an awfully broad brush.  I think there are very good reasons for using Excel to teach introductory statistics.  I team-teach stats and econometrics with my colleague, Frank Howland.  We place heavy emphasis on concrete examples and Monte Carlo simulation with Excel, and I think we deliver very good courses.

I do not use Excel's RAND() or VB's Rnd when running Monte Carlo simulations (one possible substitute is sketched after the list below).  I am aware of many deficiencies in Excel, and I grant there are mistakes I am not aware of, but consider a short list of Excel's benefits:

1) Student familiarity
2) Installed base
3) Data import features (including web links)
4) Ability to see formulas and how numbers are being calculated (I did not understand at all the charge that Excel "lacks an audit trail" -- what is that about?)
5) Buttons and other controls to tailor the environment for the student
6) Visual Basic for Monte Carlo and other advanced programming
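To make the RAND() remark above concrete: the quality of the underlying random number generator matters enormously in a Monte Carlo.  Here is one hedged illustration, in Python rather than VBA (NumPy's default generator is PCG64, a modern, well-tested generator):

    import numpy as np

    # Instead of a spreadsheet's built-in RAND(), draw from a
    # well-vetted, seedable generator so the whole Monte Carlo is
    # reproducible.  NumPy's default_rng uses PCG64.
    rng = np.random.default_rng(seed=12345)
    epsilons = rng.normal(loc=0.0, scale=10.0, size=11)   # N(0, 10) errors
    uniforms = rng.random(size=5)                         # U(0,1), like RAND()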

These are the primary reasons why I use Excel to teach stats and econometrics.  I think they are reasonable.  For all of you haters of Excel, please visit http://www.wabash.edu/econexcel/LMSOrigin/Home.htm
and take a look at LMS.xls to see what I mean.  What do you think?

BTW, the file LMSOrigin.xls shows that not even SAS is immune from flaws.  There simply is no such thing as perfect software.
 
 
Prof. Humberto Barreto
Department of Economics
Wabash College
Crawfordsville, IN
[EMAIL PROTECTED]
Voice: (765) 361-6315
FAX: (765) 361-6277
http://www.wabash.edu/econexcel
