Re: Robust regression

2002-03-01 Thread Rich Ulrich

On 1 Mar 2002 00:36:01 -0800, [EMAIL PROTECTED] (Alex Yu)
wrote:

 
 I know that robust regression can downweight outliers. Should someone
 apply robust regression when the data have skewed distributions but do not
 have outliers? Regression assumptions require normality of the residuals, but
 not normality of the raw scores. So does it help at all to use robust
 regression in this situation? Any help will be appreciated.

Go ahead and do it if you want.  

If someone asks (or even if they don't), you can tell 
them that robust regression gives exactly the same result.


-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Robust regression

2002-03-01 Thread Vadim and Oxana Marmer

If, for example, the normality assumption holds, then by doing robust
regression instead of OLS you lose efficiency. So it's not quite the same
result after all. But you can do both, compare, and decide: if robust
regression produces results that are not really different from the OLS
results, then stay with OLS.
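
The "do both and compare" advice is easy to sketch numerically. Below is a minimal numpy illustration using a Huber-type M-estimator (one common robust estimator, standing in for "robust regression" generally; the data are simulated, not from any real study). On clean normal data the robust and OLS slopes nearly coincide; with one gross outlier they diverge:

```python
import numpy as np

def ols(x, y):
    """Ordinary least squares; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def huber(x, y, c=1.345, iters=50):
    """Huber M-estimate via iteratively reweighted least squares.
    Scale is the MAD of the initial OLS residuals, held fixed."""
    X = np.column_stack([np.ones_like(x), x])
    b = ols(x, y)
    r = y - X @ b
    s = np.median(np.abs(r - np.median(r))) / 0.6745   # MAD scale estimate
    for _ in range(iters):
        r = y - X @ b
        w = np.minimum(1.0, c * s / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)
        b, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return b

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y_clean = 1.0 + 5.0 * x + rng.normal(0.0, 10.0, size=50)

b_ols = ols(x, y_clean)
b_hub = huber(x, y_clean)
slope_gap_clean = abs(b_ols[1] - b_hub[1])   # small: stay with OLS

y_dirty = y_clean.copy()
y_dirty[-1] += 300.0                         # one gross outlier
b_ols_d = ols(x, y_dirty)
b_hub_d = huber(x, y_dirty)                  # robust fit stays near slope 5
```

If `slope_gap_clean` is negligible, the comparison itself is the argument for staying with OLS.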

On Fri, 1 Mar 2002, Rich Ulrich wrote:

 On 1 Mar 2002 00:36:01 -0800, [EMAIL PROTECTED] (Alex Yu)
 wrote:

 
  I know that robust regression can downweight outliers. Should someone
  apply robust regression when the data have skewed distributions but do not
  have outliers? Regression assumptions require normality of residuals, but
  not the normality of raw scores. So does it help at all to use robust
  regression in this situation? Any help will be appreciated.

 Go ahead and do it if you want.

 If someone asks (or even if they don't), you can tell
 them that robust regression gives exactly the same result.


 --
 Rich Ulrich, [EMAIL PROTECTED]
 http://www.pitt.edu/~wpilib/index.html







Robust regression

2002-02-28 Thread Alex Yu


I know that robust regression can downweight outliers. Should someone
apply robust regression when the data have skewed distributions but do not
have outliers? Regression assumptions require normality of residuals, but
not the normality of raw scores. So does it help at all to use robust
regression in this situation? Any help will be appreciated.






RE: Robust Regression and Excel for Stats

2002-01-12 Thread Humberto Barreto

= Original Message From Michael Joner [EMAIL PROTECTED] =
Does it make a big difference if I use
an MM regression, or LTS, or LMS?

Good question.

I answered your first post at a basic, introductory level.  I was trying to 
convey the idea of robust regression.  I used LMS as my example of a robust 
estimator for two reasons: (1) it is reasonably easy to understand, and (2) I 
had a ready-made example in Excel which I wanted to use as evidence that Excel 
is not completely worthless for teaching stats.

I felt I was on firm ground, but you are now moving into deeper intellectual 
waters and I am treading water just like you.  I will give you my opinion, based 
on what I know right now, but I am not nearly as sure of myself as I was 
before.

In my attempt to answer you, I ran across the work of Doug Martin and Andreas 
Ruckstuhl.  I am cc'ing them on this post in the hope that they can correct 
any mistakes here and explain, in clear language, what MM in S Plus is doing.

First, I think it's pretty clear that LMS is dominated by LTS or MM because of 
the large SE of the LMS estimator.  I found an excellent post to the S Plus 
list from Doug Martin:

http://www.math.yorku.ca/Who/Faculty/Monette/S-news/0032.html

I recommend that you read this carefully. He makes it clear that LTS and MM 
are attempts to improve the efficiency of the robust estimator without 
compromising its robustness to outliers.

As for which form of robust regression to run, I do not believe there is a 
clear answer.  You can intuitively see that this is going to be an exercise in 
trading off efficiency for robustness, and an optimal estimator is going to be 
a function of the data or the particular problem at hand.

I am not an S Plus user, but it looks like S Plus will give you LTS and 
MM pretty easily.  The S Plus 2000 Release Notes, which are posted in many 
places on the web, e.g.,
http://www.uni-koeln.de/themen/Statistik/s/v51/readme_win.txt
say the following:

Robust LTS regression (ltsreg)

By default, ltsreg now uses 10% trimming. Previously it used 50%
trimming. This change was made in response to user feedback that the
default trimming of 50% was too extreme in most cases.

Robust MM regression (lmRobMM)

The Robust MM Regression dialog now has a default Resampling Method of
Auto, which uses the sample size and number of variables to
determine which resampling method to use. The command line function
lmRobMM() is unchanged.
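
For what it's worth, the idea behind LTS is easy to sketch: fit OLS to the h points with the smallest squared residuals, searching with random starts plus "concentration" steps (the strategy behind the FAST-LTS algorithm). Here is a toy numpy sketch, run on the 11-point dirty sample that appears elsewhere in this thread; it is an illustration of the idea, not S Plus's actual ltsreg:

```python
import numpy as np

rng = np.random.default_rng(1)

# The dirty 11-point sample used elsewhere in this thread
# (x = 1..11, the last y carries the outlier).
x = np.arange(1.0, 12.0)
y = np.array([-7.8, 16.0, 13.7, 4.5, 27.6, 23.7, 36.3, 28.2, 47.3, 42.3, 149.5])
X = np.column_stack([np.ones_like(x), x])

def fit(idx):
    """OLS fit on the subset of rows given by idx."""
    b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return b

def lts(h, n_starts=200):
    """Least trimmed squares: minimise the sum of the h smallest squared
    residuals, via random elemental starts plus concentration (C) steps."""
    best_obj, best_b = np.inf, None
    for _ in range(n_starts):
        b = fit(rng.choice(len(y), size=2, replace=False))  # line through 2 points
        for _ in range(20):                                 # C-steps
            r2 = (y - X @ b) ** 2
            b = fit(np.argsort(r2)[:h])                     # refit on best h points
        obj = np.sort((y - X @ b) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_b = obj, b
    return best_b

b_lts = lts(h=9)           # keep 9 of 11 points, roughly 20% trimming
b_ols = fit(np.arange(11)) # plain OLS on all points, for comparison
```

With ~20% trimming the outlier is simply never among the h retained points, so the LTS slope stays near the bulk of the data while OLS is dragged upward.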

I couldn't find a clear explanation of what exactly MM is doing.  I fear 
you're going to have to read the paper that started this:

Yohai, V., Stahel, W. A., and Zamar, R. H. (1991). A procedure for robust 
estimation and inference in linear regression, in Stahel, W. A. and Weisberg, 
S. W., Eds., Directions in Robust Statistics and Diagnostics, Part II. 
Springer-Verlag.

It looks like this might also be a good source:

Marazzi, A. (1993). Algorithms, Routines, and S Functions for Robust 
Statistics. Wadsworth & Brooks/Cole, Pacific Grove, CA.


After you figure out exactly what MM and LTS are doing, I would suggest trying 
all of them: LS, LMS, LTS, and MM.  Robust regression estimates are the result 
of complicated (read: lots of room for mistakes) algorithms.  You need to be 
wary.  I would also recommend that you think carefully about the process that 
generated the data.  Why are you worried about outliers?

I am sorry that this is not a clean, clear answer. Perhaps others can offer 
better, more grounded advice.  Burble burble . . . :-))

Humberto Barreto
[EMAIL PROTECTED]
(765) 361-6315






Robust Regression and Excel for Stats

2002-01-11 Thread Humberto Barreto

At 04:00 PM 1/9/02 -0700, Michael Joner wrote:
Hi,
Can anyone explain on the group or point me to an appropriate website or
book which discusses robust regression? Particularly I'm interested in
the differences between the MM and LTS robust regressions and a simple
linear regression. What advantages and disadvantages exist for each? I
understand that the simple linear regression has more assumptions on the
incoming data, but I don't know exactly what needs to be assumed when
doing a robust regression.
A book:
Rousseeuw, P.J., and Leroy, A.M. (1987), Robust Regression and Outlier
Detection, New York: John Wiley
A web site:
http://win-www.uia.ac.be/u/statis/
A web site for LMS (described below):
http://www.wabash.edu/econexcel/LMSOrigin/Home.htm
A [simple, nontechnical] explanation:
LS (aka OLS, choose coefficients to min SSR) has a low "breakdown
point," which means that a single outlier can severely tilt the fitted line.
Say you have y = beta0 + beta1*x + epsilon where epsilon ~ N(0,sigma),
but the last data point is y + outlierfactor. Say beta0=1, beta1=5,
sigma=10, and outlierfactor=100, with the xs running from 1 to 11 by 1.
Here's a single sample of dirty data:

 X      Y
 1    -7.8
 2    16.0
 3    13.7
 4     4.5
 5    27.6
 6    23.7
 7    36.3
 8    28.2
 9    47.3
10    42.3
11   149.5
The LS fitted line is: Predicted y = -22.55 + 9.54x
The LMS fitted line is: Predicted y = 8.9 + 2.9x
This shows pretty clearly that a single outlier can throw the LS fit way off.
To deal with this, folks (Rousseeuw should get a lot of credit here) have
invented Robust (or resistant) Regression techniques. Basically,
this means that a different objective function than min SSR is used to
choose the coefficients. The objective function is chosen so as not to be
influenced by outliers. Lines fitted with such objective functions are
said to be robust (or resistant) to outliers (which LS is clearly
not). Consider LMS, least median of squares: instead of min
SSR, the coefficients are chosen to min the MEDIAN of the squared
residuals.
The sample data above shows how the LMS fit basically ignores the
outlier. It's just like with a list of numbers: for 1, 2, 3, the median
and the average are both 2. For 1, 2, 99? The median is still 2, but the
average is very high. When you have outliers, LMS works well because it
ignores how far away each data point is.
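
The two fits above can be reproduced numerically. The sketch below (numpy; LMS found by a brute-force grid search, which is only practical for toy problems like this) recovers the LS line and a similarly flat robust LMS line:

```python
import numpy as np

# The dirty sample from the post: x = 1..11, y11 carries the outlier.
x = np.arange(1.0, 12.0)
y = np.array([-7.8, 16.0, 13.7, 4.5, 27.6, 23.7, 36.3, 28.2, 47.3, 42.3, 149.5])

# OLS fit.
X = np.column_stack([np.ones_like(x), x])
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# LMS by brute force: minimise the MEDIAN of the squared residuals
# over a grid of (intercept, slope) candidates.
b0_grid = np.linspace(-40.0, 40.0, 321)   # intercept candidates, step 0.25
b1_grid = np.linspace(-2.0, 12.0, 281)    # slope candidates, step 0.05
preds = b0_grid[:, None, None] + b1_grid[None, :, None] * x[None, None, :]
med = np.median((y - preds) ** 2, axis=2)
i, j = np.unravel_index(np.argmin(med), med.shape)
b_lms = (b0_grid[i], b1_grid[j])
```

The LS slope comes out near 9.5, pulled up by the single outlier, while the grid-search LMS slope stays near the trend of the other ten points.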
"Oh, so I'll always use the LMS. It's a silver bullet!"
Not so fast. The LMS.xls workbook shows a Monte Carlo simulation
comparing LS and LMS on clean data. Here are the results:

CLEAN DATA (NO OUTLIERS)
            LS b1     LMS b1     Population Parameters
Average     5.008      4.997     beta1 = 5
SD          0.9503     1.8759    OutlierFactor = 100
Max         9.175     13.415
Min         1.447     -3.692
LS beats LMS because both sampling distributions are centered on the
parameter value, but the LS histogram is much more spiked.
LMS ignores distance in favor of "above/below the middle
observation," which is good when there are outliers, but LMS is
ignoring a lot of information -- which is bad if the data are
clean.
But if you compare these two estimators on dirty data, now LMS wins
because LS's bias is killing it. Look at the table below: not once
in 10,000 samples did LS ever get a sample slope of 5.
DIRTY DATA (Y11 is an outlier)
            LS b1     LMS b1     Population Parameters
Average     9.539      5.009     beta1 = 5
SD          0.9600     2.0780    OutlierFactor = 100
Max        12.894     18.021
Min         6.071     -3.736
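
This Monte Carlo experiment can be replicated in miniature with numpy (200 samples instead of 10,000, and LMS approximated by a coarse grid search, so the numbers will only roughly match the tables above):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(1.0, 12.0)                 # x = 1..11
X = np.column_stack([np.ones_like(x), x])

# Coarse candidate grid for the LMS search (step 0.5 in b0, 0.1 in b1).
b0_grid = np.linspace(-30.0, 40.0, 141)
b1_grid = np.linspace(-3.0, 13.0, 161)
preds = b0_grid[:, None, None] + b1_grid[None, :, None] * x

def ls_slope(y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[1]

def lms_slope(y):
    med = np.median((y - preds) ** 2, axis=2)   # median sq. residual per line
    return b1_grid[np.unravel_index(np.argmin(med), med.shape)[1]]

reps = 200
ls_clean, lms_clean, ls_dirty, lms_dirty = [], [], [], []
for _ in range(reps):
    y = 1.0 + 5.0 * x + rng.normal(0.0, 10.0, size=11)   # beta0=1, beta1=5
    ls_clean.append(ls_slope(y)); lms_clean.append(lms_slope(y))
    y[-1] += 100.0                                       # outlierfactor on y11
    ls_dirty.append(ls_slope(y)); lms_dirty.append(lms_slope(y))

ls_clean, lms_clean = np.array(ls_clean), np.array(lms_clean)
ls_dirty, lms_dirty = np.array(ls_dirty), np.array(lms_dirty)
```

As in the tables: on clean data both slope distributions center near 5 with LMS much more spread out; on dirty data the LS average is biased upward while LMS stays near 5.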

"So, when do I use LS and when LMS or LTS or any other robust
regression? I mean, 'LS if clean; LMS if dirty' is a dumb
rule because how will I know if the data are clean or dirty?"
Excellent point. There's no algorithmic way to decide. You need to know
the process that generated the data. You need judgment. But
at least now you know a little more about robust regression than you did
before you started reading this. :-)

As for Excel for teaching statistics and quantitative analysis, I would
like to respond to Ken K, who said:
> Whenever I see statistics training using Excel it immediately makes
> me suspect that people who don't understand/use statistics are
> making the software decision.
Mr. K, you are painting with an awfully broad brush. I think there
are very good reasons for using Excel to teach introductory statistics. I
team-teach stats and econometrics with my colleague, Frank Howland.
We place heavy emphasis on concrete examples and Monte Carlo simulation
with Excel, and I think we deliver very good courses.
I do not use Excel's RAND() or VB's Rnd when doing Monte Carlos. I am
aware of many deficiencies in Excel, and I grant there are mistakes I am
not aware of, but consider a short list of Excel's benefits:
1) Student familiarity
2) Installed base
3) Data import features (including web links)
4) Ability to see formulas and how numbers are being calculated (I did
not understand the person who charged that Excel lacked an "audit
trail" at all -- what is that all about?)
5) Buttons and other controls to tailor the environment for the student
6) Visual Basic for Monte Carlo and other advanced programming
These are the primary reasons why I use Excel to teach stats and
econometrics. I

Re: Robust Regression and Excel for Stats

2002-01-11 Thread Shareef Siddeek


 Hi Humberto,
You have given an excellent simplified account of the usefulness of
robust regression and followed it with wholehearted support for Excel use
in statistics. It looks like you belong to a delta (in the mathematical
sense) group. Anyway, what do you think about the credibility of Excel
simulation (using VBA) results in a research environment? Will you go for
it? Cheers.
Siddeek


Humberto Barreto wrote:
[snip]

Re: Robust Regression and Excel for Stats

2002-01-11 Thread Michael Joner

Thanks for all the information.  Do you know anything about the other
variations of robust regression?  Does it make a big difference if I use
an MM regression, or LTS, or LMS?

Mike

On 11 Jan 2002 11:07:59 -0800 [EMAIL PROTECTED] (Humberto Barreto)
wrote:

 A book:
 Rousseeuw, P.J., and Leroy, A.M. (1987), Robust Regression and Outlier 
 Detection, New York: John Wiley
 
 A web site:
 http://win-www.uia.ac.be/u/statis/
 
 A web site for LMS (described below):
 http://www.wabash.edu/econexcel/LMSOrigin/Home.htm
 
 A [simple, nontechnical] explanation:
 LS (aka OLS, choose coefficients to min SSR) has a low "breakdown
point," which means that a single outlier can severely tilt the fitted line.





help: weighted robust regression

2000-11-14 Thread Patrick Agin


Hi,

Does someone know how to include weights in the S-Plus rdl1.s algorithm
(the robust regression algorithm developed by Hubert & Rousseeuw)? Of
course, the algorithm already includes a weighting scheme (based on
distances of the x points w.r.t. a robust center of an ellipsoid), but I
want, before entering the procedure, to put more weight on some
x-points and less on some others. Does it make sense? If so, how can we
do that?
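
[I can't speak to the internals of rdl1.s, but mechanically the idea does make sense: in a generic M-estimation IRLS loop, user-supplied prior case weights can simply multiply the robustness weights at every iteration. A hypothetical numpy sketch of that idea, using a Huber-type estimator (not the Hubert-Rousseeuw algorithm) and made-up data, where the last point is distrusted a priori:]

```python
import numpy as np

def weighted_huber(x, y, prior_w, c=1.345, iters=50):
    """Huber-type IRLS in which user-supplied case weights (prior_w)
    multiply the robustness weights at every iteration."""
    X = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(prior_w)
    b, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)  # weighted LS start
    r = y - X @ b
    s = np.median(np.abs(r - np.median(r))) / 0.6745              # robust scale (MAD)
    for _ in range(iters):
        r = y - X @ b
        rob_w = np.minimum(1.0, c * s / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(prior_w * rob_w)    # prior weights times robustness weights
        b, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return b

# Illustration on the 11-point dirty sample from earlier in this digest:
# putting a tiny prior weight on the contaminated point recovers roughly
# the fit you would get from the clean points alone.
x = np.arange(1.0, 12.0)
y = np.array([-7.8, 16.0, 13.7, 4.5, 27.6, 23.7, 36.3, 28.2, 47.3, 42.3, 149.5])
prior_w = np.ones(11)
prior_w[-1] = 0.01                       # distrust the last observation a priori
b_w = weighted_huber(x, y, prior_w)
```

Whether this is statistically sensible for your problem (rather than merely computable) is exactly the kind of question the original authors would be better placed to answer.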

I considered using the lmRobMM function (the algorithm developed by
Yohai et al., also available in S-Plus) because it includes a "weights"
argument, but my problem includes regressors that are continuous and
others that are binary, and I don't know if the algorithm can handle such
categorical variables. Even if it can, the default number of
random subsamples drawn (and needed by the algorithm) is 4.6*2^ncol(x);
I have 10 continuous variables + 1 categorical with 20 levels (which
recoded gives 20 dummy vars), so the total is 30. Of course, I could
change this default number and set a more "reasonable" one, but the
choice would inevitably be so small relative to the default that I
seriously doubt the validity of the result anyway.

Can someone help?

The exact references for the above cited papers are:

   * Robust regression with both continuous and binary regressors, Mia
 Hubert and Peter J. Rousseeuw.
 http://win-www.uia.ac.be/u/statis/publicat/#j1990
   * Yohai, V., Stahel, W. A., and Zamar, R. H. (1991). A procedure for
 robust estimation and inference in linear regression, in Stahel, W.
 A. and Weisberg, S. W., Eds.,  Directions in robust statistics and
 diagnostics, Part II.  Springer-Verlag.

Patrick



