Re: [Rd] (PR#8877) predict.lm does not have a weights argument for

2006-05-26 Thread Peter Dalgaard
Prof Brian Ripley [EMAIL PROTECTED] writes:

  (e) Inverse probability weights: Knowing that part of the population
  is undersampled and wanting results that are compatible with what you
  would have gotten in a balanced sample. Prototypically: You sample X,
  taking only a third of those with X  c; find population mean of X,
  (or univariate regression on some other variable, which is only
  recorded in the subsample).
 
 I would call this an example of case weights (you are just weighting
 cases and saying `I have 1/p like this', and in rlm there is a
 difference between (a) and (b) and you would want to use
 wt.method=case for (e)).

No it's not quite the same. One is I have 3 of these, the other is
I have looked at one case, but it comes from a population that I know
is undersampled by a factor of 3. Standard error of estimates will be
considerably different.

-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] (PR#8877) predict.lm does not have a weights argument for

2006-05-24 Thread Peter Dalgaard
[EMAIL PROTECTED] writes:

 (a) case weights:  w_i = 3 means `I have three observations like (y, x)'
 
 (b) inverse-variance weights, most often an indication that w_i = 1/3 
 means that y_i is actually the average of 3 observations at x_i.
 
 (c) multiple imputation, where a case with missing values in x is split 
 into say 5 parts, with case weights less than and summing to one.
 
 (d) Heteroscedasticity, where the model is rather
 
  y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2(x))
 
 And there may well be other scenarios, but those are the most common (in 
 decreasing order) in my experience.

I'd have (d) higher on the list, but never mind. There's also

(e) Inverse probability weights: Knowing that part of the population
is undersampled and wanting results that are compatible with what you
would have gotten in a balanced sample. Prototypically: You sample X,
taking only a third of those with X  c; find population mean of X,
(or univariate regression on some other variable, which is only
recorded in the subsample).

(R-bugs stripped from recipients since this doesn't really have
anything to do with the purported bug.)

   
-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] (PR#8877) predict.lm does not have a weights argument for

2006-05-24 Thread Prof Brian Ripley
On Wed, 24 May 2006, Peter Dalgaard wrote:

 [EMAIL PROTECTED] writes:

 (a) case weights:  w_i = 3 means `I have three observations like (y, x)'

 (b) inverse-variance weights, most often an indication that w_i = 1/3
 means that y_i is actually the average of 3 observations at x_i.

 (c) multiple imputation, where a case with missing values in x is split
 into say 5 parts, with case weights less than and summing to one.

 (d) Heteroscedasticity, where the model is rather

  y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2(x))

 And there may well be other scenarios, but those are the most common (in
 decreasing order) in my experience.

 I'd have (d) higher on the list, but never mind. There's also

I find that if you detect heteroscedasticity, then one of the following 
applies:

- a transformation of y would be beneficial

- a non-normal model, e.g. a Poisson regression, is more appropriate

- the error variance really depends on y or Ey not x, as in most
   measurement-error scenarios (and the example in ?nls and the example
   in the addendum to the bug report).

- in analytical chemistry as in the example on the addendum to the bug
   report, there are errors in both y and x to consider, and a functional
   relationship model is better.

So I very rarely actually get as far as predicting from a heteroscedastic 
regression model.

 (e) Inverse probability weights: Knowing that part of the population
 is undersampled and wanting results that are compatible with what you
 would have gotten in a balanced sample. Prototypically: You sample X,
 taking only a third of those with X  c; find population mean of X,
 (or univariate regression on some other variable, which is only
 recorded in the subsample).

I would call this an example of case weights (you are just weighting cases 
and saying `I have 1/p like this', and in rlm there is a difference 
between (a) and (b) and you would want to use wt.method=case for (e)).

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] (PR#8877) predict.lm does not have a weights argument for newdata

2006-05-24 Thread jranke
Prof. Ripley,

thank you for being more explicit now! Thank you also for the fix to
the wish that you derived from my bug report. Here comes the rationale
for my updated patch, which I *humbly* propose as a more general
solution to the problem.

I have spent several days now on this problem, but I am no statistician,
so please excuse my ignorance of any notational or other conventions
that I might disregard below. I tried hard to do it right.

Judging from the documentation to lm, I find that lm has only *one*
perspective on weights which I propose is reasonable and sufficient. It
minimises  sum(w*e^2)), more clearly expressed as 

sum(w_i * e_i^2)

I infer that the point is to construct the weights such that 
the errors

\epsilon = sqrt(w_k) * \epsilon_k 

are from a normal distribution N(0, \sigma^2), where index k covers all
observations, past as well as future ones, the model is

y = x\beta + \epsilon_k

with 

\epsilon_k \sim N(0, \sigma_k^2)

and

\sigma_k^2 = \sigma^2 / w_k

This is my view on this, it might be naive, it might be wrong, but
if so, I can't see my mistake.

An estimator for \sigma^2 is then

sum(w_i * e_i^2) / df

which is called res.var in the R code to predict.lm. An estimator
for \sigma_k^2, the variance for observation k, is therefore

res.var / w_k

which is what my proposed patch, which can now be found in an updated
form under

 
http://www.uft.uni-bremen.de/chemie/ranke/r-patches/predict.lm.patch.r38195

implements. I removed the first version called lm.predict.patch because
it did not correctly deal with prediction intervals for old data, sorry
for the inconvenience (Not found). Meanwhile you disabled prediction
intervals for old data, which the above patch reverts (no offense,
please, I just think if predict.lm does confidence intervals for the
regression line at the location of old data points, it might as well do
illustrative prediction intervals at these locations. At least for the
old data, the weights are known from the model object.)

I tested my solution with the script

http://www.uft.uni-bremen.de/chemie/ranke/r-patches/test.predict.lm.R

* Prof Brian Ripley [EMAIL PROTECTED] [060524 07:50]:
 I am more than 'a little disappointed' that you expect a detailed 
 explanation of the problems with your 'bug' report, especially as you did 
 not provide any explanation yourself as to your reasoning (nor did you 
 provide any credentials nor references).

I am sorry for this, and I worked hard to supply the information in the
followups including this one.
 
 Note that
 
 1) Your report did not make clear that this was only relevant to 
 prediction intervals, which are not commonly used.

I am not really sure if confidence intervals couldn't be improved
in scenario (d). 
 
 2) Only in some rather special circumstances do weights enter into 
 prediction intervals, and definitely not necessarily the weights used for 
 fitting.  

Yes, under these circumstances R would then need weights from the user
in order to construct prediction intervals. 

 Indeed, it seems that to label the variances that do enter as 
 inverse weights would be rather misleading.

Possibly. I assume it can be correctly done and documented (see my patch
for a proposal).

 3) In a later message you referenced Brown's book, which is dealing with a 
 different model.
 
 The model fitted by lm is
 
   y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2)
 
 (Row vector x, column vector \beta.)

I assumed that the model in Brown's book is a special case of the model
fitted by lm, the only difference being that x is a row vector (1,
x_B), where x_B is Brown's random variable X, and \beta contains
\alpha_B and \beta_B.
 
 If the observations are from the model, OLS is appropriate, but weighting 
 is used in several scenarios, including:
 
 (a) case weights:  w_i = 3 means `I have three observations like (y, x)'

I am not sure what you mean by this. Do you mean you have three exactly
equal observations? In this case this could be regarded as a special
case of (b) and w_i = 1/3 would be used as input for lm.

What does the ' mean in (y, x)'?

 (b) inverse-variance weights, most often an indication that w_i = 1/3 
 means that y_i is actually the average of 3 observations at x_i.

Yes, and I take the reasoning for this to be as follows: An estimator
for the variance of the means of n observations is 1/n times the
variance of the single observations. Why shouldn't this apply to the
means of multiple future observations?

 (c) multiple imputation, where a case with missing values in x is split 
 into say 5 parts, with case weights less than and summing to one.

I don't understand this, but I suppose lm works in the same way for
weights w_i derived from such a procedure.
 
 (d) Heteroscedasticity, where the model is rather
 
 y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2(x))
 
 And there may well be other scenarios, 

Re: [Rd] (PR#8877) predict.lm does not have a weights argument for

2006-05-23 Thread ripley
I am more than 'a little disappointed' that you expect a detailed 
explanation of the problems with your 'bug' report, especially as you did 
not provide any explanation yourself as to your reasoning (nor did you 
provide any credentials nor references).

Note that

1) Your report did not make clear that this was only relevant to 
prediction intervals, which are not commonly used.

2) Only in some rather special circumstances do weights enter into 
prediction intervals, and definitely not necessarily the weights used for 
fitting.  Indeed, it seems that to label the variances that do enter as 
inverse weights would be rather misleading.

3) In a later message you referenced Brown's book, which is dealing with a 
different model.

The model fitted by lm is

y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2)

(Row vector x, column vector \beta.)

If the observations are from the model, OLS is appropriate, but weighting 
is used in several scenarios, including:

(a) case weights:  w_i = 3 means `I have three observations like (y, x)'

(b) inverse-variance weights, most often an indication that w_i = 1/3 
means that y_i is actually the average of 3 observations at x_i.

(c) multiple imputation, where a case with missing values in x is split 
into say 5 parts, with case weights less than and summing to one.

(d) Heteroscedasticity, where the model is rather

 y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2(x))

And there may well be other scenarios, but those are the most common (in 
decreasing order) in my experience.


Now, consider prediction intervals.  It would be perverse to consider 
these to be for other than a single future observation at x.  In scenarios 
(a) to (c), R's current behaviour is what is commonly accepted to be 
correct (and you provide no arguments otherwise). If a future observation 
has missing values, predict.lm would only be a starting point for multiple 
imputation.

Even if 'newdata' is not supplied, prediction intervals must apply to new 
observations, not the existing ones (or the formula used is wrong: perhaps 
to avoid your confusion they should not be allowed in that case).

Only in case (d), which is a different model, is it appropriate to supply 
error variances (not weights) for prediction intervals.  This is why I 
marked it for the wishlist.  Equally, one might want to specify
\sigma^2 for all future observations as being different from the model 
fitting, as the training data may include other components of variance in 
their error variances.


On Sat, 20 May 2006, [EMAIL PROTECTED] wrote:

 Dear R developers,

 I am a little disappointed that my bug report only made it to the
 wishlist, with the argument:

   Well, it does not say it has.
   Only relevant to prediction intervals.

 predict.lm does calculate prediction intervals for linear models from
 weighted regression, so they should be correct, right?

 As far as I can see they are bound to be wrong in almost all cases, if
 no weights for newdata can be given. So the point is that predict.lm
 needs such an argument in order to give correct prediction intervals for
 models from weighted linear regression.

 Also, it strikes me that in the absence of a newdata argument, the
 weights from the lm object need to be taken into account for
 constructing prediction intervals.

Where are the references and arguments?

 My updated proposal fixing both points as well as the help file can be found 
 at:

   http://www.uft.uni-bremen.de/chemie/ranke/r-patches/lm.predict.patch

Not found.

 and I wrote up a small demonstration of the problem and my proposed solution:

   http://www.uft.uni-bremen.de/chemie/ranke/r-patches/lm.predict.pdf

That example is not a valid use of WLS, as you have the weights depending 
on the data you are fitting.

 Kind regards,

 Johannes Ranke



-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel