Re: [R] Weighted least squares
As John noted, there are different kinds of weights, and different terminology:

* inverse-variance weights (accuracy weights)
* case weights (frequencies, counts)
* sampling weights (selection probability weights)

I'll add:

* inverse-variance weights, where var(y for an observation) = 1/weight
  (as opposed to just being inversely proportional to the weight)
* weights used as part of an algorithm (e.g. for robust estimation, or
  GLMs using iteratively reweighted least squares)

For linear regression, the type of weights doesn't affect the calculation of
the regression coefficients, but does affect inferences such as standard
errors for the regression coefficients, degrees of freedom for variance
estimates, etc. lm() inferences assume the first type. Other formulae are
appropriate for inferences with types 2-4, and combinations of types 1-4
require yet other formulae; this gets nontrivial. For the fifth type,
inferences need to be handled by the algorithm that is using weighted linear
regression.

Tim Hesterberg

John Fox wrote:
> I think that the problem is that the term "weights" has different meanings,
> which, although they are related, are not quite the same. The weights used
> by lm() are (inverse-)variance weights, reflecting the variances of the
> errors, with observations that have low-variance errors therefore being
> accorded greater weight in the resulting WLS regression. What you have are
> sometimes called case weights, and I'm unaware of a general way of handling
> them in R, although you could regenerate the unaggregated data. As you
> discovered, you get the same coefficients with case weights as with
> variance weights, but different standard errors. Finally, there are
> sampling weights, which are inversely proportional to the probability of
> selection; these are accommodated by the survey package. To complicate
> matters, this terminology isn't entirely standard.
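[Editorially, Tim's point that the fitted coefficients don't depend on which kind of weight you have - only the inferences do - is easy to check numerically. A minimal sketch in Python rather than R, using the closed-form weighted simple regression; wls_fit() and all data are made up for illustration, not anything from lm():]

```python
# Closed-form weighted simple linear regression, y = a + b*x.
# Rescaling all weights by a constant leaves the coefficients unchanged,
# which is why the *type* of weight only matters for inference.

def wls_fit(x, y, w):
    Sw   = sum(w)
    Swx  = sum(wi * xi for wi, xi in zip(w, x))
    Swy  = sum(wi * yi for wi, yi in zip(w, y))
    Swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = (Sw * Swxy - Swx * Swy) / (Sw * Swxx - Swx ** 2)
    a = (Swy - b * Swx) / Sw
    return a, b

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
w = [1.0, 3.0, 2.0, 1.0, 2.0]

a1, b1 = wls_fit(x, y, w)                      # weights as given
a2, b2 = wls_fit(x, y, [10 * wi for wi in w])  # all weights rescaled by 10
# (a1, b1) and (a2, b2) are identical up to rounding
```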
______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Weighted least squares
Thanks John,

That's just the explanation I was looking for. I had hoped that there would
be a built-in way of dealing with them in R, but obviously not.

Given that explanation, it still seems to me that the way R calculates n is
suboptimal, as demonstrated by my second example:

  summary(lm(y ~ x, data=df, weights=rep(c(0, 2), each=50)))
  summary(lm(y ~ x, data=df, weights=rep(c(0.01, 2), each=50)))

The weights are only very slightly different, but the estimates of residual
standard error are quite different (20 vs 14 in my run).

Hadley

On 5/8/07, John Fox <[EMAIL PROTECTED]> wrote:
> Dear Hadley,
>
> I think that the problem is that the term "weights" has different meanings,
> which, although they are related, are not quite the same. The weights used
> by lm() are (inverse-)variance weights, reflecting the variances of the
> errors, with observations that have low-variance errors therefore being
> accorded greater weight in the resulting WLS regression. What you have are
> sometimes called case weights, and I'm unaware of a general way of handling
> them in R, although you could regenerate the unaggregated data. As you
> discovered, you get the same coefficients with case weights as with
> variance weights, but different standard errors. Finally, there are
> sampling weights, which are inversely proportional to the probability of
> selection; these are accommodated by the survey package. To complicate
> matters, this terminology isn't entirely standard.
>
> I hope this helps,
>  John
>
> John Fox, Professor
> Department of Sociology
> McMaster University
> Hamilton, Ontario
> Canada L8S 4M4
> 905-525-9140 x23604
> http://socserv.mcmaster.ca/jfox
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
> > hadley wickham
> > Sent: Tuesday, May 08, 2007 5:09 AM
> > To: R Help
> > Subject: [R] Weighted least squares
> >
> > Dear all,
> >
> > I'm struggling with weighted least squares, where something that I had
> > assumed to be true appears not to be the case.
> > Take the following data set as an example:
> >
> >   df <- data.frame(x = runif(100, 0, 100))
> >   df$y <- df$x + 1 + rnorm(100, sd=15)
> >
> > I had expected that:
> >
> >   summary(lm(y ~ x, data=df, weights=rep(2, 100)))
> >   summary(lm(y ~ x, data=rbind(df,df)))
> >
> > would be equivalent, but they are not. I suspect the difference is in
> > how the degrees of freedom are calculated - I had expected it to be
> > sum(weights), but it seems to be sum(weights > 0). This seems
> > unintuitive to me:
> >
> >   summary(lm(y ~ x, data=df, weights=rep(c(0, 2), each=50)))
> >   summary(lm(y ~ x, data=df, weights=rep(c(0.01, 2), each=50)))
> >
> > What am I missing? And what is the usual way to do a linear regression
> > when you have aggregated data?
> >
> > Thanks,
> >
> > Hadley
Re: [R] Weighted least squares
On 5/9/07, Adaikalavan Ramasamy <[EMAIL PROTECTED]> wrote:
> http://en.wikipedia.org/wiki/Weighted_least_squares gives a formulaic
> description of what you have said.

Except it doesn't describe what I think is important in my case - how do
you calculate the degrees of freedom/n for weighted linear regression?

> I believe the original poster has converted something like this
>
>   y x
>   0 1.1
>   0 2.2
>   0 2.2
>   0 2.2
>   1 3.3
>   1 3.3
>   2 4.4
>   ...
>
> into something like the following
>
>   y x   freq
>   0 1.1  1
>   0 2.2  3
>   1 3.3  2
>   2 4.4  1
>   ...

Exactly! Thanks for providing that example.

> Now, the variance of means of each row in the table above is ZERO because
> the individual elements that comprise each row are identical. Therefore
> your method of using inverse-variance will not work here.
>
> Is it valid then to use lm( y ~ x, weights=freq )?
>
> Regards, Adai
>
> S Ellison wrote:
> > [snip]
Re: [R] Weighted least squares
Dear Adai,

> -----Original Message-----
> From: Adaikalavan Ramasamy [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, May 08, 2007 8:38 PM
> To: S Ellison
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; R-help@stat.math.ethz.ch
> Subject: Re: [R] Weighted least squares
>
> http://en.wikipedia.org/wiki/Weighted_least_squares gives a formulaic
> description of what you have said.
>
> I believe the original poster has converted something like this
>
>   y x
>   0 1.1
>   0 2.2
>   0 2.2
>   0 2.2
>   1 3.3
>   1 3.3
>   2 4.4
>   ...
>
> into something like the following
>
>   y x   freq
>   0 1.1  1
>   0 2.2  3
>   1 3.3  2
>   2 4.4  1
>   ...
>
> Now, the variance of means of each row in the table above is ZERO because
> the individual elements that comprise each row are identical. Therefore
> your method of using inverse-variance will not work here.
>
> Is it valid then to use lm( y ~ x, weights=freq )?

No, because the weights argument gives inverse-variance weights, not case
weights.

Regards,
 John

> Regards, Adai
>
> S Ellison wrote:
> > [snip]
Re: [R] Weighted least squares
Dear Hadley,

> -----Original Message-----
> From: hadley wickham [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 09, 2007 2:21 AM
> To: John Fox
> Cc: R-help@stat.math.ethz.ch
> Subject: Re: [R] Weighted least squares
>
> Thanks John,
>
> That's just the explanation I was looking for. I had hoped that there
> would be a built-in way of dealing with them in R, but obviously not.
>
> Given that explanation, it still seems to me that the way R calculates n
> is suboptimal, as demonstrated by my second example:
>
>   summary(lm(y ~ x, data=df, weights=rep(c(0, 2), each=50)))
>   summary(lm(y ~ x, data=df, weights=rep(c(0.01, 2), each=50)))
>
> The weights are only very slightly different, but the estimates of
> residual standard error are quite different (20 vs 14 in my run).

Observations with 0 weight are literally excluded, while those with very
small weight (relative to the others) don't contribute much to the fit.
Consequently, you get very similar coefficients but different numbers of
observations.

I hope this helps,
 John

> Hadley
>
> On 5/8/07, John Fox <[EMAIL PROTECTED]> wrote:
> > [snip]
Re: [R] Weighted least squares
Adaikalavan Ramasamy <[EMAIL PROTECTED]> 09/05/2007 01:37:31:
> ... the variance of means of each row in table above is ZERO because the
> individual elements that comprise each row are identical. ...
> Is it valid then to use lm( y ~ x, weights=freq )?

Ermmm... probably not, because if that happened I'd strongly suspect we'd
substantially violated some assumptions.

We are given a number of groups of identical observations. But we are
seeking a solution to a problem that posits an underlying variance. If it's
not visible within the groups, where is it? Has it disappeared in numerical
precision, or is something else going on?

If we did this regression, we would see identical residuals for all members
of a group. That would imply that the variance arises entirely from
between-group effects and not at all from within-group effects. To me, that
would in turn imply that the number of observations in the group is
irrelevant; we should be using unweighted regression on the group 'means' in
this situation - if we're using least squares at all.

If we genuinely have independent observations and by some coincidence they
have the same value within available precision, we might be justified in
saying we can't see the variance within groups, but we can estimate it from
the residual variance. That would be equivalent to assuming constant
variance, and my n/(s^2) reduces to n except for a scaling factor. Using n
alone would then be consistent with one's assumptions, I think. On the kind
of data I get, though (mostly chemical measurement with continuous scales),
I'd have considerable difficulty justifying that assumption. And if I didn't
have that kind of data (or a reasonable approximation thereto), I'd be
wondering whether I should be using linear regression at all.

S
[R] Weighted least squares
Dear all,

I'm struggling with weighted least squares, where something that I had
assumed to be true appears not to be the case.
Take the following data set as an example:

  df <- data.frame(x = runif(100, 0, 100))
  df$y <- df$x + 1 + rnorm(100, sd=15)

I had expected that:

  summary(lm(y ~ x, data=df, weights=rep(2, 100)))
  summary(lm(y ~ x, data=rbind(df,df)))

would be equivalent, but they are not. I suspect the difference is in how
the degrees of freedom are calculated - I had expected it to be
sum(weights), but it seems to be sum(weights > 0). This seems unintuitive
to me:

  summary(lm(y ~ x, data=df, weights=rep(c(0, 2), each=50)))
  summary(lm(y ~ x, data=df, weights=rep(c(0.01, 2), each=50)))

What am I missing? And what is the usual way to do a linear regression when
you have aggregated data?

Thanks,

Hadley
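[Editorially, the discrepancy Hadley observes can be sketched outside R. In the following Python sketch, wls_fit() and wls_sigma() are made-up helpers that mimic the WLS formulas, not R code: constant weights of 2 and physically doubled rows give identical coefficients, but the residual degrees of freedom are (rows - 2) in both cases, so the residual standard errors differ.]

```python
# Weighted simple linear regression and its residual sd, computed the way
# lm() does: sigma^2 = sum(w * r^2) / (n_rows - 2).  Note the df uses the
# row count, not sum(weights).

def wls_fit(x, y, w):
    Sw   = sum(w)
    Swx  = sum(wi * xi for wi, xi in zip(w, x))
    Swy  = sum(wi * yi for wi, yi in zip(w, y))
    Swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = (Sw * Swxy - Swx * Swy) / (Sw * Swxx - Swx ** 2)
    a = (Swy - b * Swx) / Sw
    return a, b

def wls_sigma(x, y, w, a, b):
    rss = sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    return (rss / (len(x) - 2)) ** 0.5     # df = rows - 2

x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.3, 2.8, 4.2]

a1, b1 = wls_fit(x, y, [2.0] * 4)          # 4 rows, each with weight 2
s1 = wls_sigma(x, y, [2.0] * 4, a1, b1)

x2, y2 = x * 2, y * 2                      # 8 rows, each with weight 1
a2, b2 = wls_fit(x2, y2, [1.0] * 8)
s2 = wls_sigma(x2, y2, [1.0] * 8, a2, b2)

# b1 == b2, but s1/s2 == sqrt(6/2): same fit, different residual df.
```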
Re: [R] Weighted least squares
See below.

hadley wickham wrote:
> Dear all,
>
> I'm struggling with weighted least squares, where something that I had
> assumed to be true appears not to be the case.
> Take the following data set as an example:
>
>   df <- data.frame(x = runif(100, 0, 100))
>   df$y <- df$x + 1 + rnorm(100, sd=15)
>
> I had expected that:
>
>   summary(lm(y ~ x, data=df, weights=rep(2, 100)))
>   summary(lm(y ~ x, data=rbind(df,df)))

You assign weights to different points according to some external quality
or reliability measure, not the number of times the data point was
measured. Look at the estimates and standard errors of the two models
below:

  coefficients( summary( f.w <- lm(y ~ x, data=df, weights=rep(2, 100)) ) )
              Estimate Std. Error   t value     Pr(>|t|)
  (Intercept) 1.940765 3.30348066  0.587491 5.582252e-01
  x           0.982610 0.05893262 16.673448 2.264258e-30

  coefficients( summary( f.u <- lm(y ~ x, data=rbind(df,df)) ) )
              Estimate Std. Error    t value     Pr(>|t|)
  (Intercept) 1.940765 2.32408609  0.8350659 4.046871e-01
  x           0.982610 0.04146066 23.6998165 1.012067e-59

You can see that they have the same coefficient estimates but the second
one has smaller variances. The repeated values artificially deflate the
variance and thus inflate the precision. This is why you cannot treat
replicate data as independent observations.

> would be equivalent, but they are not. I suspect the difference is in how
> the degrees of freedom are calculated - I had expected it to be
> sum(weights), but it seems to be sum(weights > 0). This seems unintuitive
> to me:
>
>   summary(lm(y ~ x, data=df, weights=rep(c(0, 2), each=50)))
>   summary(lm(y ~ x, data=df, weights=rep(c(0.01, 2), each=50)))
>
> What am I missing? And what is the usual way to do a linear regression
> when you have aggregated data?

It would be best to use the individual data points instead of aggregated
data, as that allows you to estimate the within-group variation as well. If
you had individual data points, you could try something as follows. Please
check the code, as I am no expert in the area of repeated measures.

  x  <- runif(100, 0, 100)
  y1 <- x + rnorm(100, mean=1, sd=15)
  y2 <- y1 + rnorm(100, sd=5)
  df <- data.frame( y=c(y1, y2), x=c(x, x),
                    subject=factor(rep( paste("p", 1:100, sep=""), 2 )) )

  library(nlme)
  summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )

Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related
material for more information. Hope this helps.

> Thanks,
>
> Hadley

Regards, Adai
Re: [R] Weighted least squares
On 5/8/07, Adaikalavan Ramasamy <[EMAIL PROTECTED]> wrote:
> See below.
>
> hadley wickham wrote:
> > Dear all,
> >
> > I'm struggling with weighted least squares, where something that I had
> > assumed to be true appears not to be the case.
> > Take the following data set as an example:
> >
> >   df <- data.frame(x = runif(100, 0, 100))
> >   df$y <- df$x + 1 + rnorm(100, sd=15)
> >
> > I had expected that:
> >
> >   summary(lm(y ~ x, data=df, weights=rep(2, 100)))
> >   summary(lm(y ~ x, data=rbind(df,df)))
>
> You assign weights to different points according to some external quality
> or reliability measure, not the number of times the data point was
> measured.

That is one type of weighting - but what if I have already aggregated data?
That is a perfectly valid type of weighting too.

> Look at the estimates and standard errors of the two models below:
>
>   coefficients( summary( f.w <- lm(y ~ x, data=df, weights=rep(2, 100)) ) )
>               Estimate Std. Error   t value     Pr(>|t|)
>   (Intercept) 1.940765 3.30348066  0.587491 5.582252e-01
>   x           0.982610 0.05893262 16.673448 2.264258e-30
>
>   coefficients( summary( f.u <- lm(y ~ x, data=rbind(df,df)) ) )
>               Estimate Std. Error    t value     Pr(>|t|)
>   (Intercept) 1.940765 2.32408609  0.8350659 4.046871e-01
>   x           0.982610 0.04146066 23.6998165 1.012067e-59
>
> You can see that they have the same coefficient estimates but the second
> one has smaller variances. The repeated values artificially deflate the
> variance and thus inflate the precision. This is why you cannot treat
> replicate data as independent observations.

Hardly artificially - I have repeated observations.

> > would be equivalent, but they are not. I suspect the difference is in
> > how the degrees of freedom are calculated - I had expected it to be
> > sum(weights), but it seems to be sum(weights > 0). This seems
> > unintuitive to me:
> >
> >   summary(lm(y ~ x, data=df, weights=rep(c(0, 2), each=50)))
> >   summary(lm(y ~ x, data=df, weights=rep(c(0.01, 2), each=50)))
> >
> > What am I missing? And what is the usual way to do a linear regression
> > when you have aggregated data?
>
> It would be best to use the individual data points instead of aggregated
> data, as that allows you to estimate the within-group variation as well.

There is no within-group variation - these are observations that occur with
the same values many times in the dataset, so they have been aggregated into
a contingency-table-like format.

> If you had individual data points, you could try something as follows.
> Please check the code, as I am no expert in the area of repeated measures.
>
>   x  <- runif(100, 0, 100)
>   y1 <- x + rnorm(100, mean=1, sd=15)
>   y2 <- y1 + rnorm(100, sd=5)
>   df <- data.frame( y=c(y1, y2), x=c(x, x),
>                     subject=factor(rep( paste("p", 1:100, sep=""), 2 )) )
>
>   library(nlme)
>   summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )
>
> Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related
> material for more information. Hope this helps.

I'm not interested in a mixed model, and I don't have individual data
points.

Hadley
Re: [R] Weighted least squares
Sorry, you did not explain in the original post that your weights correspond
to frequencies. I assumed they were repeated measurements with within-group
variation. I was merely responding to your query about why the following
differed:

  summary(lm(y ~ x, data=df, weights=rep(2, 100)))
  summary(lm(y ~ x, data=rbind(df,df)))

Let me also clarify my statement about "artificial". If one treats repeated
observations as independent, then one obtains estimates with inflated
precision. I was not calling your data artificial in any way.

Using frequency as weights may be valid. Your data points appear to arise
from a discrete distribution, though, so I am not entirely sure that you can
use the linear model, which assumes the errors are normally distributed.

Regards, Adai

hadley wickham wrote:
> [snip]
Re: [R] Weighted least squares
Dear Hadley,

I think that the problem is that the term "weights" has different meanings,
which, although they are related, are not quite the same.

The weights used by lm() are (inverse-)variance weights, reflecting the
variances of the errors, with observations that have low-variance errors
therefore being accorded greater weight in the resulting WLS regression.
What you have are sometimes called case weights, and I'm unaware of a
general way of handling them in R, although you could regenerate the
unaggregated data. As you discovered, you get the same coefficients with
case weights as with variance weights, but different standard errors.
Finally, there are sampling weights, which are inversely proportional to the
probability of selection; these are accommodated by the survey package.

To complicate matters, this terminology isn't entirely standard.

I hope this helps,
 John

John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140 x23604
http://socserv.mcmaster.ca/jfox

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
> hadley wickham
> Sent: Tuesday, May 08, 2007 5:09 AM
> To: R Help
> Subject: [R] Weighted least squares
>
> Dear all,
>
> I'm struggling with weighted least squares, where something that I had
> assumed to be true appears not to be the case.
> Take the following data set as an example:
>
>   df <- data.frame(x = runif(100, 0, 100))
>   df$y <- df$x + 1 + rnorm(100, sd=15)
>
> I had expected that:
>
>   summary(lm(y ~ x, data=df, weights=rep(2, 100)))
>   summary(lm(y ~ x, data=rbind(df,df)))
>
> would be equivalent, but they are not. I suspect the difference is in how
> the degrees of freedom are calculated - I had expected it to be
> sum(weights), but it seems to be sum(weights > 0). This seems unintuitive
> to me:
>
>   summary(lm(y ~ x, data=df, weights=rep(c(0, 2), each=50)))
>   summary(lm(y ~ x, data=df, weights=rep(c(0.01, 2), each=50)))
>
> What am I missing? And what is the usual way to do a linear regression
> when you have aggregated data?
>
> Thanks,
>
> Hadley
Re: [R] Weighted least squares
Doubling the length of the data doubles the apparent number of observations.
You would expect the standard error to reduce by sqrt(2) (which it just
about does, though I'm not clear on why it's not exact here).

Weights are not as simple as they look. You have given all your data the
same weight, so the answer is independent of the weights(!). Try again with
weights=rep(4, 100) etc. Equal weights simply cancel out in the lm process.
In fact, some linear regression algorithms rescale all weights to sum to 1;
in others, weights are scaled to average 1; done 'naturally', the weights
simply appear in two places which cancel out in the final covariance matrix
calculation (e.g. in the weighted 'residual sd' and in the hessian for the
chi-squared function, if I remember correctly).

Bottom line - equal weights make no difference in lm, so choose what you
like. 1 is a good number, though.

Steve E

hadley wickham <[EMAIL PROTECTED]> 08/05/2007 10:08:34:
> Dear all,
>
> I'm struggling with weighted least squares, where something that I had
> assumed to be true appears not to be the case.
> Take the following data set as an example:
Re: [R] Weighted least squares
Hadley,

You asked
> .. what is the usual way to do a linear regression when you have
> aggregated data?

Least squares generally uses inverse variance weighting. For aggregated data
fitted as mean values, you just need the variances for the _means_.

So if you have individual means x_i and sd's s_i that arise from aggregated
data with n_i observations in group i, the natural weighting is by inverse
squared standard error of the mean. The appropriate weight for x_i would
then be n_i/(s_i^2). In R, that's n/(s^2), as n and s would be vectors with
the same length as x.

If all the groups had the same variance, or nearly so, s is a scalar; if
they have the same number of observations, n is a scalar. Of course, if they
have the same variance and the same number of observations, they all have
the same weight and you needn't weight them at all: see previous posting!

Steve E
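[Editorially, Steve's n_i/(s_i^2) weighting in numbers - a short Python sketch in which the group summaries (means, sds, counts) and the wls_fit() helper are all made up for illustration:]

```python
# Weight each group mean by the inverse squared standard error of the
# mean, n_i / s_i^2, then run a weighted fit on the group means.

def wls_fit(x, y, w):
    Sw   = sum(w)
    Swx  = sum(wi * xi for wi, xi in zip(w, x))
    Swy  = sum(wi * yi for wi, yi in zip(w, y))
    Swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = (Sw * Swxy - Swx * Swy) / (Sw * Swxx - Swx ** 2)
    a = (Swy - b * Swx) / Sw
    return a, b

x      = [1.0, 2.0, 3.0, 4.0]   # group-level predictor
means  = [2.1, 3.9, 6.2, 7.8]   # group means of the response
sds    = [0.5, 0.8, 0.6, 0.9]   # within-group sd's, s_i
counts = [10, 25, 8, 40]        # group sizes, n_i

weights = [n / s ** 2 for n, s in zip(counts, sds)]   # n_i / s_i^2
a_g, b_g = wls_fit(x, means, weights)
```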
Re: [R] Weighted least squares
http://en.wikipedia.org/wiki/Weighted_least_squares gives a formulaic
description of what you have said.

I believe the original poster has converted something like this

  y x
  0 1.1
  0 2.2
  0 2.2
  0 2.2
  1 3.3
  1 3.3
  2 4.4
  ...

into something like the following

  y x   freq
  0 1.1  1
  0 2.2  3
  1 3.3  2
  2 4.4  1
  ...

Now, the variance of means of each row in the table above is ZERO because
the individual elements that comprise each row are identical. Therefore your
method of using inverse-variance will not work here.

Is it valid then to use lm( y ~ x, weights=freq )?

Regards, Adai

S Ellison wrote:
> Hadley,
> You asked
> > .. what is the usual way to do a linear regression when you have
> > aggregated data?
> Least squares generally uses inverse variance weighting. For aggregated
> data fitted as mean values, you just need the variances for the _means_.
> So if you have individual means x_i and sd's s_i that arise from
> aggregated data with n_i observations in group i, the natural weighting is
> by inverse squared standard error of the mean. The appropriate weight for
> x_i would then be n_i/(s_i^2). In R, that's n/(s^2), as n and s would be
> vectors with the same length as x.
> If all the groups had the same variance, or nearly so, s is a scalar; if
> they have the same number of observations, n is a scalar. Of course, if
> they have the same variance and same number of observations, they all have
> the same weight and you needn't weight them at all: see previous posting!
> Steve E
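[Editorially, whatever lm() does with the standard errors, the coefficient side of weights=freq really is equivalent to expanding the table back to one row per observation - the weighted sums are identical by construction. A Python sketch using Adai's table; wls_fit() is a made-up closed-form helper, not R code:]

```python
# Case weights as frequencies: the aggregated fit with w = freq and the
# fully expanded, one-row-per-observation fit give identical coefficients.
from itertools import chain, repeat

def wls_fit(x, y, w):
    Sw   = sum(w)
    Swx  = sum(wi * xi for wi, xi in zip(w, x))
    Swy  = sum(wi * yi for wi, yi in zip(w, y))
    Swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = (Sw * Swxy - Swx * Swy) / (Sw * Swxx - Swx ** 2)
    a = (Swy - b * Swx) / Sw
    return a, b

y    = [0.0, 0.0, 1.0, 2.0]
x    = [1.1, 2.2, 3.3, 4.4]
freq = [1, 3, 2, 1]

a_w, b_w = wls_fit(x, y, freq)               # aggregated rows + frequencies

# expand back to one row per observation, all weights 1
x_long = list(chain.from_iterable(repeat(xi, f) for xi, f in zip(x, freq)))
y_long = list(chain.from_iterable(repeat(yi, f) for yi, f in zip(y, freq)))
a_u, b_u = wls_fit(x_long, y_long, [1.0] * len(x_long))

# Coefficients agree; only the standard errors (not computed here) differ.
```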
[R] Weighted Least Squares with glm
Hello all,

I have a question concerning the weights used in the glm function. I need to fit a linear model (family = gaussian) with only one regressor. Unfortunately I have only 6 distinct points: y_i = alpha + beta*x_i, i = 1,...,6. Each of i = 1,...,5 has been observed 60 times, while i = 6 has been observed only 30 times. This is why I wanted to use a weighted least squares approach instead of ordinary least squares. I used the weights w <- c(1, 1, 1, 1, 1, 0.5). Can anybody tell me if this is correct? Thanks.
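If the six y_i are group means, weights proportional to the group sizes are the natural choice, and rescaling the weight vector by a constant does not change the coefficient estimates, so c(1, 1, 1, 1, 1, 0.5) and c(60, 60, 60, 60, 60, 30) give the same fit. A sketch with made-up y values (the original post does not give the data):

```
x <- 1:6
y <- c(2.1, 3.9, 6.2, 8.0, 9.8, 12.3)  # made-up group means

w1 <- c(1, 1, 1, 1, 1, 0.5)             # the poster's weights
w2 <- c(60, 60, 60, 60, 60, 30)         # observations per group

fit1 <- glm(y ~ x, family = gaussian, weights = w1)
fit2 <- glm(y ~ x, family = gaussian, weights = w2)

# The two weight vectors are proportional, so the coefficient
# estimates are identical.
coef(fit1)
coef(fit2)
```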
Re: [R] Weighted least squares
On Tue, 18 Jan 2005, Prof Brian Ripley wrote:

> On Mon, 17 Jan 2005, Ming Hsu wrote:
>
>> I would like to run a weighted least squares with the weighting matrix W.
>
> This is generalized, not weighted, least squares if W really is a matrix
> and not a vector of case-by-case weights.
>
>> I ran the following two regressions,
>>
>>   (W^-1)Y = Xb + e
>>   Y = WXb + We
>
> If W is a diagonal matrix, the weights for the second are W^(-2), and you
> used W below.
>
>> In both cases, E[bhat] = b. I used the following commands in R
>>
>>   lm1 <- lm(Y/W ~ X)
>>   lm2 <- lm(Y ~ W:X, weights = W)
>>
>> where
>>
>>   Y <- rnorm(10, 1)
>>   X <- Y + rnorm(10, 1)
>>   W <- 1:10
>
> That W is not a matrix!
>
>> In lm2, I believe W is applied to the error term, resulting in WLS.
>> However the estimated coefficients in lm1 and lm2 are really different.
>> I tried glm as well, but the result was the same as lm. Any advice
>> would be appreciated,
>
> Please do check that you supply an example that agrees with your words.

Another difference is that your equations have no intercept, but the lm calls do, so you need to add +0 to their right-hand sides.

> Use lm.gls in MASS or gls in nlme for generalized least squares, if that
> is what you meant.

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA); Fax: +44 1865 272595
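Putting the two corrections together — weights W^(-2) rather than W, and no intercept — the two formulations do produce identical coefficients. A sketch using the poster's own simulated data:

```
set.seed(1)
Y <- rnorm(10, 1)
X <- Y + rnorm(10, 1)
W <- 1:10

# (W^-1)Y = Xb + e, fitted by OLS without an intercept
lm1 <- lm(I(Y / W) ~ X + 0)

# Y = WXb + We: the errors We have variance proportional to W^2,
# so the inverse-variance weights are W^(-2); again no intercept
lm2 <- lm(Y ~ I(W * X) + 0, weights = 1 / W^2)

# The two slope estimates now agree
coef(lm1)
coef(lm2)
```

Algebraically, lm2 minimizes sum((1/W^2) * (Y - W*X*b)^2) = sum((Y/W - X*b)^2), which is exactly the criterion lm1 minimizes.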
[R] Weighted least squares
Hi,

I would like to run a weighted least squares with the weighting matrix W. I ran the following two regressions,

  (W^-1)Y = Xb + e
  Y = WXb + We

In both cases, E[bhat] = b. I used the following commands in R

  lm1 <- lm(Y/W ~ X)
  lm2 <- lm(Y ~ W:X, weights = W)

where

  Y <- rnorm(10, 1)
  X <- Y + rnorm(10, 1)
  W <- 1:10

In lm2, I believe W is applied to the error term, resulting in WLS. However the estimated coefficients in lm1 and lm2 are really different. I tried glm as well, but the result was the same as lm. Any advice would be appreciated,

Hsu Ming
Re: [R] Weighted least squares
On Mon, 17 Jan 2005, Ming Hsu wrote:

> I would like to run a weighted least squares with the weighting matrix W.

This is generalized, not weighted, least squares if W really is a matrix and not a vector of case-by-case weights.

> I ran the following two regressions,
>
>   (W^-1)Y = Xb + e
>   Y = WXb + We

If W is a diagonal matrix, the weights for the second are W^(-2), and you used W below.

> In both cases, E[bhat] = b. I used the following commands in R
>
>   lm1 <- lm(Y/W ~ X)
>   lm2 <- lm(Y ~ W:X, weights = W)
>
> where
>
>   Y <- rnorm(10, 1)
>   X <- Y + rnorm(10, 1)
>   W <- 1:10

That W is not a matrix!

> In lm2, I believe W is applied to the error term, resulting in WLS.
> However the estimated coefficients in lm1 and lm2 are really different.
> I tried glm as well, but the result was the same as lm. Any advice
> would be appreciated,

Please do check that you supply an example that agrees with your words.

Use lm.gls in MASS or gls in nlme for generalized least squares, if that is what you meant.

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA); Fax: +44 1865 272595
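For the generalized case Prof. Ripley mentions, lm.gls in MASS accepts a weight matrix W directly. A sketch using the poster's simulated data; here W is taken to be diagonal, in which case lm.gls reduces to ordinary WLS and matches lm() with a weights vector:

```
library(MASS)

set.seed(1)
Y <- rnorm(10, 1)
X <- Y + rnorm(10, 1)
w <- 1:10

# lm.gls with a diagonal weight matrix W = diag(w) is the
# special case equivalent to lm(Y ~ X, weights = w)
fit.gls <- lm.gls(Y ~ X, W = diag(w))
fit.wls <- lm(Y ~ X, weights = w)

coef(fit.gls)
coef(fit.wls)
```

A genuinely non-diagonal W (correlated errors) is where lm.gls, or gls in nlme, does something lm() cannot.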
[R] weighted least squares
I apologize, in advance, for cross-posting this to the R listserv. I have submitted this query twice to the S listserv (yesterday and this morning) and neither post has made it; I am not sure why.

When I run the code

  gls.1 <- gls(y ~ x, data = foo.frame,
               weights = varPower(form = ~ fitted(.) | group),
               control = list(maxIter = 500, msMaxIter = 500),
               verbose = TRUE)

I get the warning message

  a database with name controlvals already in search list:
  access by where=controlvals may not work in: attach(controlvals)

and S-Plus tells me that the maximum number of iterations was reached without convergence. What's going on? I don't remember ever attaching a database named controlvals.

Respectfully, and much thanks in advance,

David Paul