Sorry, you did not explain that your weights correspond to your frequency in the original post. I assumed they were repeated measurements with within group variation.
I was merely responding to your query why the following differed. summary(lm(y ~ x, data=df, weights=rep(2, 100))) summary(lm(y ~ x, data=rbind(df,df))) Let me also clarify my statement about "artificial". If one treats repeated observations as independent, then they obtain estimates with inflated precision. I was not calling your data artificial in any way. Using frequency as weights may be valid. Your data points appear to arise from discrete distribution, so I am not entirely sure if you can use the linear model which assumes the errors are normally distributed. Regards, Adai hadley wickham wrote: > On 5/8/07, Adaikalavan Ramasamy <[EMAIL PROTECTED]> wrote: >> See below. >> >> hadley wickham wrote: >> > Dear all, >> > >> > I'm struggling with weighted least squares, where something that I had >> > assumed to be true appears not to be the case. Take the following >> > data set as an example: >> > >> > df <- data.frame(x = runif(100, 0, 100)) >> > df$y <- df$x + 1 + rnorm(100, sd=15) >> > >> > I had expected that: >> > >> > summary(lm(y ~ x, data=df, weights=rep(2, 100))) >> > summary(lm(y ~ x, data=rbind(df,df))) >> >> You assign weights to different points according to some external >> quality or reliability measure not number of times the data point was >> measured. > > That is one type of weighting - but what if I have already aggregated > data? That is a perfectly valid type of weighting too. > >> Look at the estimates and standard error of the two models below: >> >> coefficients( summary(f.w <- lm(y ~ x, data=df, weights=rep(2, 100))) ) >> Estimate Std. Error t value Pr(>|t|) >> (Intercept) 1.940765 3.30348066 0.587491 5.582252e-01 >> x 0.982610 0.05893262 16.673448 2.264258e-30 >> >> coefficients( summary( f.u <- lm(y ~ x, data=rbind(df,df) ) ) ) >> Estimate Std. Error t value Pr(>|t|) >> (Intercept) 1.940765 2.32408609 0.8350659 4.046871e-01 >> x 0.982610 0.04146066 23.6998165 1.012067e-59 >> >> You can see that they have same coefficient estimates but the second one >> has smaller variances. >> >> The repeated values artificially deflates the variance and thus inflates >> the precision. This is why you cannot treat replicate data as >> independent observations. > > Hardly artificially - I have repeated observations. > >> > would be equivalent, but they are not. I suspect the difference is >> > how the degrees of freedom is calculated - I had expected it to be >> > sum(weights), but seems to be sum(weights > 0). This seems >> > unintuitive to me: >> > >> > summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50))) >> > summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50))) >> > >> > What am I missing? And what is the usual way to do a linear >> > regression when you have aggregated data? >> >> I would be best to use the individual data points instead of aggregated >> data as it allows you to estimate the within-group variations as well. > > There is no within group variation - these are observations that occur > with same values many times in the dataset, so have been aggregated > into the a contingency table-like format. > >> If you had individual data points, you could try something as follows. >> Please check the codes as I am no expert in the area of repeated >> measures. >> >> x <- runif(100, 0, 100) >> y1 <- x + rnorm(100, mean=1, sd=15) >> y2 <- y1 + rnorm(100, sd=5) >> >> df <- data.frame( y=c(y1, y2), >> x=c(x,x), >> subject=factor(rep( paste("p", 1:100, sep=""), 2 ) )) >> >> library(nlme) >> summary( lme( y ~ x, random = ~ 1 | subject, data=df ) ) >> >> Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related >> material for more information. Hope this helps. > > I'm not interested in a mixed model, and I don't have individual data > points. > > Hadley > > > ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.