pasting reply by Xavier into this thread for the sake of others who might come across it in their R travels. Thanks Xavier
Hi, It seems like your multivariate regression works for groups A and B. For group C, values are calculated using an interaction between x1 and x2. Try to include it in the regression model to find the right coeffs: y ~ grp:x1:x2 + grp:x1 + grp:x2 + grp - 1 It works for me. Notice in summary() that the coefficients for the x1:x2 interaction are not significantly different from 0 in groups A and B, which is what we expected. HTH, Xavier Chris Friedl wrote: > > I have a large dataset grouped by a factor and I want to perform a > regression on each data subset based on this factor. There are many ways > to do this, posted here and elsewhere. I have tried several. However I > found one method posted on the R wiki which works exactly as I want, and I > like the elegance and simplicity of the solution, but I don't understand > how it works. Its all in the formula specification. Problem is also that I > want to extend it from a single variable case to a multivariable case and > haven't really got a clue how to go about it. Hope someone can help. > > # The code below creates a data.frame comprising two marginally noisy > straight lines and performs a > # regression on each based on the grp factor. This is a single variable > case based on R wiki one liner. > # y = x for grp A > # y = 2 + 5x for grp B > set.seed(5) > ind <- as.vector(runif(50)) > d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50)) > d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B", 50)) > data1 <- rbind(d1,d2) > fits <- lm(y ~ grp:x + grp - 1, data=data1) > summary(fits) > with(data1, plot(x,y)) > abline(coef(fits)[c(1,3)], col="red") > abline(coef(fits)[c(2,4)], col="blue") > > Output: > Call: > lm(formula = y ~ grp:x + grp - 1, data = data1) > > Residuals: > Min 1Q Median 3Q Max > -0.206414 -0.057582 -0.001376 0.062737 0.242111 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > grpA -0.02935 0.02811 -1.044 0.299 > grpB 1.96200 0.02811 69.803 <2e-16 *** > grpA:x 1.06276 0.04921 21.596 <2e-16 *** > grpB:x 5.09351 0.04921 103.502 <2e-16 *** > --- > > We can see that the intercepts of are close to 0 and 2 and slopes close to > 1 and 5 as suggested by the data creation equation. In fact these results > are identical to running lm individually on each subset of data with > formula y ~ x. What I don't understand is how this result comes about from > the formula specification y ~ grp:x + grp -1 . Can someone offer an > explanation? > > > In addition I want to apply the same idea for a multivariate data set. The > code below shows a multivariate dataset which I hope to fit by group. > > # The code below creates a data.frame comprising three marginally noisy > planes. I want to perform a > # regression on each based on the grp factor. > # y = x1 + x2 for grp A > # y = 2 + 2x1 + 4x2 for grp B > # y = 2 + 2x1 + 4x2 + 6x1x2 for grp C > ind <- matrix(runif(100), ncol=2, dimnames=list(c(), c("x1","x2"))) > d1 <- data.frame(ind, y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1), > grp=rep("A", 100)) > d2 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1), > grp=rep("B", 100)) > d3 <- data.frame(ind, > y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C", > 100)) > data2 <- rbind(d1,d2,d3) > # visualise > ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not found")) > fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2) > summary(fits) > > Output: > > Call: > lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2) > > Residuals: > Min 1Q Median 3Q Max > -1.338348 -0.092069 -0.002844 0.094813 1.259094 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > grpA -0.01943 0.08442 -0.230 0.818 > grpB 1.98345 0.08442 23.495 < 2e-16 *** > grpC 0.51002 0.08442 6.042 4.66e-09 *** > grpA:x1 1.03376 0.11769 8.784 < 2e-16 *** > grpB:x1 2.02563 0.11769 17.212 < 2e-16 *** > grpC:x1 4.83101 0.11769 41.050 < 2e-16 *** > grpA:x2 1.02371 0.09664 10.593 < 2e-16 *** > grpB:x2 4.00069 0.09664 41.396 < 2e-16 *** > grpC:x2 7.22331 0.09664 74.742 < 2e-16 *** > --- > Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > > Residual standard error: 0.2838 on 291 degrees of freedom > Multiple R-squared: 0.9973, Adjusted R-squared: 0.9972 > F-statistic: 1.199e+04 on 9 and 291 DF, p-value: < 2.2e-16 > > Whilst there seems to be some correlation between coefficients and terms > that doesn't apply in all cases. I've tried many other combinations but > noe that gave me coefficients similar to the defining equations. Bottom > line is I don't understand what the formula specification should be for > the multivariate version. > > Thanks for any help given. > > Chris > > -- View this message in context: http://www.nabble.com/Regression-by-groups-questions-tp24070481p24091659.html Sent from the R help mailing list archive at Nabble.com. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.