Re: [R] predict.lm point forecasts with factors
On Wed, 2007-02-14 at 13:54 -0700, sj wrote:
> hello,
>
> I am trying to use predict.lm to make point forecasts based on a model
> with continuous and categorical independent variables. I have no
> problems fitting the model using lm, but when I try to use predict to
> make point predictions, it reverts back to the original data frame and
> gives me the point predictions for the fitted data rather than for the
> new data. I imagine that I am missing something simple, but for
> whatever reason I can't figure out why it does not like the new data
> and is reverting to the fitted data. The following code illustrates
> the problem I am running into. Any help would be appreciated.
>
> f1 <- rep(c("a","b","c","d"),25)
> f2 <- sample(rep(c("e","f","g","h"),250),100)
> x <- rnorm(100,100)
> y <- rnorm(100,150)
> mdl <- lm(y~x+f1+f2)
> f12 <- rep(c("a","b","c","d"),5)
> f22 <- sample(rep(c("e","f","g","h"),250),20)
> x2 <- rnorm(20,100)
> new <- data.frame(cbind(f12[1],f22[1],x2[1]))
> predict(mdl,new)
>
> best,
>
> Spencer

Spencer,

You have two distinct issues going on here:

The initial model that you create, 'mdl', is based upon 'f1' and 'f2'
being created as character vectors, not as factors. While the modeling
functions will internally do the coercion, I do not believe that the
predict functions will. In fact, you should have noted the following
warning messages:

  > mdl <- lm(y~x+f1+f2)
  Warning messages:
  1: variable 'f1' converted to a factor in: model.matrix.default(mt, mf, contrasts)
  2: variable 'f2' converted to a factor in: model.matrix.default(mt, mf, contrasts)

So you end up with a 'class' conflict between the model frame object
and the new data object, since the latter will default to coercing
'f12' and 'f22' to factors.

Secondly, 'new' needs to have columns created with the SAME names as
those used in the original model.

Thus, a code sequence along the lines of the following should work:

  f1 <- rep(c("a","b","c","d"), 25)
  f2 <- sample(rep(c("e","f","g","h"), 250), 100)
  x <- rnorm(100, 100)
  y <- rnorm(100, 150)

  # Create a data frame from the data
  # so that f1 and f2 become factors
  DF <- data.frame(y, x, f1, f2)

  mdl <- lm(y ~ x + f1 + f2, DF)

  f12 <- rep(c("a","b","c","d"), 5)
  f22 <- sample(rep(c("e","f","g","h"), 250), 20)
  x2 <- rnorm(20, 100)

  # Create 'new' in the same way, but naming the
  # columns the same as in 'DF' above
  new <- data.frame(f1 = f12, f2 = f22, x = x2)

  # Now run predict on the first row in 'new'
  > predict(mdl, new[1, ])
  [1] 150.3273

The number you come up with should be different, since you are using
random data.

HTH,

Marc Schwartz

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
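A hedged variant of the same idea, not part of the original exchange: since R 4.0.0, data.frame() no longer converts character columns to factors by default, so the sketch below creates the factors explicitly and reuses the levels stored in the fitted model when building the new data. The seed, the single new row, and the choice to read the levels from mdl$xlevels are assumptions made for illustration.

```r
# Sketch only: simulated data, so the predicted value itself is arbitrary.
set.seed(1)
DF <- data.frame(
  y  = rnorm(100, 150),
  x  = rnorm(100, 100),
  f1 = factor(rep(c("a", "b", "c", "d"), 25)),
  f2 = factor(sample(rep(c("e", "f", "g", "h"), 25)))
)
mdl <- lm(y ~ x + f1 + f2, data = DF)

# New data: same column names as in the formula, with factor levels taken
# from the fitted model so that they match the levels used in the fit.
new <- data.frame(
  x  = 100,
  f1 = factor("b", levels = mdl$xlevels$f1),
  f2 = factor("g", levels = mdl$xlevels$f2)
)

p <- predict(mdl, new)
length(p)  # one prediction, for the one row in 'new'
```

Taking the levels from the model keeps the new factors aligned with the fit even when the new data contain only some of the levels.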
Re: [R] predict.lm variables found question
Larry White [EMAIL PROTECTED] writes:

> hello,
>
> I'm trying to predict some values based on a linear regression model.
> I've created the model using one data frame, and have the prediction
> values in a second data frame (call it newdata). There are 56 rows in
> the data frame used to create the model and 15 in newdata.
>
> I ran predict(model1, newdata) and got the warning:
>
>   'newdata' had 15 rows but variable(s) found have 56 rows
>
> When I checked help(predict.lm) I found this:
>
>   Variables are first looked for in newdata and then searched for in
>   the usual way (which will include the environment of the formula
>   used in the fit). A warning will be given if the variables found
>   are not of the same length as those in newdata if it was supplied.
>
> My questions are: how can I just get predicted values for the 15 rows
> in the newdata data frame, and if that's not possible, how can I tell
> which of the 56 predicted values are derived from newdata only, if
> any?

You need to have all your predictors represented in newdata. You seem
to have at least one of them missing (a typo in a variable name could
do that).

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                         FAX: (+45) 35327907
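One way to check for a missing predictor before calling predict() is to compare the variable names used in the model with the column names of newdata. The following is a minimal sketch, not taken from the thread: it uses the built-in mtcars data, and the model, the deliberately misspelled column HP, and the name missing_vars are all made up for illustration.

```r
# Hypothetical model on built-in data, standing in for 'model1'.
model1 <- lm(mpg ~ wt + hp, data = mtcars)

# 'newdata' with a typo: the column is called HP instead of hp.
newdata <- data.frame(wt = c(2.5, 3.0), HP = c(100, 150))

# Predictor names the model needs (the response is dropped first).
needed <- all.vars(delete.response(terms(model1)))

# Predictors that predict() would have to look for outside 'newdata'.
missing_vars <- setdiff(needed, names(newdata))
missing_vars

# After fixing the name, predict() returns one value per row of 'newdata'.
names(newdata)[names(newdata) == "HP"] <- "hp"
pred <- predict(model1, newdata)
length(pred)
```

If missing_vars is non-empty, predict() falls back to searching the formula's environment for those variables, which is where the 56-row values in the warning come from.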
Re: [R] predict.lm
I think you got it right.

The mean of the (weighted) sum of a set of random variables is the
(weighted) sum of the means, and its variance is the (weighted) sum of
the individual variances (using squared weights). Here you don't have
to worry about weights.

So what you proposed does exactly this.

-Christos

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Bill Szkotnicki
Sent: Tuesday, May 02, 2006 2:59 PM
To: 'R-Help help'
Subject: [R] predict.lm

I have a model with a few correlated explanatory variables, i.e.

  m1=lm(y~x1+x2+x3+x4,protdata)

and I have used predict as follows:

  x=data.frame(x=1:36)
  yp=predict(m1,x,se.fit=T)
  tprot=sum(yp$fit) # add up the predictions
  tprot

tprot is the sum of the 36 predicted values and I would like the se of
that prediction. I think sqrt(sum(yp$se.fit^2)) is not correct.

Would anyone know the correct approach, i.e. how to get the se of a
function of predicted values (in this case sum)?

Thanks,
Bill
Re: [R] predict.lm
On Tue, 2 May 2006, Christos Hatzis wrote:

> I think you got it right.
>
> The mean of the (weighted) sum of a set of random variables is the
> (weighted) sum of the means, and its variance is the (weighted) sum of
> the individual variances (using squared weights). Here you don't have
> to worry about weights.
>
> So what you proposed does exactly this.

Yes, but the theory has assumptions which are not met here: the random
variables are correlated (in almost all cases).

> -Christos
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Bill Szkotnicki
> Sent: Tuesday, May 02, 2006 2:59 PM
> To: 'R-Help help'
> Subject: [R] predict.lm
>
> I have a model with a few correlated explanatory variables, i.e.
>
>   m1=lm(y~x1+x2+x3+x4,protdata)
>
> and I have used predict as follows:
>
>   x=data.frame(x=1:36)
>   yp=predict(m1,x,se.fit=T)

How can this work? You fitted the model to x1...x4 and supplied x.

>   tprot=sum(yp$fit) # add up the predictions
>   tprot
>
> tprot is the sum of the 36 predicted values and I would like the se of
> that prediction. I think sqrt(sum(yp$se.fit^2)) is not correct.
>
> Would anyone know the correct approach, i.e. how to get the se of a
> function of predicted values (in this case sum)?

You need to go back to the theory: it is easy to do for a linear
function, otherwise you will need to linearize.

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Re: [R] predict.lm
I did mean to use x1,x2,x3,x4 in the new data frame.

And I think the theory would be something like

  yhat = 1' K' bhat

and so the variance should be

  1' K'CK 1

where C = (X'X)^-1 and 1 is a vector of ones.

The question is: do I need to form these matrices and grind through
it, or is there an easier way?

Bill
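The matrix calculation described above can be sketched in a few lines of R. This is an illustration under stated assumptions, not an answer given in the thread: protdata and the new data frame newx are simulated here, K denotes the design matrix for the new data (one row per prediction), and vcov(m1) is used because it already equals the residual variance times (X'X)^-1.

```r
# Simulated stand-in for 'protdata' (hypothetical values).
set.seed(42)
n <- 50
protdata <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                       x3 = rnorm(n), x4 = rnorm(n))
protdata$y <- with(protdata, 1 + x1 + 0.5 * x2 - x3 + 0.2 * x4 + rnorm(n))
m1 <- lm(y ~ x1 + x2 + x3 + x4, data = protdata)

# 36 new rows that contain the predictors actually used in the fit.
newx <- data.frame(x1 = rnorm(36), x2 = rnorm(36),
                   x3 = rnorm(36), x4 = rnorm(36))

# Design matrix for the new data (36 x 5, including the intercept column).
K <- model.matrix(delete.response(terms(m1)), newx)

# The sum of the predictions is a' bhat, where a holds the column sums of K.
a <- colSums(K)
tprot <- sum(a * coef(m1))

# Its variance is a' V a with V = vcov(m1), which keeps the covariances
# between the individual predictions.
se_sum <- sqrt(drop(t(a) %*% vcov(m1) %*% a))

# For comparison: the version that ignores those covariances.
yp <- predict(m1, newx, se.fit = TRUE)
se_naive <- sqrt(sum(yp$se.fit^2))
c(se_sum = se_sum, se_naive = se_naive)
```

The two standard errors generally differ because the 36 predictions share the same estimated coefficients and are therefore correlated.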
Re: [R] predict.lm - standard error of predicted means?
[EMAIL PROTECTED] writes:

> Simple question.
>
> For a simple linear regression, I obtained the standard error of
> predicted means, for both a confidence and prediction interval:
>
> x <- 1:15
> y <- x + rnorm(n=15)
> model <- lm(y~x)
>
> predict.lm(model,newdata=data.frame(x=c(10,20)),se.fit=TRUE,interval="confidence")$se.fit
>         1         2
> 0.2708064 0.7254615
>
> predict.lm(model,newdata=data.frame(x=c(10,20)),se.fit=TRUE,interval="prediction")$se.fit
>         1         2
> 0.2708064 0.7254615
>
> I was surprised to find that the standard errors returned were in fact
> the standard errors of the sampling distribution of Y_hat:
>
>   sqrt(MSE * (1/n + (x-x_bar)^2/SS_x)),
>
> not the standard errors of Y_new (predicted value):
>
>   sqrt(MSE * (1 + 1/n + (x-x_bar)^2/SS_x)).
>
> Is there a reason this quantity is called the standard error of
> predicted means if it doesn't relate to the prediction distribution?

Yes. Yhat is the predicted mean and se.fit is its standard deviation.
It doesn't change its meaning because you desire another kind of
prediction interval.

> Turning to Neter et al.'s Applied Linear Statistical Models, I note
> that if we have multiple observations, then the standard error of the
> mean of the predicted value:
>
>   sqrt(MSE * (1/m + 1/n + (x-x_bar)^2/SS_x)),
>
> reverts to the standard error of the sampling distribution of Y-hat,
> as m, the number of samples, gets large. Still, this doesn't explain
> the result for small sample sizes.

You can make completely similar considerations regarding the standard
errors of and about an estimated mean: sigma*sqrt(1+1/n) vs.
sigma*sqrt(1/m + 1/n) vs. sigma*sqrt(1/n). SEM is still the latter
quantity even if you are interested in another kind of prediction
limit.

-- 
Peter Dalgaard
Dept. of Biostatistics, University of Copenhagen
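To make the distinction concrete, the standard error of a new observation can be assembled from the components that predict.lm already returns. The sketch below is an illustration rather than part of the original reply; the seed and the names se_mean, se_new and se_manual are assumptions introduced here.

```r
set.seed(1)
x <- 1:15
y <- x + rnorm(n = 15)
model <- lm(y ~ x)
new <- data.frame(x = c(10, 20))

p <- predict(model, newdata = new, se.fit = TRUE, interval = "prediction")

# se.fit is the standard error of the estimated mean at each new x,
# whichever interval type was requested.
se_mean <- p$se.fit

# Standard error for a single new observation: add the residual variance.
se_new <- sqrt(p$se.fit^2 + p$residual.scale^2)

# The textbook formula for the mean reproduces se.fit.
mse <- p$residual.scale^2
se_manual <- sqrt(mse * (1 / length(x) + (new$x - mean(x))^2 / sum((x - mean(x))^2)))

# The half-width of the prediction interval is the t quantile times se_new.
half_width <- p$fit[, "upr"] - p$fit[, "fit"]
cbind(se_mean, se_manual, se_new, half_width)
```

So the prediction interval does use the larger standard error internally; it is only the reported se.fit component that always refers to the fitted mean.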
Re: [R] predict.lm - standard error of predicted means?
Can someone please refer me to a function or method that resolves this
structuring issue:

I have two matrices with identical colnames (89), but varying numbers
of observations:

  matrix A      matrix B
  217 x 89      16063 x 89

I want to create one matrix C that has both matrices adjacent to one
another, where matrix A is duplicated many times to create the same
row number as matrix B, i.e. 16063.

  matrix A   matrix B
  matrix A
  matrix A

so matrix C will be 16063 x 178.

I've tried cbind() and merge() with no success.
Re: [R] predict.lm - standard error of predicted means?
mark salsburg [EMAIL PROTECTED] writes:

> Can someone please refer me to a function or method that resolves this
> structuring issue:
>
> I have two matrices with identical colnames (89), but varying numbers
> of observations:
>
>   matrix A      matrix B
>   217 x 89      16063 x 89
>
> I want to create one matrix C that has both matrices adjacent to one
> another, where matrix A is duplicated many times to create the same
> row number as matrix B, i.e. 16063.
>
>   matrix A   matrix B
>   matrix A
>   matrix A
>
> so matrix C will be 16063 x 178.
>
> I've tried cbind() and merge() with no success.

A: What the !!##¤ does this have to do with the subject line?

B: This should do it:

  cbind(A[rep(1:217,length=16063),], B)

-- 
Peter Dalgaard
Dept. of Biostatistics, University of Copenhagen
Re: [R] predict.lm - standard error of predicted means?
On Wed, 20 Jul 2005, Peter Dalgaard wrote:

> mark salsburg [EMAIL PROTECTED] writes:
>
>> Can someone please refer me to a function or method that resolves
>> this structuring issue:
>>
>> I have two matrices with identical colnames (89), but varying numbers
>> of observations:
>>
>>   matrix A      matrix B
>>   217 x 89      16063 x 89
>>
>> I want to create one matrix C that has both matrices adjacent to one
>> another, where matrix A is duplicated many times to create the same
>> row number as matrix B, i.e. 16063.
>>
>>   matrix A   matrix B
>>   matrix A
>>   matrix A
>>
>> so matrix C will be 16063 x 178.
>>
>> I've tried cbind() and merge() with no success.
>
> A: What the !!##¤ does this have to do with the subject line?
>
> B: This should do it:
>
>   cbind(A[rep(1:217,length=16063),], B)

But note that makes 74 + 5/217 copies of A, and I did wonder if that
was the intention (or if not, what was intended).

-- 
Brian D. Ripley
Professor of Applied Statistics, University of Oxford
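The row recycling and the partial last copy are easy to see on a small case. The matrices below are made-up stand-ins (3 and 7 rows instead of 217 and 16063), and the names idx, full_copies and leftover are introduced only for this sketch.

```r
# Small stand-ins for the 217-row and 16063-row matrices.
A <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))
B <- matrix(101:114, nrow = 7, dimnames = list(NULL, c("a", "b")))

# Recycle the row indices of A until there are as many as rows in B.
idx <- rep(seq_len(nrow(A)), length.out = nrow(B))
idx  # 1 2 3 1 2 3 1: two full copies of A plus one extra row

C <- cbind(A[idx, , drop = FALSE], B)
dim(C)  # 7 rows, 4 columns

# The same arithmetic for the sizes in the question.
full_copies <- 16063 %/% 217  # 74 complete copies of A
leftover    <- 16063 %% 217   # plus the first 5 rows of a 75th copy
c(full_copies, leftover)
```

When the row count of B is not a multiple of that of A, the final copy is truncated, which is the point raised in the reply above.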
Re: [R] predict.lm with (logical) NA vector
On Mon, 10 Nov 2003, Edzer J. Pebesma wrote:

> I was surprised by the following (R 1.8.0):
>
> R> lm.fit = lm(y~x, data.frame(x=1:10, y=1:10))
> R> predict(lm.fit, data.frame(x = rep(NA, 10)))
>              1              2              3              4              5
> -1.060998e-314 -1.060998e-314 -1.060998e-314 -1.060998e-314 -1.060998e-314
>              6              7              8              9             10
>       0.00e+00  1.406440e-269  6.715118e-265  4.940656e-323  1.782528e-265
> R> predict(lm.fit, data.frame(x = as.numeric(rep(NA, 10))))
>  1  2  3  4  5  6  7  8  9 10
> NA NA NA NA NA NA NA NA NA NA
>
> Shouldn't the first predict() call return NAs, or else issue an error
> message?

The prediction methods do not in general check that new variables you
give are of the correct type: the type used in the fit is not recorded
in the model object. In this case a logical column will `work'
provided it has two values (even with NAs).

We can probably trap this exact case, but there will remain a lot of
scope for user error.

-- 
Brian D. Ripley
Professor of Applied Statistics, University of Oxford
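The root of the surprise is that a bare NA is a logical value, so rep(NA, 10) creates a logical column rather than a numeric one. The sketch below is an illustration, not part of the original exchange: it uses typed missing values (NA_real_) for the new data and wraps the logical case in tryCatch() without assuming what a given R version does with it. The names fit, p_num and res_lgl are hypothetical.

```r
fit <- lm(y ~ x, data = data.frame(x = 1:10, y = 1:10))

class(NA)        # "logical"
class(NA_real_)  # "numeric"

# Numeric missing values: every prediction is NA, as one would expect.
p_num <- predict(fit, data.frame(x = rep(NA_real_, 10)))
all(is.na(p_num))

# Logical missing values: the outcome depends on the R version, so any
# error is captured as a message instead of being relied upon.
res_lgl <- tryCatch(
  predict(fit, data.frame(x = rep(NA, 10))),
  error = function(e) conditionMessage(e)
)
res_lgl
```

Building newdata with NA_real_ (or NA_integer_, NA_character_) keeps the column type aligned with the type used in the fit.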