Re: [Rd] Proposal: model.data
On Thu, May 3, 2012 at 12:19 PM, Brian G. Peterson br...@braverock.com wrote: On Thu, 2012-05-03 at 12:09 -0500, Paul Johnson wrote: Greetings: On Thu, May 3, 2012 at 11:36 AM, Brian G. Peterson br...@braverock.com wrote: On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote: If somebody in R Core would like this and think about putting it, or something like it, into the base, then many chores involving predicted values would become much easier. Why does this need to be in base? Implement it in a package. So, nobody agrees with me that R base should have model.data? Too bad! You could save us a lot of effort. I've found different efforts to get the same work done in termplot and all of the implementations of predict. And just about every regression related package has its own approach. If there were one good way that would always work, just think how convenient it would be for all those function package writers. Oh, well. I'm not saying that my model.data is perfect, just that I wish there were a perfect one :) Yesterday, I realized that predict.nls probably has to deal with this problem so I studied that and have yet another version of model.data to propose to you. I'm using this in my regression-support package rockchalk, so you don't need to give me Brian Peterson's advice to put it in a package. The idea here is to take variables from a fitted model's data if it can find them, and then grab variables from the parent environment IF they have the correct length. This means we ignore variables like d in poly(x, d) because the variable d is not of the same length as the variables in the model. ##' Creates a raw (UNTRANSFORMED) data frame equivalent ##' to the input data that would be required to fit the given model. ##' ##' Unlike model.frame and model.matrix, this does not return transformed ##' variables. ##' ##' @param model A fitted regression model in which the data argument ##' is specified. This function will fail if the model was not fit ##' with the data option. ##' @return A data frame ##' @export ##' @author Paul E. Johnson pauljohn@@ku.edu ##' @example inst/examples/model.data-ex.R model.data - function(model){ #from nls, returns -1 for missing variables lenVar - function(var, data) tryCatch(length(eval(as.name(var), data, env)), error = function(e) -1) fmla - formula(model) varNames - all.vars(fmla) ## all variable names ## varNames includes d in poly(x,d), possibly other constants ## varNamesRHS - all.vars(formula(delete.response(terms(model ## previous same as nls way? fmla2 - fmla fmla2[[2L]] - 0 varNamesRHS - all.vars(fmla2) varNamesLHS - setdiff(varNames, varNamesRHS) env - environment(fmla) if (is.null(env)) env - parent.frame() dataOrig - eval(model$call$data, environment(formula(model))) rndataOrig - row.names(dataOrig) n - sapply(varNames, lenVar, data=dataOrig) targetLength - length(eval(as.name(varNamesLHS[1]), dataOrig, env)) varNames - varNames[ n == targetLength ] ldata - lapply(varNames, function(x) eval(as.name(x), dataOrig, env)) names(ldata) - varNames data - data.frame(ldata[varNames]) if (!is.null(rndataOrig)) row.names(data) - rndataOrig ## remove rows listed in model's na.action ## TODO: question: what else besides OMIT might be called for? if ( !is.null(model$na.action)){ data - data[ -as.vector(model$na.action), ] } ## keep varNamesRHS that exist in datOrig attr(data, varNamesRHS) - setdiff(colnames(data), varNamesLHS) invisible(data) } And some example output: ## check if model.data works when there is no data argument set.seed(12345) x1 - rpois(100, l=6) x2 - rnorm(100, m=50, s=10) x3 - rnorm(100) y - rnorm(100) m0 - lm(y ~ x1 + x2 + x3) m0.data - model.data(m0) x1[4:5] - NA m0 - lm(y ~ x1 + x2 + x3) m0.data - model.data(m0) head(m0.data) y x1 x2 x3 1 -0.8086741 7 44.59614 -1.6193283 2 1.0011198 9 69.47693 0.5483979 3 0.4560525 8 50.53590 0.1952822 6 0.6417692 4 52.77954 -0.2509466 7 -0.4150210 5 56.91171 1.6993467 8 -0.4595757 6 58.23795 -0.3442988 x1 - rpois(100, l=6) x2 - rnorm(100, m=50, s=10) x3 - rnorm(100) xcat1 - gl(2,50, labels=c(M,F)) xcat2 - cut(rnorm(100), breaks=c(-Inf, 0, 0.4, 0.9, 1, Inf), labels=c(R, M, D, P, G)) dat - data.frame(x1, x2, x3, xcat1, xcat2) rm(x1, x2, x3, xcat1, xcat2) xcat1n - with(dat, contrasts(xcat1)[xcat1, ,drop=FALSE]) xcat2n - with(dat, contrasts(xcat2)[xcat2, ]) STDE - 20 dat$y - 0.03 + 0.8*dat$x1 + 0.1*dat$x2 + 0.7*dat$x3 + xcat1n %*% c(2) + xcat2n %*% c(0.1,-2,0.3, 0.1) + STDE*rnorm(100) rownames(dat$y) - NULL m1 - lm(y ~ poly(x1, 2), data=dat) m1.data - model.data(m1) head(m1.data) y x1 1 56.336279 7 2 47.823205 5 3 9.296108 6 4 16.213508 5 5 -16.922331 3 6 10.639724 7 attr(m1.data, varNamesRHS) [1] x1 d - 2 m2 - lm(y ~ poly(x1, d),
Re: [Rd] Proposal: model.data
PJ == Paul Johnson pauljoh...@gmail.com on Thu, 3 May 2012 12:09:08 -0500 writes: PJ Greetings: On Thu, May 3, 2012 at 11:36 AM, Brian PJ G. Peterson br...@braverock.com wrote: On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote: If somebody in R Core would like this and think about putting it, or something like it, into the base, then many chores involving predicted values would become much easier. Why does this need to be in base? Implement it in a package. If it works, and is additive, people will use it. Look at 'reshape' or 'xts' or 'Matrix' just to name a few examples of widely used packages. PJ I can't use it to fix termplot unless it is in base. Indeed. I think it would be useful to branch this topic into one part where it is about fixing termplot() and predict(*, type = terms) because -- as it happens -- I've got this on my todo list to look at, for a couple of weeks now. I also think there are infelicities in there that could/should be improved. PJ Or are you suggesting I create my own termplot PJ replacement? not necessarily.. ;-) Martin __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Proposal: model.data
Greetings: I'm still working on functions to make it easier for students to interpret regression predictions. I am working out a scheme to more easily create newdata objects (for use in predict functions). This has been done before in packages like Zelig, Effects, rms, and others. Nothing is exactly right, though. Stata users keep flashing their predicted probabity tables at me and I'd like something like that to use with R. I'm proposing here a function model.data that receives a regression and creates a dataframe of raw data from which newdata objects can be constructed. This follows a suggestion that Bill Dunlap made to me in response to a question I posted in r-help. While studying termplot code, I saw the carrier function approach to deducing the raw predictors. However, it does not always work. Here is one problem. termplot mistakes 10 in log(10 + x1) for a variable Example: dat - data.frame(x1 = rnorm(100), x2 = rpois(100, lambda=7)) STDE - 10 dat$y - 1.2 * log(10 + dat$x1) + 2.3 * dat$x2 + rnorm(100, sd = STDE) m1 - lm( y ~ log(10 + x1) + x2, data=dat) termplot(m1) ## See the trouble? termplot thinks 10 is the term to plot. Another problem is that predict( type=terms) does not behave sensibly sometimes. RHS of formula that have nonlinear transformations are misunderstood as separate terms. ##Example: dat$y2 - 1.2 * log(10 + dat$x1) + 2.3 * dat$x1^2 + rnorm(100, sd = STDE) m2 - lm( y2 ~ log(10 + x1) + sin(x1), data=dat) summary(m2) predict(m2, type=terms) ## Output: ## log(10 + x1) sin(x1) ## 1 1.50051781 -2.04871711 ## 2-0.14707391 0.31131124 What I wish would happen instead is one correct prediction for each value of x1. This should be the output: predict(m2, newdata = data.frame(x1 = dat$x1)) ## predict(m2, newdata = data.frame(x1 = dat$x1)) ## 12345678 ## 17.78563 18.49806 17.50719 19.70093 17.45071 19.69718 18.84137 18.89971 The fix I'm testing now is the following new function, model.data. which tries to re-create the data object that would be consistent with a fitted model. This follows a suggestion from Bill Dunlap in r-help on 2012-04-22 ##' Creates a raw (UNTRANSFORMED) data frame equivalent ##' to the input data that would be required to fit the given model. ##' ##' Unlike model.frame and model.matrix, this does not return transformed ##' variables. ##' ##' @param model A fitted regression model in which the data argument ##' is specified. This function will fail if the model was not fit ##' with the data option. ##' @return A data frame ##' @export ##' @author Paul E. Johnson pauljohn@@ku.edu ##' @example inst/examples/model.data-ex.R model.data - function(model){ fmla - formula(model) allnames - all.vars(fmla) ## all variable names ## indep variables, includes d in poly(x,d) ivnames - all.vars(formula(delete.response(terms(model ## datOrig: original data frame datOrig - eval(model$call$data, environment(formula(model))) if (is.null(datOrig))stop(model.data: input model has no data frame) ## datOrig: almost right, but includes d in poly(x, d) dat - get_all_vars(fmla, datOrig) ## Get rid of d and other non variable variable names that are not in datOrig: keepnames - intersect(names(dat), names(datOrig)) ## Keep only rows actually used in model fit, and the correct columns dat - dat[ row.names(model$model) , keepnames] ## keep ivnames that exist in datOrig attr(dat, ivnames) - intersect(ivnames, names(datOrig)) invisible(dat) } This works for the test cases like log(10+x) and so forth: ## Examples: head(m1.data - model.data(m1)) head(m2.data - model.data(m2)) ## head(m1.data - model.data(m1)) ## y x1 x2 ## 1 18.53846 0.46176539 8 ## 2 28.24759 0.09720934 7 ## 3 23.88184 0.67602556 9 ## 4 23.50130 -0.74877054 8 ## 5 25.81714 1.02555255 5 ## 6 24.75052 -0.69659539 6 ## head(m2.data - model.data(m2)) ## y x1 ## 1 18.53846 0.46176539 ## 2 28.24759 0.09720934 ## 3 23.88184 0.67602556 ## 4 23.50130 -0.74877054 ## 5 25.81714 1.02555255 ## 6 24.75052 -0.69659539 d - 2 m4 - lm(y ~ poly(x1,d), data=dat) head(m4.data - model.data(m4)) ## y x1 ## 1 18.53846 0.46176539 ## 2 28.24759 0.09720934 ## 3 23.88184 0.67602556 Another strength of this approach is that the return object has an attribute ivnames. If R's termplot used model.dat instead of the carrier functions, this would make for a much tighter set of code. What flaws do you see in this? One flaw is that I did not know how to re-construct data from the parent environment, so I insist the regression model has to have a data argument. Is this necessary, or can one of the R experts help. Another possible flaw: I'm keeping the columns from the data frame that are needed to re-construct the model.frame, and to match the rows, I'm using row.names for the model.frame. Are there other formulae that
Re: [Rd] Proposal: model.data
On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote: If somebody in R Core would like this and think about putting it, or something like it, into the base, then many chores involving predicted values would become much easier. Why does this need to be in base? Implement it in a package. If it works, and is additive, people will use it. Look at 'reshape' or 'xts' or 'Matrix' just to name a few examples of widely used packages. Regards, - Brian __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Proposal: model.data
Greetings: On Thu, May 3, 2012 at 11:36 AM, Brian G. Peterson br...@braverock.com wrote: On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote: If somebody in R Core would like this and think about putting it, or something like it, into the base, then many chores involving predicted values would become much easier. Why does this need to be in base? Implement it in a package. If it works, and is additive, people will use it. Look at 'reshape' or 'xts' or 'Matrix' just to name a few examples of widely used packages. I can't use it to fix termplot unless it is in base. Or are you suggesting I create my own termplot replacement? Regards, - Brian -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Proposal: model.data
On Thu, 2012-05-03 at 12:09 -0500, Paul Johnson wrote: Greetings: On Thu, May 3, 2012 at 11:36 AM, Brian G. Peterson br...@braverock.com wrote: On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote: If somebody in R Core would like this and think about putting it, or something like it, into the base, then many chores involving predicted values would become much easier. Why does this need to be in base? Implement it in a package. If it works, and is additive, people will use it. Look at 'reshape' or 'xts' or 'Matrix' just to name a few examples of widely used packages. I can't use it to fix termplot unless it is in base. Or are you suggesting I create my own termplot replacement? I was suggesting that you create a package that has all the features that you think it needs. If you have a *patch* for termplot that would fix what you perceive to be its problems, and not break existing code, then the usual method would be to propose that. It seems, though, that you are proposing more significant changes to functionality, and it seems as though that would run a risk of breaking backwards compatibility, which is usually a bad idea. Regards, - Brian -- Brian G. Peterson http://braverock.com/brian/ Ph: 773-459-4973 IM: bgpbraverock __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel