[R] Generic functions with models? 'predict', 'residuals', 'fitted', ...?
Hi WizaRds! I don't know if my understanding of models is screwed up or if there's a fundamental limitation to the way they are used in R. Here's what I'd like to do: create a model and then test data against that model using generics like 'predict' and 'residuals'. The problem is that generics don't work for new data sets, except of course 'predict': set.seed(1) df- data.frame(matrix(rnorm(10*4), nc=4)) cn- c(y, x1, x2, x3) colnames(df)- cn model- lm(y ~ x1 + x2 + x3 + 1, data=df) u- residuals(model, data=df) test- data.frame(matrix(rnorm(10*4), nc=4)) colnames(test)- cn y.hat- predict(model, newdata=test) u.new- residuals(model, newdata=test) # I would expect to get the residuals to the 'test' # data fit to 'model' with 'residuals(model, newdata=test)', # alas no: cbind(u, u.new) uu.new 1 0.032996295 0.032996295 2 0.575261650 0.575261650 3 -0.702501425 -0.702501425 4 0.482075538 0.482075538 5 0.300600839 0.300600839 6 -0.980832577 -0.980832577 7 0.265239680 0.265239680 8 -0.341625071 -0.341625071 9 0.363224527 0.363224527 10 0.005560545 0.005560545 # I would expect to get the 'test' fit to 'model' # with 'fitted(model, newdata=test)' but again no: cbind(fitted(model), fitted(model, newdata=test)) [,1][,2] 1 -0.65945011 -0.65945011 2 -0.39161833 -0.39161833 3 -0.13312719 -0.13312719 4 1.11320526 1.11320526 5 0.02890693 0.02890693 6 0.16036419 0.16036419 7 0.22218937 0.22218937 8 1.07994978 1.07994978 9 0.21255683 0.21255683 10 -0.31094893 -0.31094893 I have to do: y.hat- predict(model, newdata=test) # get test residuals: u.new- test$y - y.hat u.new 1 2 3 4 5 6 7 1.3661751 -0.4077999 1.1740108 0.4492772 -1.5865168 -0.7633108 -0.8800892 8 9 10 1.7474658 -0.1016104 2.1081361 to get the residuals. Maybe this is what one is supposed to do? To be fair, the help pages for e.g. 'residuals.lm' don't say that you can do things like 'newdata=test' in: 'residuals(model, newdata=test)' or 'data=test' in: 'residuals(model, data=test)' but that is exactly what I would like to do! Am I missing something here, or is this just the way it is? Thanks guys! Jack. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] plotting NAs
Hi WizaRds, (I've cross-posted to r-sig-finance because a lot of people there use 'zoo' objects). I'm trying to plot 2 time series lined up with par(mfrow=c(2,1)), so that the time index is the same for the two series. The data have monthly frequency. An example will illustrate: # data for these dates (monthly frequency) do not have NAs: require(zoo) monthly.dates- as.Date(c( 1991-01-31, 1991-02-28, 1991-03-31, 1991-04-30, 1991-05-30 )) monthly.only.data- 1:length(monthly.dates) # data for these extra dates (daily frequency) do have NAs interspersed with # non-NA data having monthly frequency: daily.dates- seq(as.Date(1991-06-01), as.Date(1991-07-31), by=1) tmp- zoo(NA, order.by=c(monthly.dates, daily.dates)) z- cbind(tmp, tmp) colnames(z)- NULL monthly.in.daily.data- (1:2)/10 z[index(z) %in% monthly.dates, 1]- monthly.only.data z[index(z) %in% as.Date(c(1991-06-30,1991-07-31)), 2]- monthly.in.daily.data # this creates a block of data like that listed below. Here's my problem: I want to plot lines through all non-NA data. This is straight forward with column 1, but not with column 2. Why? Because the monthly-frequency data in column 2 have NAs interspersed. plot(z, type=l) does not plot a line through the points in column 2. (It's not a problem with 'plot.zoo'; same thing happens if you use 'plot' on regular data like: plot(c(1, NA, 2, 3, NA, 4), type=l) The following works in so far as it will plot the correct picture, tmp- zoo(NA, seq(min(index(z)), max(index(z)), by=1)) z- cbind(tmp, z); z- z[,-1] plot(na.omit(z), type=l) plot(z[!apply(is.na(z), 1, sum)==2,], type=l) but for various reasons too long to go into here I want to know how to plot lines through the non-NA data in objects like: c(1, NA, 2, 3, NA, 4) without having to strip out the NAs with the !apply(is.na(z), 1, sum)==2 hack. What do I do??? Thanks guys! Jack. # p.s. here's that 'zoo' data: z 1991-01-31 1 NA 1991-02-28 2 NA 1991-03-31 3 NA 1991-04-30 4 NA 1991-05-30 5 NA 1991-06-01 NA NA 1991-06-02 NA NA 1991-06-03 NA NA 1991-06-04 NA NA 1991-06-05 NA NA 1991-06-06 NA NA 1991-06-07 NA NA 1991-06-08 NA NA 1991-06-09 NA NA 1991-06-10 NA NA 1991-06-11 NA NA 1991-06-12 NA NA 1991-06-13 NA NA 1991-06-14 NA NA 1991-06-15 NA NA 1991-06-16 NA NA 1991-06-17 NA NA 1991-06-18 NA NA 1991-06-19 NA NA 1991-06-20 NA NA 1991-06-21 NA NA 1991-06-22 NA NA 1991-06-23 NA NA 1991-06-24 NA NA 1991-06-25 NA NA 1991-06-26 NA NA 1991-06-27 NA NA 1991-06-28 NA NA 1991-06-29 NA NA 1991-06-30 NA 0.1 1991-07-01 NA NA 1991-07-02 NA NA 1991-07-03 NA NA 1991-07-04 NA NA 1991-07-05 NA NA 1991-07-06 NA NA 1991-07-07 NA NA 1991-07-08 NA NA 1991-07-09 NA NA 1991-07-10 NA NA 1991-07-11 NA NA 1991-07-12 NA NA 1991-07-13 NA NA 1991-07-14 NA NA 1991-07-15 NA NA 1991-07-16 NA NA 1991-07-17 NA NA 1991-07-18 NA NA 1991-07-19 NA NA 1991-07-20 NA NA 1991-07-21 NA NA 1991-07-22 NA NA 1991-07-23 NA NA 1991-07-24 NA NA 1991-07-25 NA NA 1991-07-26 NA NA 1991-07-27 NA NA 1991-07-28 NA NA 1991-07-29 NA NA 1991-07-30 NA NA 1991-07-31 NA 0.2 - [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] RODBC and Oracle
Hi WizaRds, I'm experiencing a problem connecting to an Oracle 10g database via RODBC (I'm getting this on Microsoft XP). The same SQL queries via PL/SQL Developer work just fine, but when I pump the query through sqlQuery in RODBC then I get a data frame back with 0 rows. I cut the query down alternating between PL/SQL and RODBC until I figured that it's some kind of row limit or buffer limit thing going on. I searched the archives and Brian Ripley suggested playing with believeNRows parameter (setting it to FALSE) as Oracle is not well behaved. I've tried various combinations of this and delving into the lower level functions like odbcQuery and sqlGetResults but I'm still getting the same thing: 0 rows back unless I cut the query down to a more return a more reasonable range. Obviously I can't reproduce the problem here because of the database, but does anyone have any suggestions/tips/pointers to get me going toward a solution? Maybe it's a settings problem in the Microsoft ODBC thing under Control Panel-Administrative Tools-Data Sources Any suggestions on this? All help gladly received. Thanks! Jack. platform i386-pc-mingw32 arch i386 os mingw32 system i386, mingw32 status major 2 minor 5.1 year 2007 month 06 day27 svn rev42083 language R version.string R version 2.5.1 (2007-06-27) - [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Multivariate, multilevel regression?
Dear WizaRds, This is mostly a statistics question, but I'm figuring that R is the right solution (even before I start!) I have some bio data of heart rate versus time (rats taken from resting to maximal heart rate). I want to regress heart rate on time. The data have been normalized such that resting heart rate is zero at time=0, so that all curves intersect at the origin (and at the origin only). The regression function is monotonically increasing, but little is known about it. The data are also strictly ordinal: I have a factor such that lower-order data *must* be on a curve that is offset beneath a higher-order curve. Don't ask where the factors come from...but *given* these factors the assumption, or rather the *constraint*, is that lower orders are better (lower-order rats are fitter rats with better cardiovascular response). Most important is that these curves do not intersect because of these factors (a fitter rat can not have a worse heart-rate response than a less fit rat!). Here's a schematic to show you what I mean. It's very rough, but it gets the point across: seq(0, 1, len=100) f- 1/seq(.3,1,len=6) windows(); plot(t, sqrt(3*t), type='n', xlab='time', ylab='heart rate'); grid() for (i in seq(along=f)) lines(t, sqrt(f[i]*t), col=ceiling(2*i)) Now I'm wondering where I should start and I'm think that this is really not much different from having a y_i ~ x_i | factor_i ... the i^th response curve just like a dummy variable male/female linear regression. But in some way the factors are related (there's a dose behind it, if ya see what I mean), so they are not independent...they're all part of a system (some studies have more juice overall, so the whole system is varying from one group of rats to the next). This means that in some systems the curves will be closer or some curves within the system will be closer together. Here is my question for you guys: what is the best way to model this kind of problem, especially given that each i^th curve could have as few as 3 data points? If I can only vaguely assume the type of the regression function (monotonic, rapidly increasing from the origin kind of like sqrt(t), as above) should it be a parametric or nonparametric regression? What about those errors? Gee! It's hard to assume anything (there're so few of 'em and they're probably heteroskedastic! Where do I begin? I'd gladly accept all references, pointers, and such like. Best, and thanks for R and R-Help, all you Core guys!!! Jack. - [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.