[I am having some problems with returned email. Sorry if you get this twice.]
Well so much for pontificating. My "shining example" has an error too although this error is less serious since it will give an error message. embed(T,width) should be embed(1:T,width). But perhaps there is more to this than we realize since the subsequent (space inefficient) solution which I thought might be taking it too far does not require calculation of T except in embed.data.frame which is a reusable function that would likely be of higher quality through reuse and in any case the fact is that it did avoid this error. Date: Wed, 24 Mar 2004 00:09:19 -0500 (EST) From: Gabor Grothendieck <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Subject: Re: [R] Operating on windows of data Far from being an example where vectorization seems to have only minor advantages, if any, over a non-vectorized approach, it turns out that this is a shining example of why vectorization produces higher quality code. Consider that the loop approach and all the other solutions so far have an error! The last time point is never included in any window. The genesis of the error is that index arithmetic for producing the loop limits and sapply index is sufficiently complex that no one noticed the error. Vectorizing *sufficiently* eliminates the calculation of these limits entirely and so avoids the possibility of mistake. The problem so far is that we just did not take the vectorized approach far enough (mea culpa). The following solution uses embed (which can be regarded as a vectorized sliding window) to get us out of considering such explicit limits. By eliminating the calculation of these limits it avoids any potential for this sort of error in the first place. (It also avoids the transpose by producing data frames at each step and rbinding them.) do.call( "rbind", apply( embed(T, width), 1, function(idx) { model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = A, subset = idx) s <- summary.lm(model) v <- coefficients(model) data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma) } ) ) Note that the above example loops over vectors of indices instead of single indexes. There is an opportunity to do away with indices altogether although it is at the considerable expense of space efficiency so, in practice, one would likely be content with the last solution. Nevertheless, it is of interest to display the next solution even if only for sake of completing the thought. Define embed.data.frame to produce a list of data frames that represents the sliding windows. Perform a lapply over those: embed.data.frame <- function(df,width) apply(embed(1:nrow(df),width),1,function(idx)df[idx,]) do.call( "rbind", lapply(embed.data.frame(df,width), function(df) { model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = df) s <- summary.lm(model) v <- coefficients(model) data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma) } ) ) This gets rid of all reference to idx and nrow(df). --- Date: Wed, 24 Mar 2004 14:56:46 +1200 (NZST) From: Richard A. O'Keefe <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Subject: Re: [R] Operating on windows of data [snip] Hmm. My "first" and "second" ways are both the same: use names rather than position. There is one more clarity improvement to recommend, and it has nothing to do with using or avoiding sapply(), at least not directly. # dfapply(X, FUN, ...) is like sapply() but # expects FUN to return c(x1=...,xn=...) vectors which it # turns into rows of the data frame that it returns. dfapply <- function (...) as.data.frame(t(sapply(...))) # Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE # and rows 1:T-width, the ith row extracted from a linear # regression on cases i:(i+width-1). dat <- dfapply(seq(T-width), function (i) { model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = A, subset = i:(i+width-1)) s <- summary.lm(model) v <- coefficients(model) c(USD = v["dlusdchf"], JPY = v["djpychf"], DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma) }) Now here's where using sapply() instead of 'for' does pay off, even here. We ask the question "where is 'i' used?" Because we're *not* using i in any visible index calculations, there is only one place that 'i' is used, and that's in the subset= argument of the lm() call. That prompts the question "is there any way to exploit the fact that the rest of the linear model is the same? Depending on the relative sizes of A and T-width, there may well be, and Statistical Models in S explains, if memory serves me, how to do this kind of thing. But without the fact that i is only used in one place, it might not be as obvious that it was worth thinking about. ______________________________________________ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html