[R] SQL Primer for R
Dear R wizards: I decided to take the advice in the R Data Import/Export manual and want to learn how to work with SQL for large data sets. I am trying SQLite with the DBI and RSQLite database interfaces. Speed is nice. Alas, I am struggling to find a tutorial geared toward the kind of standard operations I would want in R. Simple things:

* how to determine the number of rows in a table. (Of course, I could select all of the data and then count it in R, but that seems wasteful.)
* how to insert a new column into my existing SQL table---say, the rank of another variable---and save it back. Am I supposed to create a new data frame, then save it as a new table, then delete the old SQL table?
* how to save a revised version of my table in a different sort order (with or without deleting the original table). <-- I guess this is not appropriate, as I should think of SQL tables as unordered.

I guess these would make nice little text snippets in the R Data Import/Export manual, too. Help appreciated. regards, /ivo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
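For the archive, a sketch of the standard SQL idioms for all three bullets, written with today's DBI verbs (dbExecute() postdates this thread; the RSQLite of that era used dbGetQuery() for statements as well). The table and column names are made up:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "mytable", data.frame(x = rnorm(10)))

## 1. number of rows, without pulling the data into R:
n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM mytable")$n

## 2. add a new column in place (SQLite supports ALTER TABLE ... ADD COLUMN),
##    rather than rewriting the whole table:
dbExecute(con, "ALTER TABLE mytable ADD COLUMN xrank REAL")

## 3. SQL tables are indeed unordered; ask for the order when reading:
sorted <- dbGetQuery(con, "SELECT * FROM mytable ORDER BY x")

dbDisconnect(con)
```

The ORDER BY route is usually preferable to storing a "sorted" copy of the table.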
[R] R-ish challenge for dependent ranking
Dear R wizards: First, thanks for the notes on SQL. These pointers will make it a lot easier to deal with large data sets. Sorry to have a second short query the same day. I have been staring at this for a while, but I cannot figure out how to do a dependent ranking the R-ish way.

ds <- data.frame(xn = rnorm(32), yn = rnorm(32), zn = rnorm(32))
ds$drank1group <- as.integer((rank(ds$xn) - 1) / 4)  # ok, the first set: 8 groups, each with 4 elements
ds$drank2.bydrank1group <- ???  ## here I want, within each drank1group, the rank based on yn (from 1 to 4)

something like "by(ds, drank1group, rank(ds$yn))". obviously, this neither works nor has output of the same dimension. of course, there is a really simple, clever way to do this in R...except that it totally eludes me. before I start writing a hand-iterating function, could someone please let me know how to do this? regards, /iaw
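For the archive (the follow-up thanks suggest this is what respondents supplied): ave() is the idiomatic tool here. It applies a function within groups and returns a vector in the original length and order:

```r
set.seed(1)
ds <- data.frame(xn = rnorm(32), yn = rnorm(32), zn = rnorm(32))
ds$drank1group <- as.integer((rank(ds$xn) - 1) / 4)   # 8 groups of 4

## within-group rank of yn, aligned with the rows of ds:
ds$drank2.bydrank1group <- ave(ds$yn, ds$drank1group, FUN = rank)

## every group should contain exactly the ranks 1..4:
chk <- all(tapply(ds$drank2.bydrank1group, ds$drank1group,
                  function(r) all(sort(r) == 1:4)))
chk   # TRUE
```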
Re: [R] R-ish challenge for dependent ranking
thank you everybody, again. regards, /iaw
Re: [R] SQL Primer for R
Sorry, chaps. I need one more:

> dbDisconnect(con.in)
Error in sqliteCloseConnection(conn, ...) :
  RS-DBI driver: (close the pending result sets before closing this connection)

I am pretty sure I have fetched everything there is to be fetched. I am not sure what I need to do to say goodbye (or to find out what is still pending). ?dbDisconnect doesn't tell me.

PS: the documentation for dbConnect should probably add dbDisconnect to its 'See also' section.

regards, /iaw

Really irrelevant PS: the "by" function could keep the number of observations that go into each category. I know it can be computed separately, which is what I am doing now.
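For the archive: the usual cause is a result set that was fully fetched but never released. dbClearResult() must be called before dbDisconnect(). A minimal sketch (the table name is made up; dbFetch() was fetch() in the DBI of that era):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "tbl", data.frame(x = 1:5))

res <- dbSendQuery(con, "SELECT x FROM tbl")
dat <- dbFetch(res)    # fetch everything ...
dbClearResult(res)     # ... but the result set must still be closed explicitly
dbDisconnect(con)      # now this succeeds without the RS-DBI error
```

Note that dbGetQuery() fetches and clears in one step, which sidesteps the problem entirely.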
[R] error instead of warning?
dear R experts---is it possible to ask R to abort with an error instead of just giving a warning when I am mis-assigning vectors (or other data structures) that are not compatible? that is, I would like

1: In matrix(value, n, p) ... :
  data length [12] is not a sub-multiple or multiple of the number of columns [11]

to be forced into an error. are there any other warnings() that are really more programming errors that I could also convert into an abort? sincerely, /iaw
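For the archive (the answer a later post in this thread credits): options(warn = 2) promotes every warning to an error, which then stops execution or can be caught like any other error:

```r
options(warn = 2)   # every warning now aborts with an error
r <- tryCatch(matrix(1:12, ncol = 11),        # recycling warning -> error
              error = function(e) "caught as error")
options(warn = 0)   # restore the default
r
```

There is no built-in way to escalate only a subset of warnings, so warn = 2 is an all-or-nothing switch.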
Re: [R] SQL Primer for R
stumped again by SQL... If I have a table named "main" in an SQLite database, how do I get the names of all its columns? (I have a MySQL book that claims the SHOW command does this sort of thing, but it does not seem to work on SQLite.) regards, /iaw

PS: Thanks for the earlier emails on "warn=2".
Re: [R] SQL Primer for R
wow! the answer seems to be "pragma table_info(main);" thanks, Gabor.
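For completeness, a sketch showing both routes to the column names in RSQLite (the table name `prices` is made up; DBI's dbListFields() avoids the SQLite-specific pragma):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "prices", data.frame(a = 1, b = 2))

info <- dbGetQuery(con, "PRAGMA table_info(prices)")  # SQLite's own answer
flds <- dbListFields(con, "prices")                   # the portable DBI shortcut

info$name   # "a" "b"
dbDisconnect(con)
```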
[R] debugging
dear R wizards---I am not sure at what point I owe penance for asking so many questions. I am now wrangling with debugging. I want to write a function

assert <- function(condition, ...) {
  if (!condition) {
    cat(...); cat("\n");
    browser();
  }
  stopifnot(condition);
}

assert(nrow(ds) == 12, "My data set has ", nrow(ds), " rows, which is a bad error.");

(Please ignore my semicolons.) Then, having invoked browser(), I would like to be able to move up one stack frame by hand, so that I can look at my variables there. Alas, two "little" problems. First, the cat() does not seem to work: it seems to print the arguments themselves rather than the value of nrow(ds). Second, I have no idea how to move up one stack frame to the calling function so that I can examine better what went wrong. Is this possible? As always, advice is highly appreciated. Regards, /ivo
[R] email just sent
please ignore part 1. of course, the cat() works; my mistake. I just need to learn how to step up frames, please. regards, /iaw
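For the archive: the standard answer to the frame-stepping question is recover(), which presents a numbered menu of all active frames and lets you browse any of them; options(error = recover) triggers it automatically on any error. The frame stack recover() walks is also reachable programmatically, as this non-interactive sketch shows:

```r
f <- function() {
  x_in_f <- 42   # a variable living in f's frame
  g()
}
g <- function() {
  ## one frame up from g() is f(); its variables can be read from there:
  get("x_in_f", envir = parent.frame())
}
val <- f()
val   # 42
```

Inside a browser() prompt, calling recover() gives the same menu of frames to inspect.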
[R] basic boxplot questions
dear R experts: I am playing with boxplots for the first time. most of it is intuitive, although there was less info on the web than I had hoped. alas, for some odd reason, my R boxplots have some fat black dots, not just the hollow outlier circles. is there a description somewhere of when R draws hollow vs. fat dots? [and what is the parameter to change just the size of these dots?]

also, let me show my fundamental ignorance: I am a little surprised that the average boxplot would not show the mean and standard deviation, too, at least optionally. is there a common way to accomplish this (e.g., in a different color), or do I just construct it myself with standard R graphics points()/lines() commands?

advice appreciated. regards, /iaw
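A sketch of the usual remedies (parameter names are from ?boxplot and ?bxp): the outlier symbol and its size are controlled with outpch/outcex, and means can simply be overlaid with points() afterwards. The data here are made up:

```r
set.seed(2)
x <- list(a = rnorm(50), b = rnorm(50, mean = 1))

## hollow circles for outliers, at half size:
bp <- boxplot(x, outpch = 1, outcex = 0.5)

## overlay the group means as red diamonds:
points(seq_along(x), sapply(x, mean), pch = 18, col = "red")
```

Solid-dot outliers typically come from a pch setting such as 16 or 19 having been applied; outpch overrides it for the outlier marks only.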
[R] R 2.70 + ps2pdf14
dear R graphics experts---if anyone is running the combination of R 2.7.0 and ghostscript (2.62), could you please run the following and let me know if you get the same strange symbol size that I do, or if there is something weird on my system? regards, /ivo

pdf(file = "testhere.PDF", version="1.4", pointsize=14);
plot(0, xlim=c(0,26), ylim=c(-1,4.5), type="n");
text(10, -0.5, "line 1 is plain, line 2 should be the same, except blue\nline 3 should be 1.5 times the size of line 1, but otherwise the same\nline 4 should be like line 3 except blue.\n\nthis gets weird symbol sizes at certain spots after ps2pdf14 is applied\n --- R output or ps2pdf error?", cex=0.5);
points(1:25, rep(1,25), pch=1:25);
points(1:25, rep(2,25), pch=1:25, col="blue");
points(1:25, rep(3,25), pch=1:25, cex=1.5);
points(1:25, rep(4,25), pch=1:25, col="blue", cex=1.5);
text( 1, 0.2, "1 != 2", srt=90, cex=0.5);
text(16, 0.2, "1 != 2", srt=90, cex=0.5);
text(19, 0.2, "3 != 4, 1 != 2", srt=90, cex=0.5);
text(20, 0.2, "(1,2) > (3,4)", srt=90, cex=0.5);
# text(20, 0.4, "shrunk on cex=1.5", srt=90);
dev.off();
retcode <- system("ps2pdf14 testhere.PDF");
# now look at testhere.PDF.pdf, which is the ps2pdf14 output
Re: [R] R 2.70 + ps2pdf14
thanks, berwin. yes, I meant ghostscript 8.62, of course. ps2pdf14 is the equivalent of a distiller and is needed to embed fonts. (R does not embed fonts itself, afaik.) if someone knows another way to embed all the fonts, I would love to know, so that I can avoid ps2pdf14 altogether (not just in this example, but generally; I use Lucida fonts most of the time).

if developers from the R graphics group are reading this: given that this strange output is not just my imagination, maybe it would be worthwhile to see if the R pdf output could be made more robust to avoid this "feature." I stumbled onto it deep in a program, and spent an afternoon distilling it down to the R script that I posted. It was quite puzzling.

PS: semicolons are a hobby, and one explicitly allowed by R. ;-). regards, /ivo

On Sun, May 18, 2008 at 4:18 AM, Berwin A Turlach <[EMAIL PROTECTED]> wrote:
> G'day Ivo,
>
> On Sat, 17 May 2008 21:33:35 -0400
> "ivo welch" <[EMAIL PROTECTED]> wrote:
>
>> dear R graphics experts---
>
> Not belonging to this group, but can confirm that I can see the same,
> in particular the circles are changing their size.
>
> However, I am a bit surprised that you run ps2pdf14 on a PDF file;
> according to the documentation the input should be a (E)PS file.
>
>> if anyone is running the combination of R 2.7.0 and ghostscript
>> (2.62),
>
> and I guess you mean ghostscript 8.62? Ghostscript 2.62 would be
> really ancient, probably from before the time that PDF was created... :)
>
>> could you please run the following
>
> I will also leave it to somebody else on the list, specifically to
> people who find such coding particularly ugly if not offensive, to point
> out that semicolons are not needed at the end of lines of R scripts. :)
>
> HTH.
> Cheers,
> Berwin
>
> === Full address ===
> Berwin A Turlach                             Tel.: +65 6515 4416 (secr)
> Dept of Statistics and Applied Probability         +65 6515 6650 (self)
> Faculty of Science                           FAX : +65 6872 3919
> National University of Singapore
> 6 Science Drive 2, Blk S16, Level 7        e-mail: [EMAIL PROTECTED]
> Singapore 117546           http://www.stat.nus.edu.sg/~statba
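One pointer for later readers: contrary to the "afaik" above, modern R can embed fonts itself via grDevices::embedFonts(), although it shells out to ghostscript to do the work, so gs must be on the PATH. A sketch:

```r
## produce a pdf, then ask R to embed its fonts:
pdf("fig.pdf")
plot(1:10, main = "font embedding test")
dev.off()

## embedFonts() calls ghostscript under the hood; wrapped in try() here
## in case gs is not installed:
res <- try(embedFonts("fig.pdf", outfile = "fig-embedded.pdf"), silent = TRUE)
```

This avoids the separate ps2pdf14 round trip, though it is subject to the same ghostscript behavior.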
Re: [R] R 2.70 + ps2pdf14
thanks. I am now using R-patched 2008-05-18 r45723. This is probably intended, but if not, I wanted to note it briefly: on the pdf output device, symbol 1 is always black, no matter what color is selected. symbols 10 and 13 contain black. symbol 19 is the replacement for symbol 1 that takes on the color. forgive the semicolons:

pdf.start("test");  # just encapsulates what you would expect.
NM <- 25;
plot(0, type="n", ylim=c(0,6), xlim=c(0,NM), xlab="0-8", ylab="0-5");
points(1:NM, rep(1,NM), pch=1:NM, col="black");
points(1:NM, rep(2,NM), pch=1:NM, col="green");
text(1:NM, rep(3,NM), 1:NM, col=1:NM, cex=0.75);
pdf.end();

/iaw

On Sun, May 18, 2008 at 12:51 PM, hadley wickham <[EMAIL PROTECTED]> wrote:
>> if developers from the R graphics group are reading this, given that
>> this strange output is not just my imagination, maybe it would be
>> worthwhile to see if the R output pdf could be made more robust to
>> avoid this "feature." I stumbled onto it deep in a program, and spent
>> an afternoon distilling it down to the R script that I posted. It was
>> quite puzzling.
>
> Have you tried R-patched? I think Brian Ripley fixed this some days ago.
>
> Hadley
>
> --
> http://had.co.nz/
[R] x86 SSE* Pointer Favors
Dear Statisticians---This is not even an R question, so please forgive me. I have so much ignorance in this matter that I do not know where to begin. I hope someone can point me to documentation and/or a sample. I want to compute a covariance as quickly as non-humanly possible on an Intel Core processor (up to SSE4) under linux. Alas, I have no idea how to engage CPU vectorization. Do I need to use special data types, or is "double" correct? Does SSE* understand NaN? Should I rely on gcc's autodetection of the vectorized meaning of my code, or are there specific libraries that I should call? What I want to learn about is as simple as it gets:

typedef double Double;  // or whatever SSE* needs as a close equivalent
Double vector1[N], vector2[N];
// then fill them with stuff.
vector3 = vector_mult(vector1, vector2, N);
vector4 = sum(vector1, N);

I just need a pointer and/or primer.

PS: If someone knows of a superfast vectorized implementation of Gentleman's WLS algorithm, please point me to it, too. I am still using my old non-vectorized C routines.

if this email offends as spam, apologies. regards, /iaw
[R] recursive beta with cutoffs on large data set
dear R experts: I have an academic question that borders on asking for consulting help, so I hope I am not too imposing. If I am, please ignore me.

I have a 100MB data set of daily stock returns. I want to compute rolling (recursive?) betas---either bivariate or multivariate---with respect to some other time series. Many of these regressions are "take away the first observation, add one observation at the end," which means I really have only about 30,000 unique regressions---still, quite a good number. Worse, I want to winsorize the rolling y-vector at different levels (99%&1%, 98%&2%, ...), so I want to repeat this procedure a few hundred times at different winsorization levels. The most important version of my task is bivariate regressions, which may mean that I don't even need the multivariate overhead.

I was even thinking of coding in C rather than R for speed's sake, but I am now thinking that learning the intricacies of fast vector processing on x86 processors is so difficult that I would be done running in R before I would be done programming it in C. Has anyone done something like this? Any recommendations for what could give me the high speed I probably need for a task like this? Any thoughts? (I am right now working on getting blas-atlas to compile on my gentoo system. It just died in the compilation over something.) regards, /ivo
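On the "take away one, add one" structure: rolling-window slopes can be computed in O(1) per window from running sums, entirely in vectorized R, with no C needed. A sketch for the bivariate case (window width and data are made up):

```r
set.seed(3)
n <- 1000; w <- 60                       # w = rolling-window width (an assumption)
x <- rnorm(n); y <- 0.8 * x + rnorm(n)

## rolling-window sums as differences of cumulative sums:
wsum <- function(v, w) {
  cs <- cumsum(v)
  cs[w:length(v)] - c(0, cs[seq_len(length(v) - w)])
}
Sx  <- wsum(x, w);     Sy  <- wsum(y, w)
Sxx <- wsum(x * x, w); Sxy <- wsum(x * y, w)

## OLS slope (with intercept) for every window ending at t = w..n:
beta <- (w * Sxy - Sx * Sy) / (w * Sxx - Sx^2)

## sanity check against lm() on the last window:
b.lm <- unname(coef(lm(y[(n - w + 1):n] ~ x[(n - w + 1):n]))[2])
all.equal(b.lm, beta[length(beta)])   # TRUE
```

Winsorizing y changes the y-side sums only, so each winsorization level needs just a fresh Sy and Sxy pass.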
[R] very fast OLS regression?
Dear R experts: I just tried some simple tests that told me that hand-computing the OLS coefficients is about 3-10 times as fast as using the built-in lm() function. (code included below.) Most of the time, I do not care, because I like the convenience, and I presume some of the time goes into saving a lot of stuff that I may or may not need. But when I do want to learn the properties of an estimator whose input contains a regression, I do care about speed. What is the recommended fastest way to get regression coefficients in R? (Is Gentleman's weighted-least-squares algorithm implemented in a low-level C form somewhere? that one was always lightning fast for me.) regards, /ivo

bybuiltin <- function(y, x) coef(lm(y ~ x - 1));

byhand <- function(y, x) {
  xy  <- t(x) %*% y;
  xxi <- solve(t(x) %*% x);
  b   <- as.vector(xxi %*% xy);
  ## I will need these later, too:
  ## res <- y - as.vector(x %*% b);
  ## soa[i] <- b[2];
  ## sigmas[i] <- sd(res);
  b;
}

MC <- 500; N <- 1; set.seed(0);
x <- matrix(rnorm(N*MC), nrow=N, ncol=MC);
y <- matrix(rnorm(N*MC), nrow=N, ncol=MC);

ptm <- proc.time();
for (mc in 1:MC) byhand(y[,mc], x[,mc]);
cat("By hand took ", proc.time()-ptm, "\n");

ptm <- proc.time();
for (mc in 1:MC) bybuiltin(y[,mc], x[,mc]);
cat("By built-in took ", proc.time()-ptm, "\n");
Re: [R] very fast OLS regression?
thanks, dimitris. I also added Bill Dunlap's "solve(qr(x), y)" function as ols5. here is what I get in terms of speed on a Mac Pro:

ols1  6.779  3.591  10.370
ols2  0.515  0.210   0.725
ols3  0.576  0.403   0.971
ols4  1.143  1.251   2.395
ols5  0.683  0.565   1.248

so the naive matrix operations are fastest. I would have thought that the alternatives to the naive stuff I learned in my linear algebra course would be quicker. still, ols3 and ols5 are competitive. the built-in lm() is really problematic. is ols3 (or perhaps even ols5) preferable in terms of accuracy? I think I can deal with a 20% slow-down (but not with a factor-10 slow-down). regards, /iaw

On Wed, Mar 25, 2009 at 5:11 PM, Dimitris Rizopoulos wrote:
> check the following options:
>
> ols1 <- function (y, x) {
>   coef(lm(y ~ x - 1))
> }
>
> ols2 <- function (y, x) {
>   xy <- t(x) %*% y
>   xxi <- solve(t(x) %*% x)
>   b <- as.vector(xxi %*% xy)
>   b
> }
>
> ols3 <- function (y, x) {
>   XtX <- crossprod(x)
>   Xty <- crossprod(x, y)
>   solve(XtX, Xty)
> }
>
> ols4 <- function (y, x) {
>   lm.fit(x, y)$coefficients
> }
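On the accuracy question: ols2/ols3 solve the normal equations, which squares the condition number of x; the QR route (ols4/ols5, and lm() itself) does not, so ols5 is generally the safer of the fast options. A small near-collinear example of the difference (data made up):

```r
set.seed(4)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + 1e-6 * rnorm(n)          # nearly collinear second column
X  <- cbind(x1, x2)
y  <- drop(X %*% c(1, 1)) + rnorm(n)

## QR (the ols5 route): numerically stable
b.qr <- solve(qr(X), y)

## normal equations (the ols3 route): condition number is squared here,
## so the solve may fail outright or return badly distorted coefficients
b.ne <- try(solve(crossprod(X), crossprod(X, y)), silent = TRUE)
```

With well-conditioned regressors the two agree to many digits, which is why the normal equations look fine in routine benchmarks.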
[R] package installation on OSX --- suggestion
dear R experts: I am trying to install packages in OSX, R 2.8.1. Since I do this about every 2 years, I have completely forgotten how. However, this should not be difficult: http://wiki.r-project.org/rwiki/doku.php?id=getting-started:installation:packages nice document. beautiful method. so, I start with update.packages().

the final message tells me that it saved all the packages into /var/folders/Ia/IaQbr8K+GQ8DqdaGMAC18yU/-Tmp-/RtmpjRkMV7/downloaded_packages/ . not exactly user-friendly. at this point, I don't know whether they were also installed or just downloaded. the same happens when I do an install.packages("plm", dependencies=TRUE).

would it not make sense for the package to be installed in the standard R library location at this point, and for the final message to tell me that the package was indeed installed, rather than about the temporary directory? [I suspect that it actually did the install, so this is just a "final message" issue.] just a suggestion...

[and thanks everybody for all the help yesterday. now back to my moments.] regards, /ivo
Re: [R] pgmm (blundell-bond) help needed
I have been playing with more examples, and I now know that with larger NF's my example code actually produces a result, instead of a singular-matrix error. interestingly, stata's xtabond2 command seems ok with these sorts of data sets. either R has more stringent requirements, or stata is too casual. in any case, I find it strange that Blundell-Bond would not work on data sets in which N=20 and T=10 and there is only one parameter to estimate. there should be more than enough degrees of freedom. I will experiment more with it. regards, /iaw
[R] Hurwicz Bias Correction
Dear Experts---Sorry, I need some help again. I need a very fast estimator for small-sample time series in which the autocoefficient can be anything between 0 and 2 (i.e., even beyond the unit root). I think this means that I will need to run OLS. Of course, this means that I will run into the Hurwicz bias. So I am wondering whether there is a reasonably fast approximate correction for the autocoefficient, presumably as a function of N, Var(x), and the estimated a, b, and Var(e). Even a function with some reasonable amount of table lookup would be ok. (I have searched google and found nothing.) Pointers appreciated. sincerely, /iaw
[R] static variable?
dear R experts: does R have "static" variables that are local to functions? I know that they are usually better avoided (although they are better than globals). However, I would like to have a function print how often it has been invoked, or at least print its name only once to STDOUT when it is invoked many times. possible without <<- ? sincerely, /iaw
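For the archive, the idiomatic answer is a closure: a local() environment plays the role of C's static storage. Strictly speaking this still uses <<-, but it assigns into the closure's own environment rather than into any global. A sketch:

```r
counted.hello <- local({
  n.calls <- 0                       # the "static" variable
  function(tag = "hello") {
    n.calls <<- n.calls + 1          # <<- finds n.calls in the enclosing env, not globally
    if (n.calls == 1) cat(tag, "(first call)\n")
    invisible(n.calls)
  }
})

counted.hello(); counted.hello(); counted.hello()
counted.hello()    # prints nothing after the first call, but keeps counting
```

An equivalent trick stores the counter in an attribute of the function's own environment; the closure form above is the most common.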
[R] pointers needed to expected values of fractions
I apologize in advance for a more statistical question. I am trying to find out whether a transformation z = g(X,Y) of two random variables X and Y exists whose expected value is E(X)/E(Y). obviously, it ain't E(X/Y). is there a book or place where I could learn this? (Also, I would be interested to learn more about the properties of E(X/Y) if they have been worked out, and not just when X and Y are independent; so if there is a book for this one, I would again be quite interested.) this is not an R topic, so please email me directly if you know where I could look. thanks in advance. again, apologies for the clutter... /iaw
[R] constrOptim parameters
Dear R wizards: I am playing (and struggling) with the example in the constrOptim function. simple example: let's say I want to constrain my variables to be within -1 and 1. I believe I want a whole lot of constraints where ci is -1 and ui is either -1 or 1. That is, I have 2*N constraints. Should the following work?

N <- 10
x <- rep(1:N)
ci <- rep(-1, 2*N)
ui <- c(rep(1, N), rep(-1, N))
constrOptim(x, f, NULL, ui, ci, method="Nelder-Mead")

actually, my suggestion would be to give an example in the constrOptim docs where the number of constraints is something like in this example. the current ones have 2*2 constraints, so it is harder to figure out the appropriate dimensions for different cases by extending the examples. on another note, the "non-conformable arguments" error could be a little more informative, telling the end user what the two incompatible dimensions actually are. this is not hard to find out by hand, but it would still be useful. regards, /iaw

-- Ivo Welch (ivo.we...@brown.edu, ivo.we...@gmail.com)
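For the record, ui must be a (2N x N) matrix, one row per constraint, with feasibility defined as ui %*% theta - ci >= 0; for the box -1 <= theta_i <= 1 that is rbind(diag(N), -diag(N)) with ci = rep(-1, 2*N), and the starting value must be strictly feasible. A sketch with a made-up objective function:

```r
N  <- 10
f  <- function(theta) sum((theta - 0.5)^2)   # hypothetical objective, minimum at 0.5
ui <- rbind(diag(N), -diag(N))   # rows: theta_i >= -1 and -theta_i >= -1 (i.e. theta_i <= 1)
ci <- rep(-1, 2 * N)
start <- rep(0, N)               # strictly feasible: ui %*% start - ci > 0 everywhere

fit <- constrOptim(start, f, grad = NULL, ui = ui, ci = ci)  # Nelder-Mead when grad is NULL
range(fit$par)   # all components stay inside (-1, 1)
```

The "non-conformable arguments" error in the original attempt comes from passing ui as a plain vector rather than a matrix.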
[R] 64-bit OSX binary for 2.9.2
dear R wizards: I am looking for a binary package distribution of R 2.9.2 for OSX. Looking at http://r.research.att.com/ , there seems to be only a binary for 2.9.0. is the 2.9.2 version binary package available somewhere? (at this point, would it make sense to elevate the 64-bit version to a "standard recommended" rather than just a "boutique" version?) sincerely, /iaw

-- Ivo Welch (ivo.we...@brown.edu, ivo.we...@gmail.com)
[R] Location of Packages?
Sorry, one more: on OSX, I deleted my old 2.9.2 R.app, and installed the 64-bit version of 2.9.0. I then did an install.packages("car") under my new 2.9.0. It seems to have worked, but alas, I still get an error that package 'car' was built under R version 2.9.2. Where exactly does R under OSX install its packages? (is it a bug that another car is loaded?)

PS: do I need to install the car package under the 64-bit version, or will it be seen by the 64-bit version if I do a 32-bit install? Or do I need to do a double install? for safety, I did it under both the command-line version, which I presume is still 32-bit, and the 64-bit GUI.

PPS: how do I learn which version of R is running? regards, /iaw

-- Ivo Welch (ivo.we...@brown.edu, ivo.we...@gmail.com)
Re: [R] Location of Packages?
thanks, everyone. I was a bit confused. I now think that the "car" error arose because the package on the CRAN website itself was built under 2.9.2, not because I had an old version lying around which my R continued to use instead of a newer version I had just installed.

I was also ambiguous in asking about the version. Sorry. I did not mean the version number of R, but the bit version. the answer is that the two methods are sessionInfo() and bit64 <- ifelse(.Machine$sizeof.pointer == 8, T, F). It would be nice if the standard R startup message stated whether the version is 64-bit or 32-bit, but this is just a suggestion.

now, all I need is a more recently packaged R version than 2.9.0. can I ask who maintains http://r.research.att.com/ , so I can send a short email to this person? regards, /iaw
[R] fastest OLS w/ NA's and need for SE's
dear R wizards: apologies for two queries in one day. I have a long-form data set, which identifies about 5,000 regressions, each with about 1,000 observations:

unit   date      y  x
1      20060101
1      20060102
...
5000   20081230
5000   20081231

I need to run such regressions many, many times, because they are part of an optimization; thus, getting my code to be fast is paramount. I will need to pick off the 5,000 coefficients on x (i.e., b) and the standard errors of the b's. I can ignore the 5,000 intercepts.

by(dataset, as.factor(dataset$unit), function(d) coef(lm(y ~ x, data=d)))

gives me the coefficients. of course, I could use the summary method on lm to pick off the coefficient standard errors, too. my guess is that this would be slow. I think the alternative would be to delete all NAs first and then use a building-block function (such as lm.fit(), or solve(qr(), y)). this would be fast for getting the coefficients, but I wonder whether there is a *FAST* way to obtain the standard error of b. (I do know slow ways, but this would defeat the purpose.) is this the right idea? or will I just end up with more code but not more speed than I would with summary(lm())? can someone tell me the "fastest" way to generate b and se(b)? is there anything else that comes to mind as a recommended way to speed this up in R, short of writing everything in C? as always, advice highly appreciated. /iaw

-- Ivo Welch (ivo.we...@brown.edu, ivo.we...@gmail.com)
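For the archive: after dropping NAs, both b and se(b) come out of a few vectorized sums; nothing from lm() or summary() is needed per group. A sketch for one group's bivariate regression with intercept (data made up; the summary(lm()) call is only there to show the numbers agree):

```r
fast.b.se <- function(y, x) {
  ok <- !(is.na(y) | is.na(x)); y <- y[ok]; x <- x[ok]
  n  <- length(y)
  xc <- x - mean(x); yc <- y - mean(y)        # demeaning absorbs the intercept
  b  <- sum(xc * yc) / sum(xc * xc)           # slope
  res <- yc - b * xc
  se <- sqrt(sum(res^2) / (n - 2) / sum(xc * xc))  # classical se of the slope
  c(b = b, se = se)
}

set.seed(5)
x <- rnorm(1000); y <- 1 + 2 * x + rnorm(1000)
y[sample(1000, 10)] <- NA                     # some missing observations

fast <- fast.b.se(y, x)
slow <- summary(lm(y ~ x))$coefficients["x", 1:2]
chk  <- all.equal(unname(fast), unname(slow))
chk   # TRUE
```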
[R] why is nrow() so slow?
dear R wizards: here is the strange question for the day. It seems to me that nrow() is very slow. Let me explain what I mean:

ds <- data.frame(NA, x=rnorm(10000))  ## a sample data set

> system.time({ for (i in 1:10000) NA })  ## doing nothing takes virtually no time
   user  system elapsed
  0.000   0.000   0.001

## this is something that should take time; we need to add 10,000 values 10,000 times
> system.time({ for (i in 1:10000) mean(ds$x) })
   user  system elapsed
  0.416   0.001   0.416

## alas, this should be very fast. it is just reading off an attribute of ds.
## yet it takes about a third of the time of mean()!
> system.time({ for (i in 1:10000) nrow(ds) })
   user  system elapsed
  0.124   0.001   0.125

## here is an alternative way to implement nrow, which is already much faster:
> system.time({ for (i in 1:10000) length(ds$x) })
   user  system elapsed
  0.041   0.000   0.041

is there a faster way to learn how big a data frame is? I know this sounds silly, but this is inside a "by" statement, where I figure out how many observations are in each subset. strangely, this takes a whole lot of time. I don't believe it is possible to ask "by" to attach an attribute to the data frame that stores the number of observations that it is actually passing. pointers appreciated. regards, /iaw

-- Ivo Welch (ivo.we...@brown.edu, ivo.we...@gmail.com)
Re: [R] why is nrow() so slow?
hi david---no, this time I actually know what I was asking ( ;-) ). I do need the speed, computed on many data sets, each of which is created by a "by" statement. so, no iterative programming on my side.

thanks, hadley, for the pointer to .row_names_info() in lieu of dim() or nrow(). I don't seem to understand the second (type) argument, despite reading the docs, but all of them are giving the same answer on my data frames. so, I guess I will stick to "2" for the time being. regards, /iaw
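On the type argument of .row_names_info(): per ?row.names, type = 0 returns the internal row-names attribute itself, type = 2 the implied number of rows, and type = 1 the same count but with a negative sign when the row names are "automatic" (compactly stored), so "2" is indeed the right choice here. A sketch:

```r
ds <- data.frame(x = rnorm(100))

n2 <- .row_names_info(ds, 2L)   # the number of rows: 100
n1 <- .row_names_info(ds, 1L)   # same count, negative because the row names are automatic
c(n1, n2)
```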
Re: [R] why is nrow() so slow?
interestingly, in my case, the opposite seems to be the case. data frames seem faster than matrices when it comes to "by" computation (which is where most of my calculations are):

### here is my data frame and some information about it
> dim(rets.subset)
[1] 132508      3
> names(rets.subset)
[1] "PERMNO" "RET"    "mdate"
> length(unique(as.factor(rets.subset$PERMNO)))
[1] 6832
> length(as.factor(rets.subset$PERMNO))
[1] 132508

### calculation using the data frame
> system.time({ by(rets.subset, as.factor(rets.subset$PERMNO), mean) })
   user  system elapsed
  3.295   2.798   6.095

### same as a matrix
> m <- as.matrix(rets.subset)
> system.time({ a <- by(m, as.factor(m[,1]), mean) })
   user  system elapsed
  5.371   5.557  10.928

PS: Any speed suggestions are appreciated. This is "experimenting time" for me.

> One note: if you're worried about speed, it almost always makes sense to use
> matrices rather than dataframes. If you've got mixed types this is tedious and
> error-prone (each type needs to be in a separate matrix), but if your data is
> all numeric, it's very simple, and will make things a lot faster.
>
> Duncan Murdoch

-- Ivo Welch (ivo.we...@brown.edu, ivo.we...@gmail.com)
CV Starr Professor of Economics (Finance), Brown University
http://welch.econ.brown.edu/
[R] cluster-lite
I am about to write a "cluster-lite" R solution for myself. I wanted to know whether it already exists. If not, I will probably write up how I do this, and I will make the code available.

Background: we have various linux and OSX systems, which are networked, but not set up as a cluster. I have no one here to set up a cluster, so I need a "hack" that facilitates parallel programming on standard networked machines. I have accounts on all the machines, ssh access (password-less, of course), and networked file-directory access.

what I am ultimately trying to accomplish is built around a "simple" function that my master program would invoke:

master.R:

multisystem(c("R slv.R 1 20 file1.out",
              "R slv.R 21 40 file2.out",
              "ssh anotherhost R slv.R 41 80 file3.out"),
            announce=300)

multisystem() should submit all jobs simultaneously and continue only after all are completed. it should also tell me every 300 seconds what jobs it is still waiting for, and which have completed.

with basically no logic in the cluster, my master and slave programs have to make up for it. master.R must have the smarts to know where it can spawn jobs and how big each job should be. slv.R must have the smarts to place its outputs into the marked files on the networked file directory. master.R needs the smarts to combine the outputs of all jobs, and to resubmit jobs that did not complete successfully.

again, the main reason for doing all of this is to avoid setting up a cluster across OSX and linux systems, and still to make parallel processing across linux/osx as easy as possible. I don't think it gets much simpler than this.

now, I know how to write multisystem() in perl, but not in R. so, if I roll it myself, I will probably rely on a mixed R/perl system here. This is not desirable, but it is the only way I know how to do this. if something like multisystem() already exists in R natively, please let me know and save me from reinventing the wheel. if it does not, some perl/R combo soon will.
regards, /iaw

-- Ivo Welch (ivo.we...@brown.edu, ivo.we...@gmail.com)
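For the archive, a minimal all-R multisystem() along the lines requested: launch each command in a background shell with system(wait = FALSE), have each command touch a sentinel file on completion, and poll until all sentinels exist. Everything here (the sentinel scheme, the polling loop) is my own sketch, Unix-only, not an existing API:

```r
multisystem <- function(cmds, announce = 300) {
  done <- replicate(length(cmds), tempfile(fileext = ".done"))
  for (i in seq_along(cmds))   # run each command in a background subshell,
                               # touching its sentinel when the command finishes
    system(sprintf("(%s; touch %s)", cmds[i], done[i]), wait = FALSE)
  repeat {
    finished <- file.exists(done)
    if (all(finished)) break
    cat("still waiting for:", cmds[!finished], "\n")
    Sys.sleep(announce)
  }
  invisible(file.remove(done))   # clean up; TRUE per job on success
}

## toy usage with three local "jobs":
ok <- multisystem(c("sleep 1", "sleep 2", "sleep 1"), announce = 1)
```

Detecting jobs that died (for resubmission) would need the sentinel to record the exit status as well, e.g. `touch` only on success; that extension is left to the master.R logic described above.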