While the subquery with a temporary table is probably the better option, you could just manually generate the subquery and pass it in with the query. As an example, if you have user_ids 1000-1005, instead of having "... where user_id in (select user_id from r_user_id)", you would have "... where user_id in (1000,1001,1002,1003,1004,1005)". With 2000 user_ids, you'd just have R generate the string, the whole query for that matter, and pass it in. Sure, it's a few thousand characters going into the database as a query, but it sure beats pulling down a few million characters that you're going to just throw away.
I have no idea how the performance of the above method would compare to using a temporary table (on the database side, of course), but I have a hard time believing that it wouldn't be within a factor of two. But I'm just making the suggestion. Coey Avram Aelony writes: > > Dear R list, > > What is the best way to efficiently marry an R dataset with a very large > (Oracle) database table? > > The goal is to only return Oracle table rows that match IDs present in the R > dataset. > I have an R data frame with 2000 user IDs analogous to: r = > data.frame(userid=round(runif(2000)*100000,0)) > > ...and I need to pull data from an Oracle table only for these 2000 IDs. > The Oracle table is quite large. Additionally, the sql query may need to > join to other tables to bring in ancillary fields. > > I currently connect to Oracle via odbc: > > library(RODBC) > connection <- odbcConnect("****", uid="****", pwd="****") > d = sqlQuery(connection, "select userid, x, y, z from largetable where > timestamp > sysdate -7") > > ...allowing me to pull data from the database table into the R object "d" > and then use the R merge function. The problem however is that if "d" is > too large it may fail due to memory limitations or be inefficient. I would > like to push the merge portion to the database and it would be very > convenient if it were possible to request that the query look to the R > object for the ID's to which it should restrict the output. > > Is there a way to do this? > Something like the following fictional code: > d = sqlQuery(connection, "select t.userid, x, y, z from largetable t where > r$userid=t.userid") > > Would sqldf (http://code.google.com/p/sqldf/) help me out here? If so, how? > This would be convenient and help me avoid needing to create a temporary > table to store the R data, join via sql, then return the data back to R. > > I am using R version 2.7.2 (2008-08-25) / i386-pc-mingw32 . > Thanks for your comments, ideas, recommendations. > > > -Avram > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.