While the subquery with a temporary table is probably the better
option, you could just manually generate the subquery and pass it in
with the query.  As an example, if you have user_ids 1000-1005,
instead of having "... where user_id in (select user_id from
r_user_id)", you would have "... where user_id in
(1000,1001,1002,1003,1004,1005)".  With 2000 user_ids, you'd just have
R generate the string, the whole query for that matter, and pass it
in.  Sure, it's a few thousand characters going into the database as a
query, but it sure beats pulling down a few million characters that
you're going to just throw away.

I have no idea how the performance of the above method would compare
to using a temporary table (on the database side, of course), but I
have a hard time believing that it wouldn't be within a factor of
two.  But I'm just making the suggestion.

Coey


Avram Aelony writes:
 > 
 > Dear R list,
 > 
 > What is the best way to efficiently marry an R dataset with a very large 
 > (Oracle) database table?  
 > 
 > The goal is to only return Oracle table rows that match IDs present in the R 
 > dataset.  
 > I have an R data frame with 2000 user IDs analogous to: r = 
 > data.frame(userid=round(runif(2000)*100000,0))
 > 
 > ...and I need to pull data from an Oracle table only for these 2000 IDs.  
 > The Oracle table is quite large. Additionally, the sql query may need to 
 > join to other tables to bring in ancillary fields.
 > 
 > I currently connect to Oracle via odbc: 
 > 
 > library(RODBC)
 > connection <- odbcConnect("****", uid="****", pwd="****")
 > d = sqlQuery(connection, "select userid, x, y, z from largetable where 
 > timestamp > sysdate -7")
 > 
 > ...allowing me to pull data from the database table into the R object "d" 
 > and then use the R merge function.  The problem however is that if "d" is 
 > too large it may fail due to memory limitations or be inefficient.  I would 
 > like to push the merge portion to the database and it would be very 
 > convenient if it were possible to request that the query look to the R 
 > object for the ID's to which it should restrict the output.  
 > 
 > Is there a way to do this?
 > Something like the following fictional code:
 > d = sqlQuery(connection, "select t.userid, x, y, z from largetable t where 
 > r$userid=t.userid")
 > 
 > Would sqldf (http://code.google.com/p/sqldf/) help me out here? If so, how?  
 >  This would be convenient and help me avoid needing to create a temporary 
 > table to store the R data, join via sql, then return the data back to R.
 > 
 > I am using R version 2.7.2 (2008-08-25) / i386-pc-mingw32 . 
 > Thanks for your comments, ideas, recommendations.
 > 
 > 
 > -Avram
 > 
 > ______________________________________________
 > R-help@r-project.org mailing list
 > https://stat.ethz.ch/mailman/listinfo/r-help
 > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 > and provide commented, minimal, self-contained, reproducible code.
 >

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to