Sorry for the lack of details. Since I run the same SQL first directly on MySQL (using the MySQL Query Browser) and then again using R through the RJDBC interface, I assume that I won't simply have a badly constructed SQL query. However, just to clear possible objection, here the SQL:
# Extracts vector of data points getData <- function(connection) { queryStart <- "SELECT id1, id2, x, y FROM `mytable` " queryEnd <- ";" query <- paste(queryStart, " WHERE id1 IN(", id1s, ") AND id2 IN(", id2s, ") AND subtype='TYPE1'", queryEnd) # execute query data = dbGetQuery(connection, query) return(data) } When running this method using either RGUI or the command line, I have a runtime that reaches an incredible 10 minutes (!) for selecting about 50k - 80k data points (which I consider not much) based on the range of IDs I choose. The table size is about 5-8 million data points total. The same SQL query directly executed in MySQL Query Browser takes about 20 seconds which I would consider fine. There are no indices created for any of the fields but since the query runs a lot faster in the query browser I don't suspect this to be the main reason. Any ideas? Best, Ralf On Wed, Jun 23, 2010 at 4:36 PM, James W. MacDonald <jmac...@med.umich.edu> wrote: > Hi Ralf, > > Ralf B wrote: >> >> I am running a simple SQL SELECT statement that involvs 50k + data >> points using R and the RJDBC interface. I am facing very slow response >> times in both the RGUI and the R console. When running this SQL >> statement directly in a SQL client I have processing times that are a >> lot lot faster (which means that the SQL statement itself is not the >> problem). >> >> Did any of you compare RJDBC vs RMySQL or is there a better, more >> efficient way to extract large data from databases using R? Would you >> recommend dumping data out completely into flat files and working with >> flat files instead? I expected that this would not be such a problem >> given that businesses maintain their data in DBs and R is supposed to >> be good in shifting around data. Am I doing something wrong? > > Well, if you don't show people what you have done, how can anybody tell if > you are doing something wrong or not? > > I have no experience with RJDBC, so cannot say anything about that. However, > I have always found RMySQL to be speedy enough. As an example: > >> library(RMySQL) > Loading required package: DBI >> con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user = >> "genome", dbname = "hg18") >> system.time(a <- dbGetQuery(con, "select name, chromEnd from snp129 where >> chrom='chr1' and chromStart between 1 and 1e8;") > + ) > user system elapsed > 7.95 0.06 38.59 >> dim(a) > [1] 508676 2 > > So 40 seconds to get half a million records. Since this is via the internet, > I have to imagine things would be much faster querying a local DB. > > But then you never say what constitutes 'slow' for you, so maybe this is > slow as well? > > Best, > > Jim > > >> >> Ralf >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > > -- > James W. MacDonald, M.S. > Biostatistician > Douglas Lab > University of Michigan > Department of Human Genetics > 5912 Buhl > 1241 E. Catherine St. > Ann Arbor MI 48109-5618 > 734-615-7826 > ********************************************************** > Electronic Mail is not secure, may not be read every day, and should not be > used for urgent or sensitive issues > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.