Re: [R] RJDBC vs RMySQL vs ???
Unfortunately, I have a lot of errors with RMySQL -- but that is another thread... Ralf On Thu, Jun 24, 2010 at 10:31 AM, James W. MacDonald wrote: > Hi Ralf, > > Ralf B wrote: >> >> Sorry for the lack of details. Since I run the same SQL first directly >> on MySQL (using the MySQL Query Browser) and then again using R >> through the RJDBC interface, I assume that I won't simply have a badly >> constructed SQL query. However, just to clear possible objection, here >> the SQL: >> >> >> # Extracts vector of data points >> getData <- function(connection) { >> queryStart <- "SELECT id1, id2, x, y FROM `mytable` " >> queryEnd <- ";" >> query <- paste(queryStart, " WHERE id1 IN(", id1s, ") AND id2 IN(", >> id2s, ") AND subtype='TYPE1'", queryEnd) >> # execute query >> data = dbGetQuery(connection, query) >> return(data) >> } >> >> When running this method using either RGUI or the command line, I have >> a runtime that reaches an incredible 10 minutes (!) for selecting >> about 50k - 80k data points (which I consider not much) based on the >> range of IDs I choose. The table size is about 5-8 million data points >> total. The same SQL query directly executed in MySQL Query Browser >> takes about 20 seconds which I would consider fine. There are no >> indices created for any of the fields but since the query runs a lot >> faster in the query browser I don't suspect this to be the main >> reason. >> >> Any ideas? > > Well, the RJDBC rforge page has this note: > > Note: The current implementation of RJDBC is done entirely in R, no Java > code is used. This means that it may not be extremely efficient and could be > potentially sped up by using Java native code. However, it was sufficient > for most tasks we tested. If you have performance issues with RJDBC, please > let us know and tell us more details about your test case. > > And from my quick peek at the page, it appears RJDBC is designed to allow > one to query any DBMS. Since RMySQL is MySQL-specific, it may be more > efficient. Anyway, why don't you just try it and see? > > Best, > > Jim > > >> >> Best, >> Ralf >> >> >> >> >> On Wed, Jun 23, 2010 at 4:36 PM, James W. MacDonald >> wrote: >>> >>> Hi Ralf, >>> >>> Ralf B wrote: I am running a simple SQL SELECT statement that involvs 50k + data points using R and the RJDBC interface. I am facing very slow response times in both the RGUI and the R console. When running this SQL statement directly in a SQL client I have processing times that are a lot lot faster (which means that the SQL statement itself is not the problem). Did any of you compare RJDBC vs RMySQL or is there a better, more efficient way to extract large data from databases using R? Would you recommend dumping data out completely into flat files and working with flat files instead? I expected that this would not be such a problem given that businesses maintain their data in DBs and R is supposed to be good in shifting around data. Am I doing something wrong? >>> >>> Well, if you don't show people what you have done, how can anybody tell >>> if >>> you are doing something wrong or not? >>> >>> I have no experience with RJDBC, so cannot say anything about that. >>> However, >>> I have always found RMySQL to be speedy enough. As an example: >>> library(RMySQL) >>> >>> Loading required package: DBI con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user = "genome", dbname = "hg18") system.time(a <- dbGetQuery(con, "select name, chromEnd from snp129 where chrom='chr1' and chromStart between 1 and 1e8;") >>> >>> + ) >>> user system elapsed >>> 7.95 0.06 38.59 dim(a) >>> >>> [1] 508676 2 >>> >>> So 40 seconds to get half a million records. Since this is via the >>> internet, >>> I have to imagine things would be much faster querying a local DB. >>> >>> But then you never say what constitutes 'slow' for you, so maybe this is >>> slow as well? >>> >>> Best, >>> >>> Jim >>> >>> Ralf __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. >>> >>> -- >>> James W. MacDonald, M.S. >>> Biostatistician >>> Douglas Lab >>> University of Michigan >>> Department of Human Genetics >>> 5912 Buhl >>> 1241 E. Catherine St. >>> Ann Arbor MI 48109-5618 >>> 734-615-7826 >>> ** >>> Electronic Mail is not secure, may not be read every day, and should not >>> be >>> used for urgent or sensitive issues >>> > > -- > James W. MacDonald, M.S. > Biostatistician > Douglas Lab > University of Michigan > Department of Human Genetics > 5912 Buhl > 1241 E. Catherine St. > Ann Arbor MI 48109-5618 > 734-61
Re: [R] RJDBC vs RMySQL vs ???
Hi Ralf, Ralf B wrote: Sorry for the lack of details. Since I run the same SQL first directly on MySQL (using the MySQL Query Browser) and then again using R through the RJDBC interface, I assume that I won't simply have a badly constructed SQL query. However, just to clear possible objection, here the SQL: # Extracts vector of data points getData <- function(connection) { queryStart <- "SELECT id1, id2, x, y FROM `mytable` " queryEnd <- ";" query <- paste(queryStart, " WHERE id1 IN(", id1s, ") AND id2 IN(", id2s, ") AND subtype='TYPE1'", queryEnd) # execute query data = dbGetQuery(connection, query) return(data) } When running this method using either RGUI or the command line, I have a runtime that reaches an incredible 10 minutes (!) for selecting about 50k - 80k data points (which I consider not much) based on the range of IDs I choose. The table size is about 5-8 million data points total. The same SQL query directly executed in MySQL Query Browser takes about 20 seconds which I would consider fine. There are no indices created for any of the fields but since the query runs a lot faster in the query browser I don't suspect this to be the main reason. Any ideas? Well, the RJDBC rforge page has this note: Note: The current implementation of RJDBC is done entirely in R, no Java code is used. This means that it may not be extremely efficient and could be potentially sped up by using Java native code. However, it was sufficient for most tasks we tested. If you have performance issues with RJDBC, please let us know and tell us more details about your test case. And from my quick peek at the page, it appears RJDBC is designed to allow one to query any DBMS. Since RMySQL is MySQL-specific, it may be more efficient. Anyway, why don't you just try it and see? Best, Jim Best, Ralf On Wed, Jun 23, 2010 at 4:36 PM, James W. MacDonald wrote: Hi Ralf, Ralf B wrote: I am running a simple SQL SELECT statement that involvs 50k + data points using R and the RJDBC interface. I am facing very slow response times in both the RGUI and the R console. When running this SQL statement directly in a SQL client I have processing times that are a lot lot faster (which means that the SQL statement itself is not the problem). Did any of you compare RJDBC vs RMySQL or is there a better, more efficient way to extract large data from databases using R? Would you recommend dumping data out completely into flat files and working with flat files instead? I expected that this would not be such a problem given that businesses maintain their data in DBs and R is supposed to be good in shifting around data. Am I doing something wrong? Well, if you don't show people what you have done, how can anybody tell if you are doing something wrong or not? I have no experience with RJDBC, so cannot say anything about that. However, I have always found RMySQL to be speedy enough. As an example: library(RMySQL) Loading required package: DBI con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user = "genome", dbname = "hg18") system.time(a <- dbGetQuery(con, "select name, chromEnd from snp129 where chrom='chr1' and chromStart between 1 and 1e8;") + ) user system elapsed 7.950.06 38.59 dim(a) [1] 508676 2 So 40 seconds to get half a million records. Since this is via the internet, I have to imagine things would be much faster querying a local DB. But then you never say what constitutes 'slow' for you, so maybe this is slow as well? Best, Jim Ralf __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] RJDBC vs RMySQL vs ???
Sorry for the lack of details. Since I run the same SQL first directly on MySQL (using the MySQL Query Browser) and then again using R through the RJDBC interface, I assume that I won't simply have a badly constructed SQL query. However, just to clear possible objection, here the SQL: # Extracts vector of data points getData <- function(connection) { queryStart <- "SELECT id1, id2, x, y FROM `mytable` " queryEnd <- ";" query <- paste(queryStart, " WHERE id1 IN(", id1s, ") AND id2 IN(", id2s, ") AND subtype='TYPE1'", queryEnd) # execute query data = dbGetQuery(connection, query) return(data) } When running this method using either RGUI or the command line, I have a runtime that reaches an incredible 10 minutes (!) for selecting about 50k - 80k data points (which I consider not much) based on the range of IDs I choose. The table size is about 5-8 million data points total. The same SQL query directly executed in MySQL Query Browser takes about 20 seconds which I would consider fine. There are no indices created for any of the fields but since the query runs a lot faster in the query browser I don't suspect this to be the main reason. Any ideas? Best, Ralf On Wed, Jun 23, 2010 at 4:36 PM, James W. MacDonald wrote: > Hi Ralf, > > Ralf B wrote: >> >> I am running a simple SQL SELECT statement that involvs 50k + data >> points using R and the RJDBC interface. I am facing very slow response >> times in both the RGUI and the R console. When running this SQL >> statement directly in a SQL client I have processing times that are a >> lot lot faster (which means that the SQL statement itself is not the >> problem). >> >> Did any of you compare RJDBC vs RMySQL or is there a better, more >> efficient way to extract large data from databases using R? Would you >> recommend dumping data out completely into flat files and working with >> flat files instead? I expected that this would not be such a problem >> given that businesses maintain their data in DBs and R is supposed to >> be good in shifting around data. Am I doing something wrong? > > Well, if you don't show people what you have done, how can anybody tell if > you are doing something wrong or not? > > I have no experience with RJDBC, so cannot say anything about that. However, > I have always found RMySQL to be speedy enough. As an example: > >> library(RMySQL) > Loading required package: DBI >> con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user = >> "genome", dbname = "hg18") >> system.time(a <- dbGetQuery(con, "select name, chromEnd from snp129 where >> chrom='chr1' and chromStart between 1 and 1e8;") > + ) > user system elapsed > 7.95 0.06 38.59 >> dim(a) > [1] 508676 2 > > So 40 seconds to get half a million records. Since this is via the internet, > I have to imagine things would be much faster querying a local DB. > > But then you never say what constitutes 'slow' for you, so maybe this is > slow as well? > > Best, > > Jim > > >> >> Ralf >> >> __ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > > -- > James W. MacDonald, M.S. > Biostatistician > Douglas Lab > University of Michigan > Department of Human Genetics > 5912 Buhl > 1241 E. Catherine St. > Ann Arbor MI 48109-5618 > 734-615-7826 > ** > Electronic Mail is not secure, may not be read every day, and should not be > used for urgent or sensitive issues > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] RJDBC vs RMySQL vs ???
Hi Ralf, Ralf B wrote: I am running a simple SQL SELECT statement that involvs 50k + data points using R and the RJDBC interface. I am facing very slow response times in both the RGUI and the R console. When running this SQL statement directly in a SQL client I have processing times that are a lot lot faster (which means that the SQL statement itself is not the problem). Did any of you compare RJDBC vs RMySQL or is there a better, more efficient way to extract large data from databases using R? Would you recommend dumping data out completely into flat files and working with flat files instead? I expected that this would not be such a problem given that businesses maintain their data in DBs and R is supposed to be good in shifting around data. Am I doing something wrong? Well, if you don't show people what you have done, how can anybody tell if you are doing something wrong or not? I have no experience with RJDBC, so cannot say anything about that. However, I have always found RMySQL to be speedy enough. As an example: > library(RMySQL) Loading required package: DBI > con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user = "genome", dbname = "hg18") > system.time(a <- dbGetQuery(con, "select name, chromEnd from snp129 where chrom='chr1' and chromStart between 1 and 1e8;") + ) user system elapsed 7.950.06 38.59 > dim(a) [1] 508676 2 So 40 seconds to get half a million records. Since this is via the internet, I have to imagine things would be much faster querying a local DB. But then you never say what constitutes 'slow' for you, so maybe this is slow as well? Best, Jim Ralf __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.