On Thu, Jul 14, 2011 at 10:33 AM, Mandans <mandan...@yahoo.com> wrote: > SQldf with sqlite and H2 > > I have a large csv file (about 2GB) and wanted to import the file into R and > do some filtering and analysis. Came across sqldf ( a great idea and product) > and was trying to play around to see what would be the best method of doing > this. csv file is comma delimited with some columns having comma inside the > quoation like this "John, Doe". > > I tried this first > > ####### > library(sqldf) > sqldf("attach testdb as new") > In.File <- "C:/JP/Temp/2008.csv" > read.csv.sql(In.File, sql = "create table table1 as select * from file", > dbname = "testdb") > > It errored out with message > > NULL > Warning message: > closing unused connection 3 (C:/JP/Temp/2008.csv) > > When this failed, I converted this file from comma delimited to tab delimited > and used this command > > ######### > read.csv.sql(In.File, sql = "create table table1 as select * from file", > dbname = "testdb", sep = "\t") > > and this worked, it created testdb sqlite file with the size of 3GB > > now my question is in 3 parts. > > 1. Is it possible to create a dataframe with appropriate column classes and > use that column classes when I use the read.csv.sql command to create the > table. Something like may be create the table from that DF and then update > with read.csv.sql.? > > Any example code will be really helpful.
Here is an example of using method = "name__class". Note there are two underscores in a row. It appears I neglected to document that Date2 means convert from character representation whereas Date means convert from numeric representation. It would also be possible to use method = "raw" and then coerce the columns yourself afterwards. # create test file Lines <- 'A__Date2|B 2000-01-01|x,y 2000-01-02|c,d ' tf <- tempfile() cat(Lines, file = tf) library(sqldf) DF <- read.csv.sql(tf, sep = "|", method = "name__class") str(DF) > > 2. If we use the H2 database instead of default sqlite and use the readcsv > option, will that be faster and is there a way we can specify the above > thought of applying a DF class to table column properties and update with > CSVREAD > > library(RH2) > something like SELECT * FROM CSVREAD('C:/JP/Temp/2008.csv') > > Any example code will be really helpful. Sorry, I haven't tested the speed of this. postgresql and mysql, both supported by sqldf, also have builtin methods to read files. If I had to guess I would guess that mysql would be fastest but this would have to be tested. > > 3. How do we specify where the H2 file is saved. Saw something like this, > when I ran this example from RH2 package, couldn't find the file in the > working directory. > > con <- dbConnect(H2(), "jdbc:h2:~/test", "sa", "") ~ means your home directory so ~/test means test is in the home directory. Try normalizePath("~") normalizePath("~/test") etc. to see what they refer to. Regards. > > Sorry for the long mail. Appreciate all for building a great community and > for the wonderful software in R. > Thanks for Gabor Grothendieck for bring sqldf to this great community. > > Any help or direction you can provide in this is highly appreciated. > > Thanks all. > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.