On Thu, Jul 14, 2011 at 10:33 AM, Mandans <mandan...@yahoo.com> wrote:
> SQldf with sqlite and H2
>
> I have a large csv file (about 2GB) and wanted to import the file into R and 
> do some filtering and analysis. Came across sqldf ( a great idea and product) 
> and was trying to play around to see what would be the best method of doing 
> this. csv file is comma delimited with some columns having comma inside the 
> quoation like this "John, Doe".
>
> I tried this first
>
> #######
> library(sqldf)
> sqldf("attach testdb as new")
> In.File <- "C:/JP/Temp/2008.csv"
> read.csv.sql(In.File, sql = "create table table1 as select * from file",
>  dbname = "testdb")
>
> It errored out with message
>
> NULL
> Warning message:
> closing unused connection 3 (C:/JP/Temp/2008.csv)
>
> When this failed, I converted this file from comma delimited to tab delimited 
> and used this command
>
> #########
> read.csv.sql(In.File, sql = "create table table1 as select * from file",
>  dbname = "testdb", sep = "\t")
>
> and this worked, it created testdb sqlite file with the size of 3GB
>
> now my question is in 3 parts.
>
> 1. Is it possible to create a dataframe with appropriate column classes and 
> use that column classes when I use the read.csv.sql command to create the 
> table. Something like may be create the table from that DF and then update 
> with read.csv.sql.?
>
> Any example code will be really helpful.

Here is an example of using method = "name__class".  Note there are
two underscores in a row.  It appears I neglected to document that
Date2 means convert from character representation whereas Date means
convert from numeric representation.  It would also be possible to use
method = "raw" and then coerce the columns yourself afterwards.

# create test file
Lines <- 'A__Date2|B
2000-01-01|x,y
2000-01-02|c,d
'
tf <- tempfile()
cat(Lines, file = tf)


library(sqldf)
DF <- read.csv.sql(tf, sep = "|", method = "name__class")
str(DF)

>
> 2. If we use the H2 database instead of default sqlite and use the readcsv 
> option, will that be faster and is there a way we can specify the above 
> thought of applying a DF class to table column properties and update with 
> CSVREAD
>
> library(RH2)
> something like SELECT * FROM CSVREAD('C:/JP/Temp/2008.csv')
>
> Any example code will be really helpful.

Sorry, I haven't tested the speed of this.  postgresql and mysql, both
supported by sqldf, also have builtin methods to read files. If I had
to guess I would guess that mysql would be fastest but this would have
to be tested.

>
> 3. How do we specify where the H2 file is saved. Saw something like this, 
> when I ran this example from RH2 package, couldn't find the file in the 
> working directory.
>
> con <- dbConnect(H2(), "jdbc:h2:~/test", "sa", "")

~ means your home directory so ~/test means test is in the home directory.

Try

normalizePath("~")
normalizePath("~/test")
etc.

to see what they refer to.

Regards.

>
> Sorry for the long mail. Appreciate all for building a great community and 
> for the wonderful software in R.
> Thanks for Gabor Grothendieck for bring sqldf to this great community.
>
> Any help or direction you can provide in this is highly appreciated.
>
> Thanks all.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to