Hi Gabor,

Thank you for your willingness to help me through this. The code you sent works on my machine exactly as it does on yours. Unfortunately, when I run the same code on the 1.3GB file, it creates the table structure but doesn't read in a single line [confirmed with sqldf("select * from mytab", dbname = "mydb")].

Though I don't expect anyone to download it, the file I'm using is ss09pusa.csv from http://www2.census.gov/acs2009_1yr/pums/csv_pus.zip. I tested both sets of code on my work desktop and my personal laptop, so the problem isn't machine-specific (although it might be Windows- or 64-bit-specific).
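One guess I haven't been able to verify: if the problem is Windows (DOS) line endings in the census file, read.csv.sql() accepts an eol argument that lets you state the end-of-line convention explicitly. A sketch of what I mean (untested on my end):

```r
library(sqldf)

# untested guess: declare DOS line endings explicitly, in case the
# importer is not recognizing complete rows on Windows
read.csv.sql("R:\\American Community Survey\\Data\\2009\\ss09pusa.csv",
             sql    = "create table mytab as select * from file",
             dbname = "mydb",
             eol    = "\r\n")
```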
Do you have any other ideas as to how I might diagnose what's going on here? Or, alternatively, is there some workaround that would get this giant CSV into a database? If you think there's a reasonable way to use SQLite's IMPORT command with RSQLite, that seems like it would import the fastest, but I don't know whether it's compatible with DBI on Windows.

Thanks again!
Anthony

> read.csv.sql("R:\\American Community Survey\\Data\\2009\\ss09pusa.csv",
+     sql = "create table mytab as select * from file", dbname = "mydb")
NULL
Warning message:
closing unused connection 3 (R:\American Community Survey\Data\2009\ss09pusa.csv)
>
> # how many records are in the table?
> N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
>
> # read in chunks and display what we have read
> k <- 4 # number of records to read at once
> for(i in seq(0, N-1, k)) {
+   s <- sprintf("select * from mytab limit %d, %d", i, k)
+   print(sqldf(s, dbname = "mydb"))
+ }
Error in seq.default(0, N - 1, k) : wrong sign in 'by' argument
> N
[1] 0

(So the table exists but holds zero rows; with N equal to 0, seq(0, N - 1, k) becomes seq(0, -1, 4), which is what triggers the "wrong sign" error.)

On Mon, Nov 15, 2010 at 12:24 PM, Gabor Grothendieck <ggrothendi...@gmail.com> wrote:

> On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico <ajdam...@gmail.com> wrote:
> > Hi Gabor,
> >
> > Thank you for the prompt reply. I definitely looked over all of the
> > examples on the code.google.com sqldf page before sending, which is why I
> > wrote the code
> >
> > read.csv.sql("ss09pusa.csv", sql = "create table ss09pusa as select * from
> > file", dbname = "sqlite")
> >
> > directly pulled from their code -- read.csv.sql("~/tmp.csv", sql = "create
> > table mytab as select * from file", dbname = "mydb")
> >
> > ..but I don't understand why this helps me around the memory problem, since
> > I think it still all gets read into memory. Is there a way to do this line
> > by line?
>
> OK. Maybe it's something else.
>
> The reading in of the file into the database should not be a resource
> problem provided you have enough disk space and appropriate
> permissions.
> sqldf / RSQLite are used to get SQLite to do the import, so the data
> never goes through R at that stage and R's limitations can't affect the
> reading of the file into the SQLite database. When you read it back from
> the SQLite database, R's limitations do come into effect, so you just
> have to be sure not to read too much in at a time. The use of
> create table ... as select ... is to prevent sqldf from deleting the
> table: sqldf is normally used in a fashion where you don't want to know
> about the back-end database, so it tries to create it and delete it
> behind the scenes, but here you want to use the database explicitly, so
> you have to work around that.
>
> Try this example. It should be reproducible, so you just have to copy it
> and paste it into your R session. Uncomment the indicated line if you
> want to remove any pre-existing mydb file in the current directory. Try
> it in a fresh R session just to be sure that nothing mucks it up.
>
> library(sqldf)
>
> # uncomment next line to make sure we are starting clean
> # if (file.exists("mydb")) file.remove("mydb")
>
> # create new database
> sqldf("attach 'mydb' as new")
>
> # create a new file.  BOD is built into R and has 6 rows.
> write.table(BOD, file = "tmp.csv", quote = FALSE, sep = ",")
>
> # read new file into database
> read.csv.sql("tmp.csv", sql = "create table mytab as select * from file",
>     dbname = "mydb")
>
> # how many records are in table?
> N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
>
> # read in chunks and display what we have read
> k <- 4 # no of records to read at once
> for(i in seq(0, N-1, k)) {
>   s <- sprintf("select * from mytab limit %d, %d", i, k)
>   print(sqldf(s, dbname = "mydb"))
> }
>
> On my machine I get this output:
>
>   Time demand
> 1    1    8.3
> 2    2   10.3
> 3    3   19.0
> 4    4   16.0
>   Time demand
> 1    5   15.6
> 2    7   19.8
>
> showing that it read the 6-row BOD data frame in chunks of 4, as required.
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
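[A fallback sketch, in case read.csv.sql() keeps producing an empty table on this setup: import the CSV in chunks through R with RSQLite's dbWriteTable(append = TRUE). This assumes the standard RSQLite/DBI calls (dbConnect, dbWriteTable, dbGetQuery, dbDisconnect); it is slower than a pure SQLite import, since each chunk passes through R, but only one chunk is ever in memory. It also guards the paging loop against N == 0, which is what produced the seq() "wrong sign" error earlier in this thread.]

```r
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "mydb")

# import the CSV in chunks so the whole 1.3GB file is never in R at once
infile <- file("ss09pusa.csv", open = "r")
header <- strsplit(readLines(infile, n = 1), ",")[[1]]
chunk  <- 10000                      # rows per chunk; tune to available RAM
repeat {
  lines <- readLines(infile, n = chunk)
  if (length(lines) == 0) break
  df <- read.csv(textConnection(lines), header = FALSE,
                 col.names = header, stringsAsFactors = FALSE)
  dbWriteTable(con, "mytab", df, append = TRUE, row.names = FALSE)
  if (length(lines) < chunk) break   # last (partial) chunk written
}
close(infile)

# page through the result, guarding against an empty table
N <- dbGetQuery(con, "select count(*) from mytab")[[1]]
k <- 4
if (N > 0) {
  for (i in seq(0, N - 1, k)) {
    print(dbGetQuery(con, sprintf("select * from mytab limit %d, %d", i, k)))
  }
}
dbDisconnect(con)
```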