Re: [R] How to Read a Large CSV into a Database with R
Thanks

On 11/16/2010 12:41 PM, Seth Falcon wrote:
> Hi Abhijit,
>
> [I've cc'd R-help to keep the discussion on the list]
>
> On Tue, Nov 16, 2010 at 8:06 AM, Abhijit Dasgupta wrote:
>> Seth,
>>
>> I was looking for something like this too. I've a question: if you're
>> reading the data from a connection, does R start reading the next chunk
>> of data right after the previous chunk, or do we need to keep track of
>> things using "skip"?
>
> The purpose of using a file connection is to allow R to keep its place
> in the file as it reads and not have to re-read or skip. This is
> considerably more efficient.
Re: [R] How to Read a Large CSV into a Database with R
Hi Abhijit,

[I've cc'd R-help to keep the discussion on the list]

On Tue, Nov 16, 2010 at 8:06 AM, Abhijit Dasgupta wrote:
> Seth,
>
> I was looking for something like this too. I've a question: if you're
> reading the data from a connection, does R start reading the next chunk
> of data right after the previous chunk, or do we need to keep track of
> things using "skip"?

The purpose of using a file connection is to allow R to keep its place in
the file as it reads and not have to re-read or skip. This is considerably
more efficient.

--
Seth Falcon | @sfalcon | http://userprimary.net/
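A tiny, self-contained sketch of the point above, using the 6-row BOD data
set that ships with R: successive reads from an open connection pick up
where the previous read stopped, so there is no "skip" bookkeeping to do.

write.csv(BOD, "tmp.csv", row.names = FALSE)
con <- file("tmp.csv", "r")
readLines(con, n = 1)                   # consume the header line
read.table(con, nrows = 3, sep = ",")   # rows 1-3
read.table(con, nrows = 3, sep = ",")   # rows 4-6: continues automatically
close(con)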
Re: [R] How to Read a Large CSV into a Database with R
Hi Anthony,

On Mon, Nov 15, 2010 at 7:07 AM, Anthony Damico wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying
> to insert a very large CSV file into a SQLite database. I'm pretty new to
> working with databases in R, so I apologize if I'm overlooking something
> obvious here.

Working with large data and doing operations in bounded memory tends not to
be among the most obvious things to accomplish in R for newcomers.

> I need to combine both of these CSV files into one table (within a
> database), so I think that I'd need a SQL manipulation technique that
> reads everything line by line, instead of pulling it all into memory.

Yes, that sounds like the right approach. Below is some sample code that
does this and avoids reading the entire data set into memory. The key points
are to use read.table on a file connection and to read in a batch of lines
at a time. Each batch is inserted into the db using a prepared query. The
example is complicated a bit because read.table on a file connection raises
an error if no lines are available, so I chose to use tryCatch to handle
that. A cleaner approach might be to check the number of rows read and break
out of the while loop if the count is less than the batch size. Anyhow, see
if this approach works for you.

library("RSQLite")

file_list <- c("ss09pusa.csv", "ss09pusb.csv")
input <- file(file_list[1], "r")
db <- dbConnect(SQLite(), dbname = "example.sqlite")

# use the header line to create a table with one TEXT column per field
header <- readLines(input, n = 1)
fields <- strsplit(header, ",")[[1]]
colTypes <- rep("TEXT", length(fields))
colDecl <- paste(fields, colTypes)
sql <- sprintf("CREATE TABLE ss09 (%s)", paste(colDecl, collapse = ", "))
dbGetQuery(db, sql)

colClasses <- rep("character", length(fields))
sql.in <- sprintf("INSERT INTO ss09 VALUES (%s)",
                  paste(rep("?", length(fields)), collapse = ","))

chunk_size <- 25
dbBeginTransaction(db)
tryCatch({
    while (TRUE) {
        part <- read.table(input, nrows = chunk_size, sep = ",",
                           colClasses = colClasses, comment.char = "")
        dbGetPreparedQuery(db, sql.in, bind.data = part)
    }
}, error = function(e) {
    if (grepl("no lines available", conditionMessage(e)))
        TRUE
    else
        stop(conditionMessage(e))
})
dbCommit(db)

--
Seth Falcon | @sfalcon | http://userprimary.net/
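A minimal sketch of the "cleaner approach" mentioned above: stop when a
short chunk comes back instead of waiting for the error. It reuses the
input, db, sql.in, colClasses and chunk_size objects defined in the code
above.

dbBeginTransaction(db)
repeat {
    part <- read.table(input, nrows = chunk_size, sep = ",",
                       colClasses = colClasses, comment.char = "")
    dbGetPreparedQuery(db, sql.in, bind.data = part)
    if (nrow(part) < chunk_size) break   # short chunk: end of file reached
}
dbCommit(db)
close(input)

One corner case: if the number of data rows is an exact multiple of
chunk_size, the final read.table() call still raises the "no lines
available" error, so a tryCatch around the loop remains useful there.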
Re: [R] How to Read a Large CSV into a Database with R
On Mon, Nov 15, 2010 at 1:28 PM, Anthony Damico wrote:
> Hi Gabor,
>
> Thank you for your willingness to help me through this. The code you sent
> works on my machine exactly the same way as it does on yours.
> Unfortunately, when I run the same code on the 1.3GB file, it creates the
> table structure but doesn't read in a single line [confirmed with
> sqldf("select * from mytab", dbname="mydb")]. Though I don't expect anyone
> to download it, the file I'm using is ss09pusa.csv from
> http://www2.census.gov/acs2009_1yr/pums/csv_pus.zip. I tested both sets of
> code on my work desktop and personal laptop, so it's not machine-specific
> (although it might be Windows- or 64-bit-specific).
>
> Do you have any other ideas as to how I might diagnose what's going on
> here? Or, alternatively, is there some workaround that would get this
> giant CSV into a database? If you think there's a reasonable way to use
> the IMPORT command with RSQLite, that seems like it would import the
> fastest, but I don't know that it's compatible with DBI on Windows.
>
> Thanks again!
> Anthony

I tried it and the sqlite driver issued an error. Seems like a driver or
sqlite issue. Here are some other approaches:

1. Create an sqlite sql script using this R code and then have the shell
command run it using the sqlite3 command line processor.

fn <- "ss09pusa.csv"

# read first line
DF <- read.csv(fn, nrow = 1)

# create statement
s <- sprintf("
create table %s (%s);
.separator ,
.import %s %s
", "mytab", toString(names(DF)), fn, "mytab")

# output sql file
fn.out <- paste(fn, "sql", sep = ".")
cat(s, file = fn.out)

# run import using generated sql file
# Use sqlite3 command line processor.
shell(paste("sqlite3 mydb <", fn.out))

2. Try sqldf with the H2 database rather than sqlite. See the H2 section on
the sqldf home page.

library(sqldf)
library(RH2)
sqldf("create table mytab as select * from CSVREAD('ss09pusa.csv')",
      dbname = "mydb2")

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
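To see the shape of the script this code writes, here is a sketch for a
hypothetical two-column file small.csv with columns ST and PUMA (the file
and its column names are made up for illustration). Note that because the
table is created before the import, older sqlite3 versions of .import will
also load the CSV's header line as a data row, so the header may need to be
removed beforehand or the stray row deleted afterwards.

create table mytab (ST, PUMA);
.separator ,
.import small.csv mytab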
Re: [R] How to Read a Large CSV into a Database with R
On Mon, 15 Nov 2010 13:28:40 -0500, Anthony Damico wrote:
> Do you have any other ideas as to how I might diagnose what's going on
> here? Or, alternatively, is there some workaround that would get this
> giant CSV into a database? If you think there's a reasonable way to use
> the IMPORT command with RSQLite, that seems like it would import the
> fastest, but I don't know that it's compatible with DBI on Windows.
>
> Thanks again!
> Anthony

If you are able to successfully read in one entire 1.2 GB file at a time, I
would skip sqldf and do this (after setting the working directory
appropriately):

connSQLite = dbConnect(dbDriver("SQLite"), dbname = "acs")

ss09pusa = read.csv("ss09pusa.csv", header = TRUE)
dbWriteTable(connSQLite, "acs2009", ss09pusa)
rm(ss09pusa)

ss09pusb = read.csv("ss09pusb.csv", header = TRUE)
dbWriteTable(connSQLite, "acs2009", ss09pusb, append = TRUE)
rm(ss09pusb)

# Test select of specific columns
sql = "SELECT ST, PUMA, ADJINC FROM acs2009 WHERE ST = 33"
dfIncome = dbGetQuery(connSQLite, sql)

I was *not* able to load one entire table at a time, but I was able to make
it work by combining sqldf to read in chunks at a time and dbWriteTable from
RSQLite to write each chunk to the database. The read would then look like,
for example:

ss09pusa = read.csv.sql("ss09pusa.csv",
                        sql = paste("SELECT * FROM file WHERE ST =", i))

where i is an iteration over the state FIPS codes. (You could just use 1 to
56, even though there are a few missing numbers in there. The searches for
nonexistent records will take time but will otherwise be harmless.) The
dbWriteTable would be the same, with every write after the first one using
append = TRUE.

Also, I assume you will want to do the read from csv / write to SQLite
*once*, then maintain the SQLite database for other sessions.

We also use large census and IPUMS data sets, and we keep everything in a
Postgres backend, which we access with RPostgreSQL. Much easier to keep
everything organized in an RDBMS than to pass around these monstrous csv
files.
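A rough sketch of the state-by-state loop described above, assuming the
working directory holds ss09pusa.csv and the SQLite file "acs"; the zero-row
guard and the first-write flag are additions for illustration.

library(sqldf)      # provides read.csv.sql
library(RSQLite)

connSQLite = dbConnect(dbDriver("SQLite"), dbname = "acs")
first = TRUE
for (i in 1:56) {                        # state FIPS codes, gaps included
    chunk = read.csv.sql("ss09pusa.csv",
                         sql = paste("SELECT * FROM file WHERE ST =", i))
    if (nrow(chunk) == 0) next           # skip the unused FIPS codes
    dbWriteTable(connSQLite, "acs2009", chunk, append = !first)
    first = FALSE
}

The second file (ss09pusb.csv) would then go through the same loop with
append = TRUE for every write, so both files end up in the one acs2009
table.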
Re: [R] How to Read a Large CSV into a Database with R
> From: ajdam...@gmail.com
> Date: Mon, 15 Nov 2010 13:28:40 -0500
> To: ggrothendi...@gmail.com; r-help@r-project.org
> Subject: Re: [R] How to Read a Large CSV into a Database with R
>
> Hi Gabor,
>
> Thank you for your willingness to help me through this. The code you sent
> works on my machine exactly the same way as it does on yours.
> Unfortunately, when I run the same code on the 1.3GB file, it creates the
> table structure but doesn't read in a single line [confirmed with
> sqldf("select * from mytab", dbname="mydb")]. Though I don't expect anyone
> to download it, the file I'm using is ss09pusa.csv from
> http://www2.census.gov/acs2009_1yr/pums/csv_pus.zip. I tested both sets of
> code on my work desktop and personal laptop, so it's not machine-specific
>
> Do you have any other ideas as to how I might diagnose what's going on
> here? Or, alternatively, is there some workaround that would get this
> giant CSV into a database? If you think there's a reasonable way to use
> the IMPORT command with RSQLite, that seems like it would import the
> fastest,

I think someone else suggested external approaches, and indeed I have loaded
census TIGER files into a db for making maps and mobile apps etc. I would
mention again that this may eliminate a memory limit and let you limp along,
but presumably you want a streaming source or something if your analysis has
predictable access patterns and this data will not be used as part of a
hotel reservation system. Structured input data (I think TIGER was line
oriented) should be easy to load into a db with a bash script or Java app.

> Thanks again!
> Anthony
>
> > read.csv.sql("R:\\American Community Survey\\Data\\2009\\ss09pusa.csv",
>     sql = "create table mytab as select * from file", dbname = "mydb")
> NULL
> Warning message:
> closing unused connection 3 (R:\American Community
> Survey\Data\2009\ss09pusa.csv)
> >
> > # how many records are in table?
> > N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
> >
> > # read in chunks and display what we have read
> >
> > k <- 4 # no of records to read at once
> > for(i in seq(0, N-1, k)) {
> +   s <- sprintf("select * from mytab limit %d, %d", i, k)
> +   print(sqldf(s, dbname = "mydb"))
> + }
> Error in seq.default(0, N - 1, k) : wrong sign in 'by' argument
> > N
> [1] 0
>
> On Mon, Nov 15, 2010 at 12:24 PM, Gabor Grothendieck
> <ggrothendi...@gmail.com> wrote:
>
> > On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico wrote:
> > > Hi Gabor,
> > >
> > > Thank you for the prompt reply. I definitely looked over all of the
> > > examples on the code.google.com sqldf page before sending, which is
> > > why I wrote the code
> > >
> > > read.csv.sql("ss09pusa.csv", sql="create table ss09pusa as select *
> > > from file", dbname="sqlite")
> > >
> > > directly pulled from their code -- read.csv.sql("~/tmp.csv", sql =
> > > "create table mytab as select * from file", dbname = "mydb")
> > >
> > > ..but I don't understand why this helps me around the memory problem,
> > > since I think it still all gets read into memory. Is there a way to
> > > do this line by line?
> >
> > OK. Maybe it's something else.
> >
> > The reading in of the file into the database should not be a resource
> > problem provided you have enough disk space and appropriate permissions.
> > sqldf / RSQLite are used to get sqlite to do it so that the data never
> > goes through R at that stage, so R limitations can't affect the reading
> > in to the sqlite database. When you read it from the sqlite database
> > then R limitations come into effect, so you just have to be sure not to
> > read too much in at a time. The use of create table ... as select ... is
> > to prevent sqldf from deleting the table, since sqldf is normally used
> > in a fashion where you don't want to know about the back end databases,
> > so it tries to create them and delete them behind the scenes, but here
> > you want to explicitly use them so you have to work around that.
> >
> > Try this example. It should be reproducible so you just have to copy it
> > and paste it into your R session. Uncomment the indicated line if you
> > want to be able to remove any pre-existing mydb file in the current
> > directory. Try it in a fresh R session just to be sure that nothing
> > mucks it up.
Re: [R] How to Read a Large CSV into a Database with R
Hi Gabor,

Thank you for your willingness to help me through this. The code you sent
works on my machine exactly the same way as it does on yours. Unfortunately,
when I run the same code on the 1.3GB file, it creates the table structure
but doesn't read in a single line [confirmed with sqldf("select * from
mytab", dbname="mydb")]. Though I don't expect anyone to download it, the
file I'm using is ss09pusa.csv from
http://www2.census.gov/acs2009_1yr/pums/csv_pus.zip. I tested both sets of
code on my work desktop and personal laptop, so it's not machine-specific
(although it might be Windows- or 64-bit-specific).

Do you have any other ideas as to how I might diagnose what's going on here?
Or, alternatively, is there some workaround that would get this giant CSV
into a database? If you think there's a reasonable way to use the IMPORT
command with RSQLite, that seems like it would import the fastest, but I
don't know that it's compatible with DBI on Windows.

Thanks again!
Anthony

> read.csv.sql("R:\\American Community Survey\\Data\\2009\\ss09pusa.csv",
    sql = "create table mytab as select * from file", dbname = "mydb")
NULL
Warning message:
closing unused connection 3 (R:\American Community
Survey\Data\2009\ss09pusa.csv)
>
> # how many records are in table?
> N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
>
> # read in chunks and display what we have read
>
> k <- 4 # no of records to read at once
> for(i in seq(0, N-1, k)) {
+   s <- sprintf("select * from mytab limit %d, %d", i, k)
+   print(sqldf(s, dbname = "mydb"))
+ }
Error in seq.default(0, N - 1, k) : wrong sign in 'by' argument
> N
[1] 0

On Mon, Nov 15, 2010 at 12:24 PM, Gabor Grothendieck
<ggrothendi...@gmail.com> wrote:

> On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico wrote:
> > Hi Gabor,
> >
> > Thank you for the prompt reply. I definitely looked over all of the
> > examples on the code.google.com sqldf page before sending, which is why
> > I wrote the code
> >
> > read.csv.sql("ss09pusa.csv", sql="create table ss09pusa as select * from
> > file", dbname="sqlite")
> >
> > directly pulled from their code -- read.csv.sql("~/tmp.csv", sql =
> > "create table mytab as select * from file", dbname = "mydb")
> >
> > ..but I don't understand why this helps me around the memory problem,
> > since I think it still all gets read into memory. Is there a way to do
> > this line by line?
>
> OK. Maybe it's something else.
>
> The reading in of the file into the database should not be a resource
> problem provided you have enough disk space and appropriate permissions.
> sqldf / RSQLite are used to get sqlite to do it so that the data never
> goes through R at that stage, so R limitations can't affect the reading in
> to the sqlite database. When you read it from the sqlite database then R
> limitations come into effect, so you just have to be sure not to read too
> much in at a time. The use of create table ... as select ... is to prevent
> sqldf from deleting the table, since sqldf is normally used in a fashion
> where you don't want to know about the back end databases, so it tries to
> create them and delete them behind the scenes, but here you want to
> explicitly use them so you have to work around that.
>
> Try this example. It should be reproducible so you just have to copy it
> and paste it into your R session. Uncomment the indicated line if you want
> to be able to remove any pre-existing mydb file in the current directory.
> Try it in a fresh R session just to be sure that nothing mucks it up.
>
> library(sqldf)
>
> # uncomment next line to make sure we are starting clean
> # if (file.exists("mydb")) file.remove("mydb")
>
> # create new database
> sqldf("attach 'mydb' as new")
>
> # create a new file. BOD is built into R and has 6 rows.
> write.table(BOD, file = "tmp.csv", quote = FALSE, sep = ",")
>
> # read new file into database
> read.csv.sql("tmp.csv", sql = "create table mytab as select * from file",
>              dbname = "mydb")
>
> # how many records are in table?
> N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
>
> # read in chunks and display what we have read
>
> k <- 4 # no of records to read at once
> for(i in seq(0, N-1, k)) {
>    s <- sprintf("select * from mytab limit %d, %d", i, k)
>    print(sqldf(s, dbname = "mydb"))
> }
>
> On my machine I get this output:
>
>   Time demand
> 1    1    8.3
> 2    2   10.3
> 3    3   19.0
> 4    4   16.0
>   Time demand
> 1    5   15.6
> 2    7   19.8
>
> showing that it read the 6 line BOD data frame in chunks of 4 as required.
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
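A small sketch of a guard for the chunked-read loop above: seq(0, N - 1, k)
is what produces the "wrong sign in 'by' argument" error when the table is
empty (N = 0), so checking N first gives a clearer failure.

N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
if (N == 0) stop("table 'mytab' is empty: the import did not load any rows")

k <- 4  # number of records to read at once
for (i in seq(0, N - 1, k)) {
    s <- sprintf("select * from mytab limit %d, %d", i, k)
    print(sqldf(s, dbname = "mydb"))
}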
Re: [R] How to Read a Large CSV into a Database with R
On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico wrote:
> Hi Gabor,
>
> Thank you for the prompt reply. I definitely looked over all of the
> examples on the code.google.com sqldf page before sending, which is why I
> wrote the code
>
> read.csv.sql("ss09pusa.csv", sql="create table ss09pusa as select * from
> file", dbname="sqlite")
>
> directly pulled from their code -- read.csv.sql("~/tmp.csv", sql = "create
> table mytab as select * from file", dbname = "mydb")
>
> ..but I don't understand why this helps me around the memory problem,
> since I think it still all gets read into memory. Is there a way to do
> this line by line?

OK. Maybe it's something else.

The reading in of the file into the database should not be a resource
problem provided you have enough disk space and appropriate permissions.
sqldf / RSQLite are used to get sqlite to do it so that the data never goes
through R at that stage, so R limitations can't affect the reading in to the
sqlite database. When you read it from the sqlite database then R
limitations come into effect, so you just have to be sure not to read too
much in at a time. The use of create table ... as select ... is to prevent
sqldf from deleting the table, since sqldf is normally used in a fashion
where you don't want to know about the back end databases, so it tries to
create them and delete them behind the scenes, but here you want to
explicitly use them so you have to work around that.

Try this example. It should be reproducible so you just have to copy it and
paste it into your R session. Uncomment the indicated line if you want to be
able to remove any pre-existing mydb file in the current directory. Try it
in a fresh R session just to be sure that nothing mucks it up.

library(sqldf)

# uncomment next line to make sure we are starting clean
# if (file.exists("mydb")) file.remove("mydb")

# create new database
sqldf("attach 'mydb' as new")

# create a new file. BOD is built into R and has 6 rows.
write.table(BOD, file = "tmp.csv", quote = FALSE, sep = ",")

# read new file into database
read.csv.sql("tmp.csv", sql = "create table mytab as select * from file",
             dbname = "mydb")

# how many records are in table?
N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]

# read in chunks and display what we have read

k <- 4 # no of records to read at once
for(i in seq(0, N-1, k)) {
   s <- sprintf("select * from mytab limit %d, %d", i, k)
   print(sqldf(s, dbname = "mydb"))
}

On my machine I get this output:

  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
  Time demand
1    5   15.6
2    7   19.8

showing that it read the 6 line BOD data frame in chunks of 4 as required.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
Re: [R] How to Read a Large CSV into a Database with R
Hi Gabor,

Thank you for the prompt reply. I definitely looked over all of the examples
on the code.google.com sqldf page before sending, which is why I wrote the
code

read.csv.sql("ss09pusa.csv", sql="create table ss09pusa as select * from file", dbname="sqlite")

directly pulled from their code -- read.csv.sql("~/tmp.csv", sql = "create
table mytab as select * from file", dbname = "mydb")

..but I don't understand why this helps me around the memory problem, since
I think it still all gets read into memory. Is there a way to do this line
by line?

I would prefer to use SQLite rather than sqldf, but I could not get the
IMPORT command (or .IMPORT) functioning at all. I tried these with both
dbGetQuery and dbSendQuery.

library(RSQLite)
setwd("R:\\American Community Survey\\Data\\2009")
out_db <- dbConnect(SQLite(), dbname="sqlite.db")
dbGetQuery(out_db, "create table test (hello integer, world text)")
dbGetQuery(out_db, "mode csv")
dbGetQuery(out_db, "import test.csv test")

When I hit the mode and import commands, it gives me an error that makes me
think it's handling these files in a completely different way.

> dbGetQuery(out_db, "mode csv")
Error in sqliteExecStatement(con, statement, bind.data) :
  RS-DBI driver: (error in statement: near "mode": syntax error)

I suppose I could just run sqlite3 commands from the system() function, but
I was hoping there might be a way to accomplish this task entirely within R?

Thanks again!

On Mon, Nov 15, 2010 at 10:41 AM, Gabor Grothendieck
<ggrothendi...@gmail.com> wrote:

> On Mon, Nov 15, 2010 at 10:07 AM, Anthony Damico wrote:
> > Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying
> > to insert a very large CSV file into a SQLite database. I'm pretty new
> > to working with databases in R, so I apologize if I'm overlooking
> > something obvious here.
> >
> > I'm trying to work with the American Community Survey data, which is two
> > 1.3GB csv files. I have enough RAM to read one of them into memory, but
> > not both at the same time. So, in order to analyze them, I'm trying to
> > get them into a SQLite database so I can use the R survey package's
> > database-backed survey objects capabilities (
> > http://faculty.washington.edu/tlumley/survey/svy-dbi.html).
> >
> > I need to combine both of these CSV files into one table (within a
> > database), so I think that I'd need a SQL manipulation technique that
> > reads everything line by line, instead of pulling it all into memory.
> >
> > I've tried using read.csv.sql, but it finishes without an error and then
> > only shows me the table structure when I run the final select statement.
> > When I run these exact same commands on a smaller CSV file, they work
> > fine. I imagine this is not working because the csv is so large, but I'm
> > not sure how to confirm that or what to change if it is. I do want to
> > get all columns from the CSV into the data table, so I don't want to
> > filter anything.
> >
> > library(sqldf)
> > setwd("R:\\American Community Survey\\Data\\2009")
> > sqldf("attach 'sqlite' as new")
> > read.csv.sql("ss09pusa.csv", sql="create table ss09pusa as select * from file", dbname="sqlite")
> > sqldf("select * from ss09pusa limit 3", dbname="sqlite")
>
> What the above code does, which is unlikely to be what you intended, is to
> create an sqlite database called 'sqlite', read the indicated file into
> sqlite, read it from sqlite into R (clearly this step will fail if the
> data is too big for R, but if it's not then you are OK) and then delete
> the table from the database. So your sqldf statement should give an error
> since there is no such table; or else, if you have a data frame in your R
> workspace called ss09pusa, the sqldf statement will load that into a
> database table, retrieve its first three rows and then delete the table.
>
> This sort of task is probably more suitable for RSQLite than sqldf, but if
> you wish to do it with sqldf you need to follow example 9 or example 10 on
> the sqldf home page.
>
> In example 9,
>
> http://code.google.com/p/sqldf/#Example_9.__Working_with_Databases
>
> it is very important to note that sqldf automatically deletes any table
> that it created once the sqldf or read.csv.sql statement is done, so the
> way to keep the table from being dropped is to issue an sql statement that
> creates the table yourself ("create table mytab as select ...") rather
> than letting sqldf create it.
>
> In example 10,
>
> http://code.google.com/p/sqldf/#Example_10._Persistent_Connections
>
> persistent connections are illustrated, which represent an alternate way
> to do this in sqldf.
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
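On the last point above: "mode csv" and "import" fail through dbGetQuery()
because .mode and .import are dot-commands of the sqlite3 command-line
shell, not SQL statements, so the driver reports a syntax error. A sketch of
the system()/shell() route, assuming a sqlite3 executable is available on
the PATH (shell() on Windows, system() elsewhere):

writeLines(c(".mode csv", ".import test.csv test"), "import.sql")
shell("sqlite3 sqlite.db < import.sql")
# note: if test.csv starts with a header row, .import will load that row as
# data into the already-created table, so it may need to be removed first

Gabor's reply earlier in this thread builds the same kind of script
programmatically from the CSV's header line.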
Re: [R] How to Read a Large CSV into a Database with R
On Mon, Nov 15, 2010 at 10:07 AM, Anthony Damico wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying
> to insert a very large CSV file into a SQLite database. I'm pretty new to
> working with databases in R, so I apologize if I'm overlooking something
> obvious here.
>
> I'm trying to work with the American Community Survey data, which is two
> 1.3GB csv files. I have enough RAM to read one of them into memory, but
> not both at the same time. So, in order to analyze them, I'm trying to get
> them into a SQLite database so I can use the R survey package's
> database-backed survey objects capabilities (
> http://faculty.washington.edu/tlumley/survey/svy-dbi.html).
>
> I need to combine both of these CSV files into one table (within a
> database), so I think that I'd need a SQL manipulation technique that
> reads everything line by line, instead of pulling it all into memory.
>
> I've tried using read.csv.sql, but it finishes without an error and then
> only shows me the table structure when I run the final select statement.
> When I run these exact same commands on a smaller CSV file, they work
> fine. I imagine this is not working because the csv is so large, but I'm
> not sure how to confirm that or what to change if it is. I do want to get
> all columns from the CSV into the data table, so I don't want to filter
> anything.
>
> library(sqldf)
> setwd("R:\\American Community Survey\\Data\\2009")
> sqldf("attach 'sqlite' as new")
> read.csv.sql("ss09pusa.csv", sql="create table ss09pusa as select * from file", dbname="sqlite")
> sqldf("select * from ss09pusa limit 3", dbname="sqlite")

What the above code does, which is unlikely to be what you intended, is to
create an sqlite database called 'sqlite', read the indicated file into
sqlite, read it from sqlite into R (clearly this step will fail if the data
is too big for R, but if it's not then you are OK) and then delete the table
from the database. So your sqldf statement should give an error since there
is no such table; or else, if you have a data frame in your R workspace
called ss09pusa, the sqldf statement will load that into a database table,
retrieve its first three rows and then delete the table.

This sort of task is probably more suitable for RSQLite than sqldf, but if
you wish to do it with sqldf you need to follow example 9 or example 10 on
the sqldf home page.

In example 9,

http://code.google.com/p/sqldf/#Example_9.__Working_with_Databases

it is very important to note that sqldf automatically deletes any table that
it created once the sqldf or read.csv.sql statement is done, so the way to
keep the table from being dropped is to issue an sql statement that creates
the table yourself ("create table mytab as select ...") rather than letting
sqldf create it.

In example 10,

http://code.google.com/p/sqldf/#Example_10._Persistent_Connections

persistent connections are illustrated, which represent an alternate way to
do this in sqldf.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
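A rough sketch of the persistent-connection alternative (Example 10), as
recalled from the sqldf page linked above; the exact calling convention
should be checked against that page. The idea is that an argument-free
sqldf() call opens a connection that later sqldf/read.csv.sql calls reuse,
so tables are not created and destroyed per call, and a second argument-free
call closes it.

library(sqldf)

sqldf()    # open a persistent connection for subsequent calls to reuse
read.csv.sql("tmp.csv", sql = "create table mytab as select * from file")
sqldf("select count(*) from mytab")   # the table is still there
sqldf()    # close the connection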
Re: [R] How to Read a Large CSV into a Database with R
Anthony-107 wrote:
>
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying
> to insert a very large CSV file into a SQLite database.
>

Better use an external utility if this is a one-time import for this job:

http://sqlitebrowser.sourceforge.net/

Dieter
[R] How to Read a Large CSV into a Database with R
Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying to
insert a very large CSV file into a SQLite database. I'm pretty new to
working with databases in R, so I apologize if I'm overlooking something
obvious here.

I'm trying to work with the American Community Survey data, which is two
1.3GB csv files. I have enough RAM to read one of them into memory, but not
both at the same time. So, in order to analyze them, I'm trying to get them
into a SQLite database so I can use the R survey package's database-backed
survey objects capabilities
(http://faculty.washington.edu/tlumley/survey/svy-dbi.html).

I need to combine both of these CSV files into one table (within a
database), so I think that I'd need a SQL manipulation technique that reads
everything line by line, instead of pulling it all into memory.

I've tried using read.csv.sql, but it finishes without an error and then
only shows me the table structure when I run the final select statement.
When I run these exact same commands on a smaller CSV file, they work fine.
I imagine this is not working because the csv is so large, but I'm not sure
how to confirm that or what to change if it is. I do want to get all columns
from the CSV into the data table, so I don't want to filter anything.

library(sqldf)
setwd("R:\\American Community Survey\\Data\\2009")
sqldf("attach 'sqlite' as new")
read.csv.sql("ss09pusa.csv", sql="create table ss09pusa as select * from file", dbname="sqlite")
sqldf("select * from ss09pusa limit 3", dbname="sqlite")

I've also tried using the SQL IMPORT command, which I couldn't get working
properly, even on a tiny two-field, five-row CSV file.

library(RSQLite)
setwd("R:\\American Community Survey\\Data\\2009")
in_csv <- file("test.csv")
out_db <- dbConnect(SQLite(), dbname="sqlite.db")
dbGetQuery(out_db, "create table test (hello integer, world text)")
dbGetQuery(out_db, "import in_csv test")

Any advice would be sincerely appreciated. Thanks!

Anthony Damico
Kaiser Family Foundation
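For context on the end goal mentioned above, here is a rough sketch of what
a database-backed design object might look like once the combined table
exists. The design variables below (the PWGTP weight and AGEP age variable)
are placeholders for illustration; the actual ACS design specification and
the exact svydesign() arguments should be checked against the survey package
documentation linked above.

library(survey)
library(RSQLite)

acs.design <- svydesign(ids = ~1, weights = ~PWGTP,
                        data = "acs2009",            # table name, not a data frame
                        dbtype = "SQLite", dbname = "sqlite.db")
svymean(~AGEP, acs.design)   # columns are pulled from the database as needed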