Re: [R] How to Read a Large CSV into a Database with R

2010-11-16 Thread Abhijit Dasgupta
Thanks
On 11/16/2010 12:41 PM, Seth Falcon wrote:
> Hi Abhijit,
>
> [I've cc'd R-help to keep the discussion on the list]
>
> On Tue, Nov 16, 2010 at 8:06 AM, Abhijit Dasgupta
>   wrote:
>
>> Seth,
>>
>> I was looking for something like this too. I've a question. If
>> you're reading the data from a connection, does R start reading the
>> next chunk of data right after the previous chunk, or do we need to
>> keep track of things using "skip"
>>  
> The purpose of using a file connection is to allow R to keep its place
> in the file as it reads and not have to re-read or skip.  This is
> considerably more efficient.
>
>
>
>




Re: [R] How to Read a Large CSV into a Database with R

2010-11-16 Thread Seth Falcon
Hi Abhijit,

[I've cc'd R-help to keep the discussion on the list]

On Tue, Nov 16, 2010 at 8:06 AM, Abhijit Dasgupta
 wrote:
> Seth,
>
> I was looking for something like this too. I've a question. If
> you're reading the data from a connection, does R start reading the
> next chunk of data right after the previous chunk, or do we need to
> keep track of things using "skip"

The purpose of using a file connection is to allow R to keep its place
in the file as it reads and not have to re-read or skip.  This is
considerably more efficient.
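A minimal sketch of that behaviour, using a small throwaway file rather than
the census data (the file name is just for illustration):

## a file connection remembers its position, so successive read.table()
## calls continue where the previous one stopped; no "skip" bookkeeping
write.csv(data.frame(x = 1:10), "tmp10.csv", row.names = FALSE)

con <- file("tmp10.csv", "r")
readLines(con, n = 1)                            # consume the header line
chunk1 <- read.table(con, nrows = 4, sep = ",")  # rows 1-4
chunk2 <- read.table(con, nrows = 4, sep = ",")  # rows 5-8, no skip needed
close(con)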



-- 
Seth Falcon | @sfalcon | http://userprimary.net/



Re: [R] How to Read a Large CSV into a Database with R

2010-11-15 Thread Seth Falcon
Hi Anthony,

On Mon, Nov 15, 2010 at 7:07 AM, Anthony Damico  wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32.  I'm trying to
> insert a very large CSV file into a SQLite database.  I'm pretty new to
> working with databases in R, so I apologize if I'm overlooking something
> obvious here.

Working with large data and doing operations in bounded memory tends
not to be among the most obvious things to accomplish in R for
newcomers.

> I need to combine both of these CSV files into one table (within a
> database), so I think that I'd need a SQL manipulation technique that reads
> everything line by line, instead of pulling it all into memory.

Yes, that sounds like the right approach.  Below is some sample code
that does this and avoids reading the entire data set into memory.

The key points are to use read.table on a file connection and to read
in a batch of lines at a time.  Each batch is inserted into the db
using a prepared query.  The example is complicated a bit because
read.table on a file connection raises an error if no lines are
available, so I chose to use tryCatch to handle that.  A cleaner
approach might be to check the number of rows read and break out of
the while loop if the count is less than the batch size; a sketch of
that variant follows the example below.

Anyhow, see if this approach works for you.


library("RSQLite")

file_list <- c("ss09pusa.csv", "ss09pusb.csv")
input <- file(file_list[1], "r")
db <- dbConnect(SQLite(), dbname="example.sqlite")

header <- readLines(input, n = 1)
fields <- strsplit(header, ",")[[1]]
colTypes <- rep("TEXT", length(fields))
colDecl <- paste(fields, colTypes)
sql <- sprintf("CREATE TABLE ss09 (%s)", paste(colDecl, collapse = ", "))
dbGetQuery(db, sql)

colClasses <- rep("character", length(fields))
sql.in <- sprintf("INSERT INTO ss09 VALUES (%s)",
  paste(rep("?", length(fields)), collapse = ","))
chunk_size <- 25
dbBeginTransaction(db)
tryCatch({
    ## keep reading chunk_size rows from the open connection and inserting
    ## them with the prepared query; read.table errors once no lines remain
    while (TRUE) {
        part <- read.table(input, nrows = chunk_size, sep = ",",
                           colClasses = colClasses,
                           comment.char = "")
        dbGetPreparedQuery(db, sql.in, bind.data = part)
    }
}, error = function(e) {
    ## "no lines available" just means end of file; anything else is re-raised
    if (grepl("no lines available", conditionMessage(e)))
        TRUE
    else
        stop(conditionMessage(e))
})
dbCommit(db)
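
A sketch of the "break out of the loop" variant mentioned above: it would
replace everything from dbBeginTransaction(db) through dbCommit(db), reusing
the input, db, sql.in, colClasses and chunk_size objects already defined.
It is untested and assumes no fields contain embedded newlines, so that
readLines() splits records correctly.

## readLines() returns a zero-length vector at end of file, so the loop
## can simply stop on a short read instead of relying on tryCatch
dbBeginTransaction(db)
repeat {
    lines <- readLines(input, n = chunk_size)
    if (length(lines) == 0) break                 # end of file
    con <- textConnection(lines)
    part <- read.table(con, sep = ",", colClasses = colClasses,
                       comment.char = "")
    close(con)
    dbGetPreparedQuery(db, sql.in, bind.data = part)
    if (length(lines) < chunk_size) break         # short read: last chunk
}
dbCommit(db)
close(input)

The same loop, pointed at a connection opened on file_list[2] (after skipping
its header line), should append the second file into the same table.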



-- 
Seth Falcon | @sfalcon | http://userprimary.net/



Re: [R] How to Read a Large CSV into a Database with R

2010-11-15 Thread Gabor Grothendieck
On Mon, Nov 15, 2010 at 1:28 PM, Anthony Damico  wrote:
> Hi Gabor,
>
> Thank you for your willingness to help me through this.  The code you sent
> works on my machine exactly the same way as it does on yours.
> Unfortunately, when I run the same code on the 1.3GB file, it creates the
> table structure but doesn't read in a single line [confirmed with
> sqldf("select * from mytab",dbname="mydb")]  Though I don't expect anyone to
> download it, the file I'm using is ss09pusa.csv from
> http://www2.census.gov/acs2009_1yr/pums/csv_pus.zip.  I tested both sets of
> code on my work desktop and personal laptop, so it's not machine-specific
> (although it might be Windows- or 64 bit-specific).
>
> Do you have any other ideas as to how I might diagnose what's going on
> here?  Or, alternatively, is there some workaround that would get this giant
> CSV into a database?  If you think there's a reasonable way to use the
> IMPORT command with RSQLite, that seems like it would import the fastest,
> but I don't know that it's compatible with DBI on Windows.
>
> Thanks again!
> Anthony
>
>

I tried it and the sqlite driver issued an error.   Seems like a
driver or sqlite issue.  Here are some other approaches:

1. Create an SQLite SQL script using the R code below, and then run it
with the sqlite3 command-line processor via shell().

fn <- "ss09pusa.csv"

# read first line
DF <- read.csv(fn, nrows = 1)

# create statement
s <- sprintf("
create table %s (%s);
.separator ,
.import %s %s
", "mytab", toString(names(DF)), fn, "mytab")

# output sql file
fn.out <- paste(fn, "sql", sep = ".")
cat(s, file = fn.out)

# run import using generated sql file
# Use sqlite3 command line processor.
shell(paste("sqlite3 mydb <", fn.out))

2. Try sqldf with the h2 database rather than sqlite.  See H2 section
on sqldf home page.

library(sqldf)
library(RH2)

sqldf("create table mytab as select * from CSVREAD('ss09pusa.csv')",
dbname = "mydb2")


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] How to Read a Large CSV into a Database with R

2010-11-15 Thread Lee Hachadoorian
On Mon, 15 Nov 2010 13:28:40 -0500, Anthony Damico wrote:

> Do you have any other ideas as to how I might diagnose what's going on
> here?  Or, alternatively, is there some workaround that would get this giant
> CSV into a database?  If you think there's a reasonable way to use the
> IMPORT command with RSQLite, that seems like it would import the fastest,
> but I don't know that it's compatible with DBI on Windows.
>
> Thanks again!
> Anthony

If you are able to successfully read in one entire 1.2 GB file at a time,
I would skip sqldf and do this (after setting the working directory
appropriately):

connSQLite = dbConnect(dbDriver("SQLite"),
  dbname = "acs")

ss09pusa = read.csv("ss09pusa.csv", header = TRUE)
dbWriteTable(connSQLite, "acs2009", ss09pusa)
rm(ss09pusa)

ss09pusb = read.csv("ss09pusb.csv", header = TRUE)
dbWriteTable(connSQLite, "acs2009", ss09pusb, append = TRUE)
rm(ss09pusb)

#Test select of specific columns
sql = "SELECT ST, PUMA, ADJINC FROM acs2009 WHERE ST = 33"
dfIncome = dbGetQuery(connSQLite, sql)

I was *not* able to load one entire table at a time, so I made it work by
combining read.csv.sql from sqldf to read the file in chunks and dbWriteTable
from RSQLite to write each chunk to the database. The read would then look
like, for example:

ss09pusa = read.csv.sql("ss09pusa.csv",
  sql = paste("SELECT * FROM file WHERE ST =", i))

where i is an iteration over the state FIPS codes. (You could just use 1
to 56, even though there are a few missing numbers in there. The searches
for nonexistent records will take time but will otherwise be harmless.)
The dbWriteTable would be the same, with every write after the first one
using append = TRUE, roughly as sketched below.
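
A rough, untested sketch of that loop, using the connection and table names
from earlier in this message:

## read one state at a time via read.csv.sql, then append each chunk to the
## SQLite table with dbWriteTable; the CSV is rescanned on every pass, but
## only one state's rows sit in R memory at a time
library(sqldf)
library(RSQLite)

connSQLite <- dbConnect(SQLite(), dbname = "acs")

for (i in 1:56) {                          # state FIPS codes (a few gaps)
    chunk <- read.csv.sql("ss09pusa.csv",
                          sql = paste("SELECT * FROM file WHERE ST =", i))
    if (nrow(chunk) == 0) next             # nonexistent FIPS code, skip
    dbWriteTable(connSQLite, "acs2009", chunk,
                 append = dbExistsTable(connSQLite, "acs2009"))
}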

Also, I assume you will want to do the read from csv / write to SQLite
*once*, then maintain the SQLite database for other sessions. We also use
large census and IPUMS data sets, and we keep everything in a Postgres
backend, which we access with RPostgreSQL. Much easier to keep
everything organized in an RDBMS than to pass
around these monstrous csv files.



Re: [R] How to Read a Large CSV into a Database with R

2010-11-15 Thread Mike Marchywka
> From: ajdam...@gmail.com
> Date: Mon, 15 Nov 2010 13:28:40 -0500
> To: ggrothendi...@gmail.com; r-help@r-project.org
> Subject: Re: [R] How to Read a Large CSV into a Database with R
>
> Hi Gabor,
>
> Thank you for your willingness to help me through this. The code you sent
> works on my machine exactly the same way as it does on yours.
> Unfortunately, when I run the same code on the 1.3GB file, it creates the
> table structure but doesn't read in a single line [confirmed with
> sqldf("select * from mytab",dbname="mydb")] Though I don't expect anyone to
> download it, the file I'm using is ss09pusa.csv from
> http://www2.census.gov/acs2009_1yr/pums/csv_pus.zip. I tested both sets of
> code on my work desktop and personal laptop, so it's not machine-specific
> (although it might be Windows- or 64 bit-specific).
>
> Do you have any other ideas as to how I might diagnose what's going on
> here? Or, alternatively, is there some workaround that would get this giant
> CSV into a database? If you think there's a reasonable way to use the
> IMPORT command with RSQLite, that seems like it would import the fastest,



I think someone else suggested external approaches, and indeed I have
loaded census TIGER files into a db for making maps and mobile apps etc.
I would mention again that this may eliminate a memory limit and let you
limp along, but presumably you want a streaming source or something
if your analysis has predictable access patterns and this data will not
be used as part of a hotel reservation system.

Structured input data (I think TIGER was line oriented) should be easy
to load into a db with a bash script or Java app.


>
> Thanks again!
> Anthony
>
>
> > read.csv.sql("R:\\American Community Survey\\Data\\2009\\ss09pusa.csv",
> sql = "create table mytab as select * from file", dbname = "mydb")
> NULL
> Warning message:
> closing unused connection 3 (R:\American Community
> Survey\Data\2009\ss09pusa.csv)
> >
> > # how many records are in table?
> > N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
> >
> > # read in chunks and display what we have read
> >
> > k <- 4 # no of records to read at once
> > for(i in seq(0, N-1, k)) {
> + s <- sprintf("select * from mytab limit %d, %d", i, k)
> + print(sqldf(s, dbname = "mydb"))
> + }
> Error in seq.default(0, N - 1, k) : wrong sign in 'by' argument
> > N
> [1] 0
>
>
>
>
> On Mon, Nov 15, 2010 at 12:24 PM, Gabor Grothendieck <
> ggrothendi...@gmail.com> wrote:
>
> > On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico 
> > wrote:
> > > Hi Gabor,
> > >
> > > Thank you for the prompt reply. I definitely looked over all of the
> > > examples on the code.google.com sqldf page before sending, which is why
> > I
> > > wrote the code
> > >
> > > read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select * from
> > > file" , dbname="sqlite")
> > >
> > > directly pulled from their code -- read.csv.sql("~/tmp.csv", sql =
> > "create
> > > table mytab as select * from file",dbname = "mydb")
> > >
> > > ..but I don't understand why this helps me around the memory problem,
> > since
> > > I think it still all gets read into memory. Is there a way to do this
> > line
> > > by line?
> >
> >
> > OK. Maybe its something else.
> >
> > The reading in of the file into the database should not be a resource
> > problem provided you have enough disk space and appropriate
> > permissions. sqldf / RSQLite are used to get sqlite to do it so that
> > the data never goes through R at that stage so R limitations can't
> > affect the reading in to the sqlite database. When you read it from
> > the sqlite database then R limitations come into effect so you just
> > have to be sure not to read too much in at a time. The use of create
> > table ... as select ... is to prevent sqldf from deleting the table
> > since sqldf is normally used in a fashion where you don't want to know
> > about the back end databases so it tries to create them and delete
> > them behind the scenes but here you want to explicitly use them so
> > you have to work around that.
> >
> > Try this example. It should be reproducible so you just have to copy
> > it and paste it into your R session. Uncomment the indicated line if
> > you want to be able to remove any pre-existing mydb file in the
> > current directory. Try

Re: [R] How to Read a Large CSV into a Database with R

2010-11-15 Thread Anthony Damico
Hi Gabor,

Thank you for your willingness to help me through this.  The code you sent
works on my machine exactly the same way as it does on yours.
Unfortunately, when I run the same code on the 1.3GB file, it creates the
table structure but doesn't read in a single line [confirmed with
sqldf("select * from mytab",dbname="mydb")]  Though I don't expect anyone to
download it, the file I'm using is ss09pusa.csv from
http://www2.census.gov/acs2009_1yr/pums/csv_pus.zip.  I tested both sets of
code on my work desktop and personal laptop, so it's not machine-specific
(although it might be Windows- or 64 bit-specific).

Do you have any other ideas as to how I might diagnose what's going on
here?  Or, alternatively, is there some workaround that would get this giant
CSV into a database?  If you think there's a reasonable way to use the
IMPORT command with RSQLite, that seems like it would import the fastest,
but I don't know that it's compatible with DBI on Windows.

Thanks again!
Anthony


> read.csv.sql("R:\\American Community Survey\\Data\\2009\\ss09pusa.csv",
sql = "create table mytab as select * from file", dbname = "mydb")
NULL
Warning message:
closing unused connection 3 (R:\American Community
Survey\Data\2009\ss09pusa.csv)
>
> # how many records are in table?
> N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
>
> # read in chunks and display what we have read
>
> k <- 4 # no of records to read at once
> for(i in seq(0, N-1, k)) {
+ s <- sprintf("select * from mytab limit %d, %d", i, k)
+ print(sqldf(s, dbname = "mydb"))
+ }
Error in seq.default(0, N - 1, k) : wrong sign in 'by' argument
> N
[1] 0




On Mon, Nov 15, 2010 at 12:24 PM, Gabor Grothendieck <
ggrothendi...@gmail.com> wrote:

> On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico 
> wrote:
> > Hi Gabor,
> >
> > Thank you for the prompt reply.  I definitely looked over all of the
> > examples on the code.google.com sqldf page before sending, which is why
> I
> > wrote the code
> >
> > read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select * from
> > file" , dbname="sqlite")
> >
> > directly pulled from their code -- read.csv.sql("~/tmp.csv", sql =
> "create
> > table mytab as select * from file",dbname = "mydb")
> >
> > ..but I don't understand why this helps me around the memory problem,
> since
> > I think it still all gets read into memory.  Is there a way to do this
> line
> > by line?
>
>
> OK. Maybe its something else.
>
> The reading in of the file into the database should not be a resource
> problem provided you have enough disk space and appropriate
> permissions. sqldf / RSQLite are used to get sqlite to do it so that
> the data never goes through R at that stage so R limitations can't
> affect the reading in to the sqlite database.   When you read it from
> the sqlite database then R limitations come into effect so you just
> have to be sure not to read too much in at a time.  The use of create
> table ... as select ... is to prevent sqldf from deleting the table
> since sqldf is normally used in a fashion where you don't want to know
> about the back end databases so it tries to create them and delete
> them behind the scenes but here you want to explicitly use them so
> you have to work around that.
>
> Try this example. It should be reproducible so you just have to copy
> it and paste it into your R session.  Uncomment the indicated line if
> you want to be able to remove any pre-existing mydb file in the
> current directory.  Try it in a fresh R session just to be sure that
> nothing mucks it up.
>
> library(sqldf)
>
> # uncomment next line to make sure we are starting clean
> # if (file.exists("mydb")) file.remove("mydb")
>
> # create new database
> sqldf("attach 'mydb' as new")
>
> # create a new file.  BOD is built into R and has 6 rows.
> write.table(BOD, file = "tmp.csv", quote = FALSE, sep = ",")
>
> # read new file into database
> read.csv.sql("tmp.csv", sql = "create table mytab as select * from file",
>dbname = "mydb")
>
> # how many records are in table?
> N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
>
> # read in chunks and display what we have read
>
> k <- 4 # no of records to read at once
> for(i in seq(0, N-1, k)) {
>s <- sprintf("select * from mytab limit %d, %d", i, k)
>print(sqldf(s, dbname = "mydb"))
> }
>
> On my machine I get this output:
>
>   Time demand
> 1    1    8.3
> 2    2   10.3
> 3    3   19.0
> 4    4   16.0
>   Time demand
> 1    5   15.6
> 2    7   19.8
>
> showing that it read the 6 line BOD data frame in chunks of 4 as required.
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>



Re: [R] How to Read a Large CSV into a Database with R

2010-11-15 Thread Gabor Grothendieck
On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico  wrote:
> Hi Gabor,
>
> Thank you for the prompt reply.  I definitely looked over all of the
> examples on the code.google.com sqldf page before sending, which is why I
> wrote the code
>
> read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select * from
> file" , dbname="sqlite")
>
> directly pulled from their code -- read.csv.sql("~/tmp.csv", sql = "create
> table mytab as select * from file",dbname = "mydb")
>
> ..but I don't understand why this helps me around the memory problem, since
> I think it still all gets read into memory.  Is there a way to do this line
> by line?


OK. Maybe it's something else.

Reading the file into the database should not be a resource problem
provided you have enough disk space and appropriate permissions.
sqldf / RSQLite get sqlite itself to do the import, so the data never
passes through R at that stage and R's limitations cannot affect the
load into the sqlite database.  When you later read from the sqlite
database, R's limitations do come into play, so you just have to be
sure not to read too much in at a time.  The "create table ... as
select ..." form is there to keep sqldf from deleting the table: sqldf
is normally used in a fashion where you don't want to know about the
back-end database, so it creates and deletes tables behind the scenes,
but here you want to use them explicitly, so you have to work around
that.

Try this example. It should be reproducible so you just have to copy
it and paste it into your R session.  Uncomment the indicated line if
you want to be able to remove any pre-existing mydb file in the
current directory.  Try it in a fresh R session just to be sure that
nothing mucks it up.

library(sqldf)

# uncomment next line to make sure we are starting clean
# if (file.exists("mydb")) file.remove("mydb")

# create new database
sqldf("attach 'mydb' as new")

# create a new file.  BOD is built into R and has 6 rows.
write.table(BOD, file = "tmp.csv", quote = FALSE, sep = ",")

# read new file into database
read.csv.sql("tmp.csv", sql = "create table mytab as select * from file",
dbname = "mydb")

# how many records are in table?
N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]

# read in chunks and display what we have read

k <- 4 # no of records to read at once
for(i in seq(0, N-1, k)) {
s <- sprintf("select * from mytab limit %d, %d", i, k)
print(sqldf(s, dbname = "mydb"))
}

On my machine I get this output:

  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
  Time demand
1    5   15.6
2    7   19.8

showing that it read the 6-row BOD data frame in chunks of 4, as required.
-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] How to Read a Large CSV into a Database with R

2010-11-15 Thread Anthony Damico
Hi Gabor,

Thank you for the prompt reply.  I definitely looked over all of the
examples on the code.google.com sqldf page before sending, which is why I
wrote the code

read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select * from
file" , dbname="sqlite")

directly pulled from their code -- read.csv.sql("~/tmp.csv", sql = "create
table mytab as select * from file",dbname = "mydb")

..but I don't understand why this helps me around the memory problem, since
I think it still all gets read into memory.  Is there a way to do this line
by line?

I would prefer to use RSQLite directly rather than sqldf, but I could not get
the IMPORT command (or .IMPORT) functioning at all.  I tried it with both
dbGetQuery and dbSendQuery.

library(RSQLite)
setwd("R:\\American Community Survey\\Data\\2009")
out_db <- dbConnect(SQLite(), dbname="sqlite.db")
dbGetQuery(out_db , "create table test (hello integer, world text)")
dbGetQuery(out_db , "mode csv")
dbGetQuery(out_db , "import test.csv test")

When I hit the mode and import commands, it gives me an error that makes me
think it's handling these files in a completely different way.

> dbGetQuery(out_db , "mode csv")
Error in sqliteExecStatement(con, statement, bind.data) :
  RS-DBI driver: (error in statement: near "mode": syntax error)


I suppose I could just run sqlite3 commands from the system() function, but
I was hoping there might be a way to accomplish this task entirely within R?
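
For what it's worth, ".mode csv" and ".import" are dot-commands of the
sqlite3 command-line shell rather than SQL, which is why the driver rejects
them.  A sketch of the system()-style route, assuming sqlite3 is installed
and on the PATH and that the "test" table was created as above (the script
file name is just for illustration):

## write the dot-commands to a small script and feed it to the sqlite3 shell
cmds <- c(".mode csv",
          ".import test.csv test")
writeLines(cmds, "import_test.sql")
shell("sqlite3 sqlite.db < import_test.sql")   # use system() outside Windows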



Thanks again!



On Mon, Nov 15, 2010 at 10:41 AM, Gabor Grothendieck <
ggrothendi...@gmail.com> wrote:

> On Mon, Nov 15, 2010 at 10:07 AM, Anthony Damico 
> wrote:
> > Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32.  I'm trying
> to
> > insert a very large CSV file into a SQLite database.  I'm pretty new to
> > working with databases in R, so I apologize if I'm overlooking something
> > obvious here.
> >
> > I'm trying to work with the American Community Survey data, which is two
> > 1.3GB csv files.  I have enough RAM to read one of them into memory, but
> not
> > both at the same time.  So, in order to analyze them, I'm trying to get
> them
> > into a SQLite database so I can use the R survey package's
> database-backed
> > survey objects capabilities (
> > http://faculty.washington.edu/tlumley/survey/svy-dbi.html).
> >
> > I need to combine both of these CSV files into one table (within a
> > database), so I think that I'd need a SQL manipulation technique that
> reads
> > everything line by line, instead of pulling it all into memory.
> >
> > I've tried using read.csv.sql, but it finishes without an error and then
> > only shows me the table structure when I run the final select statement.
> > When I run these exact same commands on a smaller CSV file, they work
> fine.
> > I imagine this is not working because the csv is so large, but I'm not
> sure
> > how to confirm that or what to change if it is.  I do want to get all
> > columns from the CSV into the data table, so I don't want to filter
> > anything.
> >
> > library(sqldf)
> > setwd("R:\\American Community Survey\\Data\\2009")
> > sqldf("attach 'sqlite' as new")
> > read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select * from
> > file" , dbname="sqlite")
> > sqldf("select * from ss09pusa limit 3",dbname="sqlite")
> >
>
> What the above code does, which is unlikely to be what you intended,
> is to create an sqlite database called 'sqlite' and then read in the
> indicated file into sqlite, read it in into R from sqlite (clearly
> this step will fail if the data is too big for R but it its not then
> you are ok) and then delete the table from the database so your sqldf
> statement should give an error since there is no such table or else if
> you have a data frame in your R workspace called ss09pusa the sqldf
> statement will load that into a database table and the retrieve its
> first three rows and then delete the table.
>
> This sort of task is probably more suitable for RSQLite than sqldf
> but if you wish to do it with sqldf you need to follow example 9 or
> example 10 on the sqldf home page:
>
> In example 9,
>
> http://code.google.com/p/sqldf/#Example_9.__Working_with_Databases
>
> its very important to note that sqldf automatically deletes any table
> that it created after the sqldf or read.csv.sql statement is done so
> to not have the table dropped is to make sure you issue an sql
> statement that creates the table, "create table mytab as select ..."
> rather than sqldf.
>
> In example 10,
>
> http://code.google.com/p/sqldf/#Example_10._Persistent_Connections
>
> persistent connections are illustrated which represents an alternate
> way to do this in sqldf.
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>



Re: [R] How to Read a Large CSV into a Database with R

2010-11-15 Thread Gabor Grothendieck
On Mon, Nov 15, 2010 at 10:07 AM, Anthony Damico  wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32.  I'm trying to
> insert a very large CSV file into a SQLite database.  I'm pretty new to
> working with databases in R, so I apologize if I'm overlooking something
> obvious here.
>
> I'm trying to work with the American Community Survey data, which is two
> 1.3GB csv files.  I have enough RAM to read one of them into memory, but not
> both at the same time.  So, in order to analyze them, I'm trying to get them
> into a SQLite database so I can use the R survey package's database-backed
> survey objects capabilities (
> http://faculty.washington.edu/tlumley/survey/svy-dbi.html).
>
> I need to combine both of these CSV files into one table (within a
> database), so I think that I'd need a SQL manipulation technique that reads
> everything line by line, instead of pulling it all into memory.
>
> I've tried using read.csv.sql, but it finishes without an error and then
> only shows me the table structure when I run the final select statement.
> When I run these exact same commands on a smaller CSV file, they work fine.
> I imagine this is not working because the csv is so large, but I'm not sure
> how to confirm that or what to change if it is.  I do want to get all
> columns from the CSV into the data table, so I don't want to filter
> anything.
>
> library(sqldf)
> setwd("R:\\American Community Survey\\Data\\2009")
> sqldf("attach 'sqlite' as new")
> read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select * from
> file" , dbname="sqlite")
> sqldf("select * from ss09pusa limit 3",dbname="sqlite")
>

What the above code does, which is unlikely to be what you intended,
is to create an sqlite database called 'sqlite', read the indicated
file into sqlite, read it back into R from sqlite (clearly this step
will fail if the data is too big for R, but if it is not then you are
ok), and then delete the table from the database.  So your sqldf
statement should give an error since there is no such table; or else,
if you have a data frame in your R workspace called ss09pusa, the
sqldf statement will load that into a database table, retrieve its
first three rows, and then delete the table.

This sort of task is probably more suitable for RSQLite than sqldf
but if you wish to do it with sqldf you need to follow example 9 or
example 10 on the sqldf home page:

In example 9,

http://code.google.com/p/sqldf/#Example_9.__Working_with_Databases

it's very important to note that sqldf automatically deletes any table
it created once the sqldf or read.csv.sql statement is done, so the way
to keep the table from being dropped is to issue an sql statement that
itself creates the table ("create table mytab as select ...") rather
than letting sqldf create it.

In example 10,

http://code.google.com/p/sqldf/#Example_10._Persistent_Connections

persistent connections are illustrated, which represent an alternative
way to do this in sqldf; a rough equivalent using a plain RSQLite
connection is sketched below.
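
Not Example 10 itself, but a sketch of the same idea with an explicit
RSQLite connection held open across queries (the table and database names
follow the earlier example; the chunk size is arbitrary):

## hold one connection open, then pull rows back into R in bounded pieces
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "mydb")
N <- dbGetQuery(con, "select count(*) from mytab")[[1]]
k <- 10000                                     # rows per chunk (arbitrary)
if (N > 0) {                                   # seq() errors when N is 0
    for (i in seq(0, N - 1, by = k)) {
        chunk <- dbGetQuery(con,
            sprintf("select * from mytab limit %d offset %d", k, i))
        ## ... work with 'chunk' here ...
    }
}
dbDisconnect(con)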


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] How to Read a Large CSV into a Database with R

2010-11-15 Thread Dieter Menne


Anthony-107 wrote:
> 
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32.  I'm trying
> to
> insert a very large CSV file into a SQLite database. 
> 

Better to use an external utility if this is a one-time import job:

http://sqlitebrowser.sourceforge.net/


Dieter




[R] How to Read a Large CSV into a Database with R

2010-11-15 Thread Anthony Damico
Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32.  I'm trying to
insert a very large CSV file into a SQLite database.  I'm pretty new to
working with databases in R, so I apologize if I'm overlooking something
obvious here.

I'm trying to work with the American Community Survey data, which is two
1.3GB csv files.  I have enough RAM to read one of them into memory, but not
both at the same time.  So, in order to analyze them, I'm trying to get them
into a SQLite database so I can use the R survey package's database-backed
survey objects capabilities (
http://faculty.washington.edu/tlumley/survey/svy-dbi.html).

I need to combine both of these CSV files into one table (within a
database), so I think that I'd need a SQL manipulation technique that reads
everything line by line, instead of pulling it all into memory.

I've tried using read.csv.sql, but it finishes without an error and then
only shows me the table structure when I run the final select statement.
When I run these exact same commands on a smaller CSV file, they work fine.
I imagine this is not working because the csv is so large, but I'm not sure
how to confirm that or what to change if it is.  I do want to get all
columns from the CSV into the data table, so I don't want to filter
anything.

library(sqldf)
setwd("R:\\American Community Survey\\Data\\2009")
sqldf("attach 'sqlite' as new")
read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select * from
file" , dbname="sqlite")
sqldf("select * from ss09pusa limit 3",dbname="sqlite")


I've also tried using the SQL IMPORT command, which I couldn't get working
properly, even on a tiny two-field, five-row CSV file.

library(RSQLite)
setwd("R:\\American Community Survey\\Data\\2009")
in_csv <- file("test.csv")
out_db <- dbConnect(SQLite(), dbname="sqlite.db")
dbGetQuery(out_db , "create table test (hello integer, world text)")
dbGetQuery(out_db , "import in_csv test")


Any advice would be sincerely appreciated.

Thanks!

Anthony Damico
Kaiser Family Foundation

