On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle <mdo...@mdowle.plus.com> wrote: > How many columns, and of what type are the columns ? As Olga asked too, it > would be useful to know more about what you're really trying to do. > > 3.5m rows is not actually that many rows, even for 32bit R. Its depends on > the columns and what you want to do with those columns. > > At the risk of suggesting something before we know the full facts, one > possibility is to load the data from flat file into data.table. Use setkey() > to set your keys. Use tables() to summarise your various tables. Then do > your joins etc all-in-R. data.table has fast ways to do those sorts of > joins (but we need more info about your task). > > Alternatively, you could check out the sqldf website. There is an > sqlread.csv (or similar name) which can read your files directly into SQL
read.csv.sql > instead of going via R. Gabor has some nice examples there about that and > its faster. > > You use some buzzwords which makes me think that SQL may not be appropriate > for your task though. Can't say for sure (because we don't have enough > information) but its possible you are struggling because SQL has no row > ordering concept built in. That might be why you've created an increment In the SQLite database it automatically assigns a self incrementing hidden column called rowid to each row. e.g. using SQLite via the sqldf package on CRAN and the BOD data frame which is built into R we can display the rowid column explicitly by referring to it in our select statement: > library(sqldf) > BOD Time demand 1 1 8.3 2 2 10.3 3 3 19.0 4 4 16.0 5 5 15.6 6 7 19.8 > sqldf("select rowid, * from BOD") rowid Time demand 1 1 1 8.3 2 2 2 10.3 3 3 3 19.0 4 4 4 16.0 5 5 5 15.6 6 6 7 19.8 > field? Do your queries include "order by incrementing field"? SQL is not > good at "first" and "last" type logic. An all-in-R solution may well be In SQLite you can get the top 3 values, say, like this (continuing the prior example): > sqldf("select * from BOD order by Time desc limit 3") Time demand 1 7 19.8 2 5 15.6 3 4 16.0 > better, since R is very good with ordered vectors. A 1GB data.table (or > data.frame) for example, at 3.5m rows, could have 76 integer columns, or > 38 double columns. 1GB is well within 32bit and allows some space for > working copies, depending on what you want to do with the data. If you have > 38 or less columns, or you have 64bit, then an all-in-R solution *might* > get your task done quicker, depending on what your real goal is. > > If this sounds plausible, you could post more details and, if its > appropriate, and luck is on your side, someone might even sketch out how to > do an all-in-R solution. > > > "Nathan S. Watson-Haigh" <nathan.watson-ha...@csiro.au> wrote in message > news:4b5fde1b.10...@csiro.au... >>I have a table (contact) with several fields and it's PK is an auto >>increment field. I'm bulk loading data to this table from files which if >>successful will be about 3.5million rows (approx 16000 rows per file). >>However, I have a linking table (an_contact) to resolve a m:m relationship >>between the an and contact tables. How can I retrieve the PK's for the data >>bulk loaded into contact so I can insert the relevant data into an_contact. >> >> I currently load the data into contact using: >> dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE) >> >> But I then need to get all the PK's which this dbWriteTable() appended to >> the contact table so I can load the data into my an_contact link table. I >> don't want to issue a separate INSERT query for each row in dat and then >> use MySQLs LAST_INSERT_ID() function....not when I have 3.5million rows to >> insert! >> >> Any pointers welcome, >> Nathan >> >> -- >> -------------------------------------------------------- >> Dr. Nathan S. Watson-Haigh >> OCE Post Doctoral Fellow >> CSIRO Livestock Industries >> University Drive >> Townsville, QLD 4810 >> Australia >> >> Tel: +61 (0)7 4753 8548 >> Fax: +61 (0)7 4753 8600 >> Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html >> > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.