Hi,
Jason Thibodeau wrote:
> I am attempting to perform some simple data manipulation on a large data
> set. I have a snippet of the whole data set, and my small snippet is 2GB in
> CSV.
> Is there a way I can read my csv, select a few columns, and write it to an
> output file in real time? This is what I do right now to a small test file:
> data <- read.csv('data.csv', header = FALSE)
> data_filter <- data[c(1,3,4)]
> write.table(data_filter, file = "filter_data.csv", sep = ",", row.names =
> FALSE, col.names = FALSE)
In this case, I think R is not the best tool for the job. I would rather
suggest using an implementation of the awk language (e.g. gawk).
I just tried the following on WinXP with a zipped file (87MB zipped, 1.2GB
unzipped) piped into gawk:
unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt
and it took about 90 seconds.
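Please note that you will probably need to specify your delimiters, i.e. the
field separator (FS) and the output field separator (OFS). Setting them in a
BEGIN block makes them apply from the first line onward:
gawk 'BEGIN {FS=","; OFS=","} {print $1, $3, $4}' data.csv > filter_data.csv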
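To check the column selection before running it on the full 2GB file, it can
help to try it on a tiny file first. The following is only a sketch: the file
names sample.csv and sample_filtered.csv and their contents are made up for
illustration, and it calls plain awk (gawk should behave the same here). The
separators are set in a BEGIN block so that they are in effect before the
first record is read.

```shell
# Hypothetical three-line sample file, for illustration only.
printf 'a,b,c,d\n1,2,3,4\n5,6,7,8\n' > sample.csv

# BEGIN runs before any input is read, so FS and OFS already
# apply to the first record. Columns 1, 3 and 4 are kept.
awk 'BEGIN { FS = ","; OFS = "," } { print $1, $3, $4 }' sample.csv > sample_filtered.csv

# Show the result: each line should contain columns 1, 3 and 4.
cat sample_filtered.csv
```

awk processes its input one line at a time, so memory use should not grow
with the size of the file.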
I hope this helps (despite not encouraging the use of R),
Roland
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.