On 12/6/05, John McHenry <[EMAIL PROTECTED]> wrote:
>
> Everything has slowed down with #1 and #3 by about 50%. Can't do #2 & #4:
>
>   ta.num <- lapply(ta0, scan, sep = ",")
>   Error in file(file, "r") : unable to open connection
>
> scan seems to want a file or a connection ...
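(A side note, a sketch not from the thread: the error above arises because
scan() treats a bare character string as a file name to open. Wrapping each
string in a textConnection() avoids it; ta0 below is a made-up stand-in for
the real data.)

  # scan() opens a bare string as a file path, hence the error above;
  # a textConnection() per line sidesteps this. ta0 is a stand-in example.
  ta0 <- c("-0.5615,2.3065,0.1589", "0.0880,0.5733")
  ta.num <- lapply(ta0, function(x) {
    con <- textConnection(x)
    on.exit(close(con))
    scan(con, sep = ",", quiet = TRUE)
  })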
Building on Andy's variation:

  n <- length(ta)
  ta.sub <- sub("^[^,]*,[^,]*,", "", ta)
  ta.con <- textConnection(ta.sub)
  out <- replicate(n, scan(ta.con, nlines = 1, sep = ","), simplify = FALSE)
  close(ta.con)

Also consider writing ta.sub back out to disk and defining ta.con as a file
connection to that file; both would need to be timed to determine which is
faster.

> Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> Could you time these and see how each of these does:
>
> # 1
> ta.split <- strsplit(ta, split = ",")
> ta.num <- lapply(ta.split, function(x) as.numeric(x[-(1:2)]))
>
> # 2
> ta0 <- sub("^[^,]*,[^,]*,", "", ta)
> ta.num <- lapply(ta0, scan, sep = ",")
>
> # 3 - loop version of #1
> n <- length(ta)
> ta.split <- strsplit(ta, split = ",")
> ta.num <- vector("list", n)
> for (i in 1:n) ta.num[[i]] <- as.numeric(ta.split[[i]][-(1:2)])
>
> # 4 - loop version of #2
> n <- length(ta)
> ta0 <- sub("^[^,]*,[^,]*,", "", ta)
> ta.num <- vector("list", n)
> for (i in 1:n) ta.num[[i]] <- scan(ta0[i], sep = ",")
>
> > On 12/6/05, John McHenry wrote:
> > I should have mentioned that I already tried the readLines() approach:
> >
> > ta <- readLines("foo.csv")
> > ptm <- proc.time()
> > f <- character(length(ta))
> > for (k in 2:length(ta)) {
> >   # <- PARSING EACH LINE AT THIS LEVEL IS WHERE THE REAL INEFFICIENCY IS
> >   f[k - 1] <- strsplit(ta[k], ",")[[1]][3]
> > }
> > (proc.time() - ptm)[3]
> > [1] 102.75
> >
> > on a 62M file, so I'm guessing that on my 1GB files this will be about
> >
> > (102.75 * (1000/61)) / 60
> > [1] 28.07377
> >
> > minutes ... which is way, way too long.
> >
> > I'm new to R, but I'm kind of surprised that this problem isn't well
> > known (I couldn't find anything after a long hunt).
> >
> > As I mentioned, MATLAB does it using textread, which makes a call to its
> > dll dataread. The data are read using something like:
> >
> > [name, startMonth, data] = textread(fileName, '%s%n%[^\n]',
> >     'delimiter', ',', 'bufsize', 1000000, 'headerlines', 1);
> >
> > which is kind of fscanf-like. data in the above is then a cell array,
> > with each cell holding the variable-length data.
> >
> > "Liaw, Andy" wrote:
> > Using a file() connection in conjunction with readLines() and strsplit()
> > should do it. I would try to count the number of lines in the file
> > first, and create a list with that many components, then fill it in. I
> > believe the "array of cells" in Matlab is sort of equivalent to a list
> > in R, but that's beyond my knowledge of Matlab...
> >
> > Andy
> >
> > From: John McHenry
> > >
> > > I have very large csv files (up to 1GB each of ASCII text). I'd like
> > > to be able to read them directly into R. The problem I am having is
> > > with the variable length of the data in each record.
> > >
> > > Here's a (simplified) example:
> > >
> > > $ cat foo.csv
> > > Name,Start Month,Data
> > > Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
> > > Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114
> > >
> > > The records consist of rows with some set of comma-separated fields
> > > (e.g. the "Name" & "Start Month" fields in the above), and then the
> > > data follow as a variable-length list of comma-separated values until
> > > a newline is encountered.
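(A quick sketch, not from the original post, of what one such record looks
like once split: fixed fields first, then the variable-length data. The
record text is borrowed from the foo.csv example above.)

  # Split one foo.csv record into its fixed fields and variable tail:
  rec <- "Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955"
  parts <- strsplit(rec, ",", fixed = TRUE)[[1]]
  list(name       = parts[1],
       startMonth = as.numeric(parts[2]),
       data       = as.numeric(parts[-(1:2)]))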
> > > Now I can use e.g.
> > >
> > > fileName <- "foo.csv"
> > > ta <- read.csv(fileName, header = FALSE, skip = 1, sep = ",",
> > >                dec = ".", fill = TRUE)
> > >
> > > which does the job nicely:
> > >
> > >    V1 V2      V3     V4     V5      V6      V7     V8     V9    V10
> > > 1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA     NA     NA
> > > 2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765 0.2810 1.8546
> > >      V11    V12    V13     V14     V15    V16     V17
> > > 1     NA     NA     NA      NA      NA     NA      NA
> > > 2 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114
> > >
> > > but the problem is that with files on the order of 1GB this either
> > > crunches forever or runs out of memory trying ... plus having all
> > > those NAs isn't too pretty to look at.
> > >
> > > (I have a MATLAB version that can read this stuff into an array of
> > > cells in about 3 minutes.)
> > >
> > > I really want a fast way to read the data part into a list; that way
> > > I can access the data in the array of lists containing the records by
> > > doing something like ta[[i]]$data.
> > >
> > > Ideas?
> > >
> > > Thanks,
> > >
> > > Jack.
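(For reference, a sketch combining Andy's preallocate-a-list suggestion with
the strsplit() approach from #1/#3, producing the ta[[i]]$data structure John
asks for. It assumes foo.csv as in the example above and has not been timed
on 1GB files.)

  fileName <- "foo.csv"
  ta <- readLines(fileName)[-1]        # drop the "Name,Start Month,Data" header
  n <- length(ta)
  ta.split <- strsplit(ta, ",", fixed = TRUE)
  records <- vector("list", n)         # preallocate, per Andy's suggestion
  for (i in 1:n) {
    x <- ta.split[[i]]
    records[[i]] <- list(name       = x[1],
                         startMonth = as.numeric(x[2]),
                         data       = as.numeric(x[-(1:2)]))
  }
  records[[1]]$data                    # variable-length numeric vector per record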