Everything has slowed down with #1 and #3 by about 50%. I can't run #2 and #4:

> ta.num <- lapply(ta0, scan, sep = ",")
Error in file(file, "r") : unable to open connection

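The error above is consistent with scan() treating its first argument as a file name. A minimal sketch of one workaround, assuming a version of R whose scan() accepts a text argument (on versions without it, wrapping each string in textConnection() plays the same role). The sample vector ta is made up from the foo.csv example in this thread, the helper name parse_line is hypothetical, and the pattern uses [^,]* for both leading fields:

```r
# Two sample records in the shape of foo.csv (header line already dropped).
ta <- c("Foo,10,-0.5615,2.3065,0.1589",
        "Bar,21,0.0880,0.5733,0.0081,2.0253")

# Strip the first two comma-separated fields, leaving only the numeric data.
ta0 <- sub("^[^,]*,[^,]*,", "", ta)

# parse_line is a hypothetical helper: scan() reads from the string itself
# via text= instead of trying to open the string as a file.
parse_line <- function(s) scan(text = s, sep = ",", quiet = TRUE)

# One numeric vector per record, each with its own length.
ta.num <- lapply(ta0, parse_line)
```

Whether this is any faster than the strsplit() versions would still need to be timed on the real files.
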
scan seems to want a file or a connection ...

Gabor Grothendieck <[EMAIL PROTECTED]> wrote:

Could you time these and see how each of them does:

# 1
ta.split <- strsplit(ta, split = ",")
ta.num <- lapply(ta.split, function(x) as.numeric(x[-(1:2)]))

# 2
ta0 <- sub("^[^,]*,[^.]*,", "", ta)
ta.num <- lapply(ta0, scan, sep = ",")

# 3 - loop version of #1
n <- length(ta)
ta.split <- strsplit(ta, split = ",")
ta.num <- list(length = n)
for(i in 1:n) ta.num[[i]] <- as.numeric(ta.split[[i]][-(1:2)])

# 4 - loop version of #2
n <- length(ta)
ta0 <- sub("^[^,]*,[^.]*,", "", ta)
ta.num <- list(length = n)
for(i in 1:n) ta.num[[i]] <- scan(ta0[[i]])

On 12/6/05, John McHenry wrote:
> I should have mentioned that I already tried the readLines() approach:
>
> ta<-readLines("foo.csv")
> ptm<-proc.time()
> f<-character(length(ta))
> for (k in 2:length(ta)) { f[k-1]<-(strsplit(ta[k],",")[[1]])[3] } # <- PARSING EACH LINE AT THIS LEVEL IS WHERE THE REAL INEFFICIENCY IS
> (proc.time()-ptm)[3]
> [1] 102.75
>
> on a 62M file, so I'm guessing that on my 1GB files this will be about
>
> > (102.75*(1000/61))/60
> [1] 28.07377
>
> minutes...which is way, way too long.
>
> I'm new to R but I'm kind of surprised that this problem isn't well known
> (couldn't find anything after a long hunt).
>
> As I mentioned, MATLAB does it using textread, which makes a call to its dll
> dataread. The data are read using something like:
>
> [name, startMonth, data]=textread(fileName,'%s%n%[^\n]', 'delimiter',',',
> 'bufsize', 1000000, 'headerlines',1);
>
> which is kind of fscanf-like. data in the above is then a cell array with
> each cell being the variable-length data.
>
> "Liaw, Andy" wrote:
> Using a file() connection in conjunction with readLines() and strsplit() should
> do it. I would try to count the number of lines in the file first, and
> create a list with that many components, then fill it in.
> I believe the "array of cells" in Matlab is sort of equivalent to a list in R, but that's
> beyond my knowledge of Matlab...
>
> Andy
>
> From: John McHenry
> >
> > I have very large csv files (up to 1GB each of ASCII text).
> > I'd like to be able to read them directly into R. The
> > problem I am having is with the variable length of the data
> > in each record.
> >
> > Here's a (simplified) example:
> >
> > $ cat foo.csv
> > Name,Start Month,Data
> > Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
> > Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114
> >
> > The records consist of rows with some set comma-separated
> > fields (e.g. the "Name" & "Start Month" fields in the above)
> > and then the data follow as a variable-length list of
> > comma-separated values until a new line is encountered.
> >
> > Now I can use e.g.
> >
> > fileName="foo.csv"
> > ta<-read.csv(fileName, header=F, skip=1, sep=",", dec=".", fill=T)
> >
> > which does the job nicely:
> >
> >    V1 V2      V3     V4     V5      V6      V7     V8    V9    V10    V11    V12    V13     V14     V15    V16     V17
> > 1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA    NA     NA     NA     NA     NA      NA      NA     NA      NA
> > 2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765 0.281 1.8546 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114
> >
> > but the problem is that with files on the order of 1GB this
> > either crunches forever or runs out of memory trying ...
> > plus having all those NAs isn't too pretty to look at.
> >
> > (I have a MATLAB version that can read this stuff into an
> > array of cells in about 3 minutes).
> >
> > I really want a fast way to read the data part into a list;
> > that way I can access data in the array of lists containing
> > the records by doing something like ta[[i]]$data.
> >
> > Ideas?
> >
> > Thanks,
> >
> > Jack.
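
The request above, a list whose elements hold each record's fixed fields plus its variable-length data, can be sketched along the lines Andy suggests: readLines() followed by a single vectorised strsplit(). This is only a sketch under stated assumptions: read_records and the field names name, startMonth and data are hypothetical, it assumes no quoted fields containing commas, and it makes no claim about speed on 1GB files.

```r
# read_records is a hypothetical helper, not part of base R.
read_records <- function(con) {
  lines <- readLines(con)[-1]                  # drop the header line
  parts <- strsplit(lines, ",", fixed = TRUE)  # one split call for all lines
  lapply(parts, function(p) {
    list(name       = p[1],
         startMonth = as.integer(p[2]),
         data       = as.numeric(p[-(1:2)]))   # everything after field 2
  })
}

# Usage with an in-memory copy of a shortened foo.csv example.
csv <- c("Name,Start Month,Data",
         "Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955",
         "Bar,21,0.0880,0.5733,0.0081")
tc <- textConnection(csv)
ta <- read_records(tc)
close(tc)

ta[[1]]$data   # the variable-length data of the first record
```

Passing file("foo.csv") instead of the text connection would read from disk in the same way.
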

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html