I've been trying to figure out how to read in a large file for a few days
now, and after extensive research I'm still not sure what to do.

I have a large comma-delimited text file that contains 59 fields in each
record.
There is also a header line every 121 records.

This function works well for smallish files:
getcsv <- function(fname) {
    ff <- file(description = fname)
    x <- readLines(ff)
    close(ff)                    # close only this connection, not all of them
    x <- x[x != ""]              # REMOVE BLANKS
    x <- x[grep("^[-0-9]", x)]   # REMOVE ALL TEXT (header lines)

    spl <- strsplit(x, ",")      # THIS PART IS SLOW, BUT MANAGEABLE

    # Convert each record to numeric and drop the fields that are not numbers
    xx <- t(sapply(seq_along(spl), function(temp) as.vector(na.omit(as.numeric(spl[[temp]])))))
    return(xx)
}
It's not elegant, but it works.
For 121,000 records it completes in 2.3 seconds.
For 121,000*5 records it completes in 63 seconds.
For 121,000*10 records it doesn't complete.
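
A possible variation on the parsing step, offered only as a sketch: convert
all fields in one vectorized call instead of one sapply() call per record.
The name parse_records and the demo input are hypothetical, and the sketch
assumes every record holds the same number of numeric fields (which the
t(sapply(...)) version above also needs in order to return a matrix). It has
not been timed against the real file.

```r
# Hypothetical helper: parse already-filtered record lines into a numeric matrix.
parse_records <- function(x) {
    # Split every line at once and flatten into a single character vector
    fields <- unlist(strsplit(x, ",", fixed = TRUE))
    # Non-numeric and empty fields become NA; they are dropped below
    vals <- suppressWarnings(as.numeric(fields))
    vals <- vals[!is.na(vals)]
    # Assumption: each record contributes the same number of numeric fields
    stopifnot(length(x) > 0, length(vals) %% length(x) == 0)
    matrix(vals, nrow = length(x), byrow = TRUE)
}

# Made-up demo input with empty fields, standing in for real records
demo <- c("1,2,,3", "4,,5,6", "-7,8,9,")
parsed <- parse_records(demo)
```

Here parsed should be a 3 x 3 matrix, one row per input line.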

When I try other methods to read the file in chunks (using scan), the
process breaks down because I have to start at the beginning of the file on
every iteration.
For example:
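fnn <- function(n, col) {
    # fname is a global variable holding the path to the file
    a <- 122 * (n - 1) + 2       # first data line of block n (121 records + 1 header)
    xx <- scan(fname, skip = a - 1, nlines = 121, sep = ",", quiet = TRUE, what = character(0))
    xx <- xx[xx != ""]
    xx <- matrix(xx, ncol = 49, byrow = TRUE)
    xx[, col]
}
system.time(sapply(1:10, fnn, col = 26))     # 0.31 seconds
system.time(sapply(91:100, fnn, col = 26))   # 1.09 seconds
system.time(sapply(901:910, fnn, col = 26))  # 5.78 seconds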

Even though I'm only getting the 26th column for 10 sets of records, it
takes a lot longer the further into the file I go.

How can I tell scan to pick up where it left off, without starting at the
beginning of the file again? There must be a good example somewhere.
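
To make the question concrete, here is a minimal sketch of the kind of thing
being asked for. It relies on the fact that readLines() and scan() read from
the current position when they are given a connection that is already open,
so the file is not rescanned on each call. The function name read_blocks, its
arguments, and the small demo file (4 records of 5 fields per block instead
of 121 records of 59 fields) are all hypothetical.

```r
# Build a small demo file: 3 blocks, each one header line plus 4 numeric records
block_records <- 4   # stands in for 121
n_fields <- 5        # stands in for 59
demo_lines <- unlist(lapply(1:3, function(b) {
    c("id,a,b,c,d",
      vapply(1:block_records,
             function(r) paste(b * 1000 + r * 10 + 1:n_fields, collapse = ","),
             character(1)))
}))
demo_file <- tempfile(fileext = ".csv")
writeLines(demo_lines, demo_file)

# Hypothetical helper: read n_blocks blocks sequentially, returning one column
read_blocks <- function(path, n_blocks, col, block_records, n_fields) {
    con <- file(path, open = "r")   # open once; the read position persists
    on.exit(close(con))
    out <- vector("list", n_blocks)
    for (i in seq_len(n_blocks)) {
        header <- readLines(con, n = 1)       # consume the block's header line
        if (length(header) == 0) break        # end of file reached
        vals <- scan(con, what = numeric(), sep = ",",
                     nlines = block_records, quiet = TRUE)
        m <- matrix(vals, ncol = n_fields, byrow = TRUE)
        out[[i]] <- m[, col]
    }
    unlist(out)
}

second_col <- read_blocks(demo_file, n_blocks = 3, col = 2,
                          block_records = block_records, n_fields = n_fields)
```

With a loop like this each block is read exactly once, so the cost of a block
should not depend on how far into the file it sits. Whether that holds for
the real file (with its empty fields) is untested.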

I have done a lot of research (in fact, thank you to Michael J. Crawley and
others for your help thus far).

Thanks,

Gene

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.