[R] Speeding reading of large file

Fisher Dennis Wed, 28 Nov 2012 09:44:49 -0800

R 2.15.1
OS X and Windows

Colleagues,


I have a file that looks like this (in the file, each record is a single line; 
the long lines are wrapped in this message):
TABLE NO.  1
 PTID        TIME        AMT         FORM        PERIOD      IPRED       CWRES  
     EVID        CP          PRED        RES         WRES
  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00  0.0000E+00  
0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00  3.3389E+00  
0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00  0.0000E+00
  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  5.8164E+00  
0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00  0.0000E+00
  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00  8.3633E+00  
0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00  0.0000E+00
  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.0092E+01  
0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00  0.0000E+00
  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1490E+01  
0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00  0.0000E+00
  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.2940E+01  
0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00  0.0000E+00
  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1267E+01  
0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00  0.0000E+00

The file is reasonably large (> 10^6 lines) and the two-line header is repeated 
periodically in the file.  
I need to read this file in as a data frame.  Note that the number of columns, 
the column headers, and the number of replicates of the headers are not known 
in advance.

I have tried two approaches to this:
        First Approach:  
                1.  readLines(FILENAME) to read in the file
                2.  use grep to find the repeat headers; strip out the repeat headers
                3.  write() the object to a tempfile, then read in that temporary file
                    using read.table(tempfile, header=TRUE, skip=1) [an alternative is
                    to use textConnection but that does not appear to speed things up]
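
For concreteness, the first approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the three-column stand-in file, the names NAMES.LINE, KEEP and TMP, and the rule for recognizing headers (a line starting with "TABLE NO", or an exact repeat of the column-name line) are all hypothetical. Unlike the steps above, it writes the column-name line once at the top of the temporary file, so skip=1 is not needed.

```r
# Small stand-in for the real file (only 3 columns, header repeated 3 times)
HDR  <- c("TABLE NO.  1", " PTID        TIME        AMT")
ROWS <- c("  2.0010E+03  3.9375E-01  5.0000E+03",
          "  2.0010E+03  8.9583E-01  5.0000E+03")
FILENAME <- tempfile()
writeLines(rep(c(HDR, ROWS), 3), FILENAME)

# 1. read the whole file as text
LINES <- readLines(FILENAME)

# 2. drop every "TABLE NO" line and every copy of the column-name line
#    (assumes the column-name line is repeated verbatim)
NAMES.LINE <- LINES[2]
KEEP <- !grepl("^TABLE NO", LINES) & LINES != NAMES.LINE

# 3. write one copy of the column names plus the data rows, then parse
TMP <- tempfile()
writeLines(c(NAMES.LINE, LINES[KEEP]), TMP)
DATA <- read.table(TMP, header=TRUE)
```

With the stand-in file, DATA should contain the six data rows in three numeric columns.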

        Second Approach:
                1.  TEMP <- read.table(FILENAME, header=TRUE, skip=1, fill=TRUE, as.is=TRUE)
                2.  get rid of the errant entries with:
                        TEMP <- TEMP[!is.na(as.numeric(TEMP[,1])),]
                3.  reading of the character entries forced all columns to 
                    character mode.  Therefore, I convert each column to numeric:
                        for (COL in 1:ncol(TEMP)) TEMP[,COL] <- as.numeric(TEMP[,COL])

The second approach is ~ 20% faster than the first.  With the second approach, 
the conversion to numeric occupies 50% of the elapsed time.
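
A minimal sketch of the second approach, with the column-by-column loop replaced by a single lapply() call. The three-column stand-in file and the name KEEP are hypothetical, and this has not been timed, so it is not known whether it is any faster than the loop.

```r
# Small stand-in for the real file (only 3 columns, header repeated 3 times)
HDR  <- c("TABLE NO.  1", " PTID        TIME        AMT")
ROWS <- c("  2.0010E+03  3.9375E-01  5.0000E+03",
          "  2.0010E+03  8.9583E-01  5.0000E+03")
FILENAME <- tempfile()
writeLines(rep(c(HDR, ROWS), 3), FILENAME)

# 1. read everything; the repeated headers force all columns to character
TEMP <- read.table(FILENAME, header=TRUE, skip=1, fill=TRUE, as.is=TRUE)

# 2. keep only rows whose first field parses as a number
#    (the header rows give NA, with a warning that is suppressed here)
KEEP <- !is.na(suppressWarnings(as.numeric(TEMP[[1]])))
TEMP <- TEMP[KEEP, , drop=FALSE]

# 3. convert all columns in one step instead of looping over column indices
TEMP[] <- lapply(TEMP, as.numeric)
```

Assigning to TEMP[] keeps the data frame structure and column names while replacing every column.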

Is there some approach that would be much faster?  For example, would a 
vectorized approach to conversion to numeric improve throughput?  Or, is there 
some means to ensure that all data are read as numeric (I tried to use 
colClasses but that triggered an error when the text string was encountered).
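
One possibility for reading everything directly as numeric, offered only as an untimed sketch: remove the header lines first, hand the remaining text to scan() with what=numeric(), and reshape the result into a data frame. The three-column stand-in file, the names IS.DATA, VALUES and NAMES, and the pattern used to recognize data lines (first non-blank character is a digit, sign or decimal point) are assumptions, as is the requirement that every data line has as many fields as the column-name line.

```r
# Small stand-in for the real file (only 3 columns, header repeated 3 times)
HDR  <- c("TABLE NO.  1", " PTID        TIME        AMT")
ROWS <- c("  2.0010E+03  3.9375E-01  5.0000E+03",
          "  2.0010E+03  8.9583E-01  5.0000E+03")
FILENAME <- tempfile()
writeLines(rep(c(HDR, ROWS), 3), FILENAME)

LINES <- readLines(FILENAME)

# Column names come from the second line of the file
NAMES <- scan(text=LINES[2], what="", quiet=TRUE)

# Data lines are assumed to start (after blanks) with a digit, sign or "."
IS.DATA <- grepl("^[[:space:]]*[-+0-9.]", LINES)

# Parse all remaining fields as numbers in one pass
VALUES <- scan(text=LINES[IS.DATA], what=numeric(), quiet=TRUE)

# Every data line must have one value per column
stopifnot(length(VALUES) %% length(NAMES) == 0)

DATA <- as.data.frame(matrix(VALUES, ncol=length(NAMES), byrow=TRUE,
                             dimnames=list(NULL, NAMES)))
```

Because scan() is told to expect numbers, no separate conversion step is required; a stray text field would stop it with an error rather than silently producing a character column.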

############################
A dput version of the data is:
c("TABLE NO.  1",
" PTID        TIME        AMT         FORM        PERIOD      IPRED       CWRES       EVID        CP          PRED        RES         WRES",
"  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00  3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00  8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00  0.0000E+00"
)

This can be assembled into a large dataset and written to a file named FILENAME 
with the following code:
cat(c("TABLE NO.  1",
" PTID        TIME        AMT         FORM        PERIOD      IPRED       CWRES       EVID        CP          PRED        RES         WRES",
"  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00  3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00  8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00  0.0000E+00",
"  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00  0.0000E+00",
"  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00  0.0000E+00"
)[rep(1:10, 1000)], file="FILENAME", sep="\n")


Dennis


Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

