R 2.15.1
OS X and Windows
Colleagues,
I have a file that looks that this:
TABLE NO. 1
PTID TIME AMT FORM PERIOD IPRED CWRES
EVID CP PRED RES WRES
2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 3.3389E+00
0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 0.0000E+00 0.0000E+00
2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 5.8164E+00
0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 0.0000E+00 0.0000E+00
2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 8.3633E+00
0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 0.0000E+00 0.0000E+00
2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.0092E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 0.0000E+00 0.0000E+00
2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1490E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 0.0000E+00 0.0000E+00
2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.2940E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 0.0000E+00 0.0000E+00
2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1267E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 0.0000E+00 0.0000E+00
The file is reasonably large (> 10^6 lines) and the two line header is repeated
periodically in the file.
I need to read this file in as a data frame. Note that the number of columns,
the column headers, and the number of replicates of the headers are not known
in advance.
I have tried two approaches to this:
First Approach:
1. readLines(FILENAME) to read in the file
2. use grep to find the repeat headers; strip out the repeat
headers
3. write() the object to tempfile, read in that temporary file
using read.table(tempfile, header=TRUE, skip=1) [an alternative is to use
textConnection but that does not appear to speed things]
Second Approach:
1. TEMP <- read.table(FILENAME, header=TRUE, skip=1,
fill=TRUE, as.is=TRUE)
2. get rid of the errant entries with:
TEMP[!is.na(as.numeric(TEMP[,1])),]
3. reading of the character entries forced all columns to
character mode. Therefore, I convert each column to numeric:
for (COL in 1:ncol(TEMP)) TEMP[,COL] <-
as.numeric(TEMP[,COL])
The second approach is ~ 20% faster than the first. With the second approach,
the conversion to numeric occupies 50% of the elapsed time.
Is there some approach that would be much faster? For example, would a
vectorized approach to conversion to numeric improve throughput? Or, is there
some means to ensure that all data are read as numeric (I tried to use
colClasses but that triggered an error when the text string was encountered).
############################
A dput version of the data is:
c("TABLE NO. 1", " PTID TIME AMT FORM PERIOD
IPRED CWRES EVID CP PRED RES WRES",
" 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 3.3389E+00
0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 5.8164E+00
0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 8.3633E+00
0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.0092E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1490E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.2940E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1267E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 0.0000E+00 0.0000E+00"
)
This can be assembled into a large dataset and written to a file named FILENAME
with the following code:
cat(c("TABLE NO. 1", " PTID TIME AMT FORM PERIOD
IPRED CWRES EVID CP PRED RES
WRES",
" 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 3.3389E+00
0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 5.8164E+00
0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 8.3633E+00
0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.0092E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1490E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.2940E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1267E+01
0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 0.0000E+00 0.0000E+00"
)[rep(1:10, 1000)], file="FILENAME", sep="\n")
Dennis
Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.