On Thu, Jan 27, 2011 at 11:23 PM, H Roark <hrbuil...@hotmail.com> wrote:
>
> I need to import a large number of simple, space-delimited text files with a 
> few columns of data each. The one quirk is that some rows are missing data 
> and some contain junk text at the end of each line. A typical file might look 
> like:
>
> a b c d
> 1 2 3 x
> 4 5 6
> 7 8 9 x
> 1 2 3 x c c
> 4 5 6 x
> 7 8 9 x
>
> I'm trying to avoid having to pre-process the text files, as they all sit on 
> an ftp site that I don't manage.  My initial approach was just to read the 
> files using a read.table() statement with the arguments flush and fill set to 
> TRUE. For example, to import the above text file I tried:
>
> read.table(file="ftp://ftp.example.dta", header=T, row.names=NULL, fill=T, 
> flush=T)
>
> However, R throws the error "more columns than column names" and won't import 
> the file.
>
> Interestingly, if I move the extra text "c c" from line 5 to line 6 in the 
> data file, read.table() reads the file just fine, and ignores the "c c".  So, 
> my first question is, why does simply moving these data down a row solve this 
> problem?
>
> Next, I decided to try reading the file with the scan() function and it 
> worked perfectly:
>
> data.frame(scan(file="ftp://ftp.example.dta", what=list(a=0, b=0, c=0, d=""), 
> sep=" ", skip=1, flush=T, fill=T))
>
> I'm new to R, but as I understand it read.table() is based on the scan() 
> function. This makes me wonder if there is an additional argument I can add 
> to read.table() to make it import the file successfully, as scan() was able 
> to do.  Any help in this regard would be very much appreciated.  I'd also 
> really like to hear folks' perspectives on the merits of scan() versus 
> read.table() (e.g. when is scan() the best option?).
>
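On the first question: ?read.table documents that the number of data columns is determined from the first five lines of input (or from col.names, if that is given and is longer). With header=TRUE the "c c" row is the fifth line of the file, so six fields are found against four names; moved one row down it is no longer inspected, and flush=TRUE drops the extras. The sketch below is only an illustration: it assumes the sample data has been written to a temporary file, and the names tmp, n and res are made up for the example.

```r
# Write the sample data from the question to a temporary file (illustrative).
tmp <- tempfile(fileext = ".dat")
writeLines(c("a b c d",
             "1 2 3 x",
             "4 5 6",
             "7 8 9 x",
             "1 2 3 x c c",
             "4 5 6 x",
             "7 8 9 x"), tmp)

# Number of fields on each line; the six-field row is line 5 of the file.
n <- count.fields(tmp)
print(n)

# The call from the question fails because that row is among the lines inspected.
res <- tryCatch(read.table(tmp, header = TRUE, fill = TRUE, flush = TRUE),
                error = function(e) e)
print(conditionMessage(res))

# Supplying enough col.names up front fixes the width explicitly,
# whichever lines read.table looks at.
DF <- read.table(tmp, skip = 1, fill = TRUE,
                 col.names = paste0("V", seq_len(max(n))))
DF <- setNames(DF[1:4], c("a", "b", "c", "d"))
print(DF)
```

Using count.fields() means the file is read one extra time, which may matter for files fetched over ftp.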

Read the header line into nms, read the data into DF, and then put them
together, keeping only as many columns as there are header names:

con <- file("myfile.dat", "r")  # open now; otherwise scan() opens and closes con itself
nms <- scan(con, what = "", nlines = 1)
DF <- read.table(con, fill = TRUE)
close(con)
DF <- setNames(DF[seq_along(nms)], nms)

Or just read the file twice: first the header line alone, then the data with
the header skipped:

# colClasses = "character" keeps the header fields as plain strings
nms <- unlist(read.table("myfile.dat", nrows = 1, colClasses = "character"))
DF <- read.table("myfile.dat", fill = TRUE, skip = 1)
DF <- setNames(DF[seq_along(nms)], nms)
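
Since there are many files, the second variant could be wrapped in a small function and applied with lapply(). This is only a sketch: read_ragged is a hypothetical name, two temporary files stand in for the remote files, and flush = TRUE is carried over from the question so that junk appearing after the inspected lines is discarded.

```r
# Hypothetical helper (not from the thread): read the header first, then the
# data, and keep only the columns named in the header.
read_ragged <- function(path) {
  nms <- scan(path, what = "", nlines = 1, quiet = TRUE)
  DF <- read.table(path, skip = 1, fill = TRUE, flush = TRUE)
  setNames(DF[seq_along(nms)], nms)
}

# Two temporary files with the sample data stand in for the remote files.
sample_lines <- c("a b c d", "1 2 3 x", "4 5 6", "7 8 9 x",
                  "1 2 3 x c c", "4 5 6 x", "7 8 9 x")
paths <- c(tempfile(), tempfile())
for (p in paths) writeLines(sample_lines, p)

# One data frame per file.
all_data <- lapply(paths, read_ragged)
str(all_data[[1]])
```

One limitation of this sketch: if none of the inspected data lines has as many fields as the header, DF has fewer columns than nms and the subsetting step fails.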


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
