I am trying to understand the behaviour of read.table() when reading delimited files (with header=TRUE and fill=TRUE) that have more (possibly spurious) columns than headings. I give below four small data files, each of which has one or two extra columns added to a single line. Reading the first file produces an error message, the second produces a column of NA, the third adds an extra row, and the fourth ignores the extra column with no message and no NA. Most unintuitive! Here are my attempts to understand this, with questions interpolated.
The behaviour on the first file seems self-explanatory. The number of headings determines the number of columns, and extra data columns are not allowed. (On the other hand, the help ?read.table says that the number of columns is determined from the first five rows, which suggests that the header line is not the only determiner. If headers, when present, are indeed the only determiner, perhaps this should be mentioned in the help. Are headers actually equivalent to specifying the same set of names using the col.names argument?)

For the second file, the first column is being taken as row names. This agrees with the help, which says that if "the header line has one less entry than the number of columns, the first column is taken to be the row names". OK, perhaps not the ideal solution for this data file, but clearly documented behaviour.

In the third file, the extra columns are being taken to be a new row. This seems wrong, because the help says that cases correspond to lines. There is no suggestion in the documentation that a line of the file could contain multiple cases. This is the result I have most trouble with. I guess I could prevent this behaviour with flush=TRUE.

File 4 is curious. Here the number of columns has been determined, using the first 5 rows of the file, to be two. The extra column on line 6 can't change this, so the first column doesn't become row names. But in that case, shouldn't the extra column found on line 6 produce an error message, the same as for file 1?

Specifying colClasses to be a vector of length more than 2 when reading file 3 produces a result similar to that for file 4, but with a warning. It is not clear to me why colClasses should have an influence, since it doesn't change the determination of the number of columns. Why a warning here, but an error for file 1 and no message for file 4?

Any comments gratefully received.
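The "first five rows" reasoning above can be checked directly by counting the fields on each line with count.fields() from utils. This is only a minimal sketch: it assumes the file contents are exactly as listed in this post and reads them from text connections rather than from disk. The names files, count_line_fields, field_counts and summary_table are my own, chosen for illustration.

```r
# Contents of the four test files, copied from this post.
files <- list(
  test1 = c("X,Y", "a,2", "b,4,,", "c,6"),
  test2 = c("X,Y", "a,2", "b,4,",  "c,6"),
  test3 = c("X,Y", "a,2", "b,4", "c,6", "d,8", "e,10,,", "f,12"),
  test4 = c("X,Y", "a,2", "b,4", "c,6", "d,8", "e,10,",  "f,12")
)

# Count the comma-separated fields on every line of one file.
# (Hypothetical helper; it only wraps utils::count.fields.)
count_line_fields <- function(lines) {
  con <- textConnection(lines)
  on.exit(close(con))
  count.fields(con, sep = ",")
}

field_counts <- lapply(files, count_line_fields)

# Compare the header width with the widest line among the first five
# lines and with the widest line in the whole file.
summary_table <- data.frame(
  header = sapply(field_counts, function(n) n[1]),
  first5 = sapply(field_counts, function(n) max(head(n, 5))),
  whole  = sapply(field_counts, max)
)

print(field_counts)
print(summary_table)
```

If the counts come out as I expect, first5 exceeds header by two for test1 and by one for test2, while for test3 and test4 the wide line falls outside the first five lines. That would line up with the error for file 1, the row-names interpretation for file 2, and the two cases where the extra fields are only met while the data are being read.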
Gordon

test1.txt:

X,Y
a,2
b,4,,
c,6

test2.txt:

X,Y
a,2
b,4,
c,6

test3.txt:

X,Y
a,2
b,4
c,6
d,8
e,10,,
f,12

test4.txt:

X,Y
a,2
b,4
c,6
d,8
e,10,
f,12

> read.csv("test1.txt")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
        more columns than column names

> read.csv("test2.txt")
  X  Y
a 2 NA
b 4 NA
c 6 NA

> read.csv("test3.txt")
  X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6   NA
7 f 12

> read.csv("test4.txt")
  X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6 f 12

> read.csv("test3.txt",colClasses=c(NA,NA))
  X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6   NA
7 f 12

> read.csv("test3.txt",colClasses=c(NA,NA,NA,NA))
  X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6 f 12
Warning message:
cols = 2 != length(data) = 4 in: read.table(file = file, header = header, sep = sep, quote = quote,

> sessionInfo()
R version 2.4.0 Under development (unstable) (2006-07-25 r38698)
i386-pc-mingw32

locale:
LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252

attached base packages:
[1] "methods" "stats" "graphics" "grDevices" "utils" "datasets" "base"

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel