On Sun, Jun 14, 2009 at 09:21:24PM +0100, Ted Harding wrote:
> On 14-Jun-09 18:56:01, Gabor Grothendieck wrote:
> > If read.csv's colClasses= argument is NOT used then read.csv accepts
> > double-quoted numerics:
> >
> > > read.csv(stdin())
> > 0: A,B
> > 1: "1",1
> > 2: "2",2
> > 3:
> >   A B
> > 1 1 1
> > 2 2 2
> >
> > However, if colClasses is used then it seems that it does not:
> >
> > > read.csv(stdin(), colClasses = "numeric")
> > 0: A,B
> > 1: "1",1
> > 2: "2",2
> > 3:
> > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
> >   na.strings, :
> >   scan() expected 'a real', got '"1"'
> >
> > Is this really intended? I would have expected that a csv file in
> > which each field is surrounded with double quotes would be acceptable
> > in both cases. This may be documented as is, yet it seems undesirable
> > both from a consistency viewpoint and from the viewpoint that it
> > should be possible to double-quote fields in a csv file.
>
> Well, the default for colClasses is NA, for which ?read.csv says:
>     [...]
>     Possible values are 'NA' (when 'type.convert' is used),
>     [...]
> and then ?type.convert says:
>     This is principally a helper function for 'read.table'. Given a
>     character vector, it attempts to convert it to logical, integer,
>     numeric or complex, and failing that converts it to factor unless
>     'as.is = TRUE'. The first type that can accept all the non-missing
>     values is chosen.
>
> It would seem that type 'logical' won't accept integer (naively one
> might expect 1 --> TRUE, but see the experiment below), so the first
> acceptable type for "1" is integer, and that is what happens.
> So it is indeed documented (in the R[ecursive] sense of "documented" :))
>
> However, presumably when colClasses is used, type.convert() is not
> called, in which case R sees itself being asked to assign a character
> entity to a destination which it has been told shall be numeric.
> The default for as.is is
>     as.is = !stringsAsFactors
> but ?read.csv says that stringsAsFactors "is overridden by 'as.is'
> and 'colClasses', both of which allow finer control", so that
> wouldn't come to the rescue either.
>
> Experiment:
>     X <- logical(10)
>     class(X)
>     # [1] "logical"
>     X[1] <- 1
>     X
>     # [1] 1 0 0 0 0 0 0 0 0 0
>     class(X)
>     # [1] "numeric"
> So R has converted X from class 'logical' to class 'numeric' on being
> asked to assign a number to a logical; but in this case its hands were
> not tied by colClasses.
>
> Or am I missing something?!!
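The mechanism Ted describes can be checked directly. A minimal sketch, assuming scan() behaves as its documentation states (quote processing applies only to character fields) and using the text= argument in place of stdin() for reproducibility:

```r
## Default path: scan() reads the column as character, stripping the
## surrounding quotes, and type.convert() then picks the first type that
## accepts all non-missing values.
x <- scan(text = '"1"\n"2"', what = character(), quiet = TRUE)
x                              # quotes already removed: "1" "2"
type.convert(x, as.is = TRUE)  # first acceptable type is integer: 1 2

## colClasses = "numeric" path: scan() is given what = numeric(), so no
## quote stripping takes place and the literal field '"1"' fails to parse.
res <- tryCatch(scan(text = '"1"\n"2"', what = numeric(), quiet = TRUE),
                error = function(e) conditionMessage(e))
res                            # the "scan() expected 'a real'" error message
```

So the quotes are removed on the character-reading path before type.convert() ever sees the values, which is why the default call succeeds while the colClasses call fails.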
In my opinion, you explain how it happens that there is a difference in
behavior between read.csv(stdin(), colClasses = "numeric") and
read.csv(stdin()), but not why it is so. The algorithm "use the smallest
type which accepts all non-missing values" may equally well be applied
to the input values either literally or after removing the quotes. Is
there a reason why read.csv(stdin()) removes quotes from the input
values while read.csv(stdin(), colClasses = "numeric") does not?

Using double-quote characters is part of the definition of a CSV file;
see, for example,
http://en.wikipedia.org/wiki/Comma_separated_values
where one may find:

    Fields may always be enclosed within double-quote characters,
    whether necessary or not.

Petr.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
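A workaround consistent with Petr's reading of the CSV definition is to take the character-reading path explicitly, where scan() does strip the quotes, and convert afterwards. A sketch, assuming the text= argument of read.csv (added in later R versions) in place of stdin():

```r
## Read every column as character -- quotes are stripped on this path --
## then convert each column to numeric explicitly.
d <- read.csv(text = 'A,B\n"1",1\n"2",2', colClasses = "character")
d[] <- lapply(d, as.numeric)
str(d)   # both columns numeric: A = 1 2, B = 1 2
```

This trades one extra conversion step for input that tolerates double-quoted fields regardless of the intended column type.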