On Mon, Sep 30, 2013 at 9:45 AM, Milan Bouchet-Valat <nalimi...@club.fr> wrote:
> On Monday 30 September 2013 at 08:38 -0500, Joshua Ulrich wrote:
>> On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimi...@club.fr>
>> wrote:
>> > Hi!
>> >
>> >
>> > It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
>> > quoted integers as an acceptable value for columns for which
>> > colClasses="integer". But when colClasses is omitted, these columns are
>> > read as integer anyway.
>> >
>> > For example, let's consider a file named file.dat, containing:
>> > "1"
>> > "2"
>> >
>> >> read.table("file.dat", colClasses="integer")
>> > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
>> > :
>> > scan() expected 'an integer' and got '"1"'
>> >
>> > But:
>> >> str(read.table("file.dat"))
>> > 'data.frame': 2 obs. of 1 variable:
>> > $ V1: int 1 2
>> >
>> > The latter result is indeed documented in ?read.table:
>> > Unless ‘colClasses’ is specified, all columns are read as
>> > character columns and then converted using ‘type.convert’ to
>> > logical, integer, numeric, complex or (depending on ‘as.is’)
>> > factor as appropriate. Quotes are (by default) interpreted in all
>> > fields, so a column of values like ‘"42"’ will result in an
>> > integer column.
>> >
>> >
>> > Should the former behavior be considered a bug?
>> >
>> No. If you tell read.table the column is integer and it's actually
>> character on disk, it should be an error.
> All values in a CSV file are stored as characters on disk, disregarding
> the fact that they are surrounded by quotes or not. 1 is saved as
> 00110001 (ASCII character #49), not 00000001, nor 00000000 00000000
> 00000000 00000001 (as would for example imply a 32 bit storage of
> integers).
>
Yes, I'm aware that write.table creates a character representation of
the data on disk. That's its purpose. writeBin is for writing actual
binary representations.
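To make the distinction concrete, here is a minimal sketch (not part of the original exchange) comparing the bytes of the text representation with those produced by writeBin. The variable names are illustrative, and endian = "little" is set explicitly so the byte order does not depend on the platform.

```r
# Bytes of the character representation, as a text file would store the value.
text_bytes <- charToRaw("1")        # 31 (hex), i.e. ASCII character #49

# The same value with surrounding quotes, as in the file from the report.
quoted_bytes <- charToRaw('"1"')    # 22 31 22

# Bytes of the binary representation of the integer 1 (4 bytes by default).
bin_bytes <- writeBin(1L, raw(), endian = "little")   # 01 00 00 00

print(text_bytes)
print(quoted_bytes)
print(bin_bytes)
```

Both text forms are character data on disk; the only difference between them is the pair of quote bytes (22) around the digit.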

By "actually character on disk" I meant "actually a quoted value". I
assumed that intent would be clear.

read.table uses scan to read the file. ?scan says:

     The allowed input for a numeric field is optional whitespace
     followed either ‘NA’ or an optional sign followed by a decimal or
     hexadecimal constant (see NumericConstants), or ‘NaN’, ‘Inf’ or
     ‘infinity’ (ignoring case). Out-of-range values are recorded as
     ‘Inf’, ‘-Inf’ or ‘0’.

     For an integer field the allowed input is optional whitespace,
     followed by either ‘NA’ or an optional sign and one or more digits
     (‘0-9’): all out-of-range values are converted to ‘NA_integer_’.

There's no mention of quotes being allowed.

> So, with all due respect, please refrain from formulating such blatantly
> erroneous statements.
>
So, with all due respect, please refrain from formulating such blatantly
pedantic responses to someone trying to help you.

>
> Regards
>
>
>> > This creates problems when combined with read.table.ffdf from package
>> > ff, since this function tries to guess the column classes by reading the
>> > first rows of the file, and then passes colClasses to read.table to read
>> > the remaining rows by chunks. A column of quoted integers is correctly
>> > detected as integer in the first read, but read.table() fails in
>> > subsequent reads.
>> >
>> This sounds like an issue with read.table.ffdf. The column of quoted
>> integers is *incorrectly* detected as integer because they're actually
>> character on disk. read.table.ffdf should rely on how the data are
>> actually stored on disk (via as.is=TRUE), not how read.table might
>> convert them once they're read into R.
>>
>> >
>> > Regards
>> >
>> > ______________________________________________
>> > R-devel@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>> --
>> Joshua Ulrich | about.me/joshuaulrich
>> FOSS Trading | www.fosstrading.com
>
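
For anyone who hits the same error, one possible workaround is sketched below. It is not something proposed in the thread: it recreates the two-line file from the report in a temporary location, reads the column as character so that the default quote handling strips the quotes, and then converts explicitly. The names path, strict and quoted_df are illustrative.

```r
# Recreate the two-line file from the original report in a temporary location.
path <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"'), path)

# Forcing an integer column is what raised the error reported for R 3.0.1.
# tryCatch() keeps the script running whether or not the call fails.
strict <- tryCatch(read.table(path, colClasses = "integer"),
                   error = function(e) conditionMessage(e))

# Reading as character lets the default quote handling strip the quotes;
# the conversion to integer is then an explicit, separate step.
quoted_df <- read.table(path, colClasses = "character")
quoted_df$V1 <- as.integer(quoted_df$V1)
str(quoted_df)

unlink(path)
```

The same two-step approach could in principle be applied chunk by chunk, at the cost of holding each chunk as character before conversion.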